
The Second CUDA Program

2012-02-28 16:14
CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.


A kernel is defined using the __global__ declaration specifier and the number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<…>>> execution configuration syntax (see Appendix B.17).
Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.

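As a concrete illustration of __global__ and threadIdx, here is a minimal kernel in the style of the programming guide's vector-addition sketch (VecAdd and N are illustrative names; allocation and error checking are omitted):

```cuda
// Kernel definition: each of the N threads computes one element
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;  // unique thread ID within the block
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Launch the kernel with N threads in a single block;
    // thread i computes C[i]
    VecAdd<<<1, N>>>(A, B, C);
    ...
}
```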

Any call to a __global__ function must specify the execution configuration for that call.
The execution configuration defines the dimension of the grid and blocks that will be used to execute the function on the device, as well as the associated stream (see Section 3.2.5.5 for a description of streams).

Type: dim3

This type is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.

The execution configuration is specified by inserting an expression of the form <<< Dg, Db, Ns, S >>> between the function name and the parenthesized argument list, where:

Note: Ns and S are optional arguments and default to 0.

- Dg is of type dim3 (see Section B.3.2) and specifies the dimension and size of the grid, such that Dg.x * Dg.y * Dg.z equals the number of blocks being launched; Dg.z must be equal to 1 for devices of compute capability 1.x;

- Db is of type dim3 (see Section B.3.2) and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block;

- Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an external array as mentioned in Section B.2.3; Ns is an optional argument which defaults to 0;

- S is of type cudaStream_t and specifies the associated stream; S is an optional argument which defaults to 0.

As an example, a function declared as

__global__ void Func(float* parameter);

must be called like this:

Func<<< Dg, Db, Ns >>>(parameter);
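Putting the pieces together, a launch using all four configuration parameters might look like the following (Func matches the declaration above; the grid/block sizes and shared-memory amount are illustrative):

```cuda
// Illustrative launch using all four execution-configuration parameters
dim3 Dg(4, 4);        // 4*4*1 = 16 blocks (Dg.z defaults to 1)
dim3 Db(16, 16);      // 16*16*1 = 256 threads per block
size_t Ns = 256;      // bytes of dynamic shared memory per block
cudaStream_t S = 0;   // default stream

Func<<<Dg, Db, Ns, S>>>(parameter);
```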

The arguments to the execution configuration are evaluated before the actual function arguments and like the function arguments, are currently passed via shared memory to the device.

The function call will fail if Dg or Db are greater than the maximum sizes allowed for the device as specified in Appendix F, or if Ns is greater than the maximum amount of shared memory available on the device, minus the amount of shared memory required for static allocation, function arguments (for devices of compute capability 1.x), and the execution configuration.

The program below is quoted from: http://tech.it168.com/a2011/0705/1213/000001213209_all.shtml

#include <stdio.h>
#include <stdlib.h>

#define DATA_SIZE 1048576

int data[DATA_SIZE];

// Fill the array with random digits 0-9
void GenerateNumbers(int *number, int size)
{
    for (int i = 0; i < size; i++)
        number[i] = rand() % 10;
}

// Kernel: a single thread sums the squares of all elements
__global__ static void sumOfSquares(int *num, int *result)
{
    int sum = 0;
    for (int i = 0; i < DATA_SIZE; i++)
        sum += num[i] * num[i];
    *result = sum;
}

int main()
{
    GenerateNumbers(data, DATA_SIZE);

    int *gpudata, *result;
    cudaMalloc((void**)&gpudata, sizeof(int) * DATA_SIZE);
    cudaMalloc((void**)&result, sizeof(int));

    // Copy the input data from the CPU to the GPU
    cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice);

    // Launch the kernel with one block of one thread
    sumOfSquares<<<1, 1>>>(gpudata, result);

    int sum;
    // Copy the result back from the GPU to the CPU
    cudaMemcpy(&sum, result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(gpudata);
    cudaFree(result);
    printf("sum (GPU): %d\n", sum);

    // Verify against the same computation on the CPU
    sum = 0;
    for (int i = 0; i < DATA_SIZE; i++)
        sum += data[i] * data[i];
    printf("sum (CPU): %d\n", sum);

    // Pause so the console window stays open
    scanf("%d", &sum);
    return 0;
}