The Second CUDA Program
2012-02-28 16:14
CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.
A kernel is defined using the __global__ declaration specifier and the number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<…>>> execution configuration syntax (see Appendix B.17).
Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.
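As a minimal sketch of the two points above (the names VecAdd, A, B, C, and N are illustrative, not from the original post), a kernel in which each launched thread handles one array element:

```cuda
// Kernel definition: __global__ marks a function that runs on the device.
__global__ void VecAdd(float* A, float* B, float* C)
{
    // Each thread reads its own ID and computes one element.
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

// Host-side launch: N threads in one block execute VecAdd in parallel.
// VecAdd<<<1, N>>>(d_A, d_B, d_C);
```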
Any call to a __global__ function must specify the execution configuration for that call.
The execution configuration defines the dimension of the grid and blocks that will be used to execute the function on the device, as well as the associated stream (see Section 3.2.5.5 for a description of streams).
Type: dim3
This type is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.
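The default-to-1 rule can be seen in a short sketch (the kernel name is hypothetical):

```cuda
dim3 grid(16, 16);   // grid.x = 16, grid.y = 16, grid.z defaults to 1
dim3 block(256);     // block.x = 256; block.y and block.z default to 1

// Launches 16 * 16 = 256 blocks of 256 threads each:
// Kernel<<<grid, block>>>(...);
```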
The execution configuration is specified by inserting an expression of the form <<< Dg, Db, Ns, S >>> between the function name and the parenthesized argument list, where:
Note: Ns and S are optional arguments that default to 0.
Dg is of type dim3 (see Section B.3.2) and specifies the dimension and size of the grid, such that Dg.x * Dg.y * Dg.z equals the number of blocks being launched; Dg.z must be equal to 1 for devices of compute capability 1.x;
Db is of type dim3 (see Section B.3.2) and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block;
Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an
external array as mentioned in Section B.2.3; Ns is an optional argument which defaults to 0;
S is of type cudaStream_t and specifies the associated stream; S is an optional argument which defaults to 0.
As an example, a function declared as __global__ void Func(float* parameter);
must be called like this: Func<<< Dg, Db, Ns >>>(parameter);
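To show the two optional arguments as well, here is a sketch using dynamically allocated shared memory and a non-default stream (the kernel Reduce and its sizes are assumptions for illustration):

```cuda
// The extern array's size is not fixed at compile time; it is set
// per launch by the Ns argument of the execution configuration.
extern __shared__ float buffer[];

__global__ void Reduce(float* data)
{
    buffer[threadIdx.x] = data[threadIdx.x];
    // ... reduction over buffer ...
}

// 1 block of 128 threads, 128 floats of dynamic shared memory,
// issued into stream s instead of the default stream 0:
// cudaStream_t s;
// cudaStreamCreate(&s);
// Reduce<<<1, 128, 128 * sizeof(float), s>>>(d_data);
```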
The arguments to the execution configuration are evaluated before the actual function arguments and like the function arguments, are currently passed via shared memory to the device.
The function call will fail if Dg or Db are greater than the maximum sizes allowed for the device as specified in Appendix F, or if Ns is greater than the maximum amount of shared memory available on the device, minus the amount of shared memory required for static allocation, function arguments (for devices of compute capability 1.x), and the execution configuration.
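Because such a failed launch does not stop the host program by itself, it is worth checking for the error right after the launch; one common sketch (Kernel, Dg, Db, Ns stand in for a real launch):

```cuda
Kernel<<<Dg, Db, Ns>>>(args);

// cudaGetLastError reports launch errors such as an invalid
// execution configuration or too much requested shared memory.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));
```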
The program below is adapted from: http://tech.it168.com/a2011/0705/1213/000001213209_all.shtml
#include <stdio.h>
#include <stdlib.h>

#define DATA_SIZE 1048576

int data[DATA_SIZE];

// Fill the array with random digits 0-9.
void GenerateNumbers(int *number, int size)
{
    for (int i = 0; i < size; i++)
        number[i] = rand() % 10;
}

// Kernel: a single thread sums the squares of all elements.
__global__ static void sumOfSquares(int *num, int *result)
{
    int sum = 0;
    for (int i = 0; i < DATA_SIZE; i++)
        sum += num[i] * num[i];
    *result = sum;
}

int main()
{
    GenerateNumbers(data, DATA_SIZE);

    int *gpudata, *result;
    cudaMalloc((void**)&gpudata, sizeof(int) * DATA_SIZE);
    cudaMalloc((void**)&result, sizeof(int));

    // Copy the input data from the CPU to the GPU.
    cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice);

    // Launch the kernel: one block of one thread.
    sumOfSquares<<<1, 1>>>(gpudata, result);

    int sum;
    // Copy the result from the GPU back to the CPU.
    cudaMemcpy(&sum, result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(gpudata);
    cudaFree(result);
    printf("sum (GPU): %d\n", sum);

    // Verify against a CPU computation of the same sum.
    sum = 0;
    for (int i = 0; i < DATA_SIZE; i++)
        sum += data[i] * data[i];
    printf("sum (CPU): %d\n", sum);

    // Keep the console window open (for testing).
    scanf("%d", &sum);
    return 0;
}
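The kernel above uses only a single thread, so it does not exploit the parallelism the text describes. One way to apply the threadIdx material here (a sketch, not from the original post; THREAD_NUM and sumOfSquaresParallel are names introduced for illustration) is to give each thread a strided slice of the data and reduce the partial sums on the host:

```cuda
#define THREAD_NUM 256

// Each of THREAD_NUM threads sums a strided slice of the input and
// writes one partial result; result must now hold THREAD_NUM ints.
__global__ static void sumOfSquaresParallel(int *num, int *result)
{
    const int tid = threadIdx.x;
    int sum = 0;
    for (int i = tid; i < DATA_SIZE; i += THREAD_NUM)
        sum += num[i] * num[i];
    result[tid] = sum;
}

// Host side:
// sumOfSquaresParallel<<<1, THREAD_NUM>>>(gpudata, result);
// int partial[THREAD_NUM], sum = 0;
// cudaMemcpy(partial, result, sizeof(partial), cudaMemcpyDeviceToHost);
// for (int i = 0; i < THREAD_NUM; i++) sum += partial[i];
```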