CUDA by Example, Chapter 3: Partial Translation and Practice, Extracting GPU Device Parameters
2013-10-09 15:48
This book really covers a lot of ground, and much of it overlaps with other CUDA books, so I am only translating the highlights. Time is money; come learn CUDA with me, and corrections are welcome. I have not had time to read chapters 1 and 2 carefully, so we start with chapter 3. I do not like being tied to someone else's code, so I will not use the book's header file: I rewrite every program myself, and skip the ones that are genuinely too dull.
//hello.cu
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    printf("Hello, World!\n");
    return 0;
}

This first program is not strictly a CUDA program; it merely includes the CUDA header. Compile it with: nvcc hello.cu -o hello
Run it with ./hello. Nothing is actually executed on the GPU.
The second program
#include <stdio.h>
#include <cuda.h>

__global__ void kernel(void) { }

int main(void)
{
    kernel<<<1,1>>>();
    printf("Hello, World!\n");
    return 0;
}

This program calls a device function. __global__ marks a function that is called from the CPU (host) but executed on the GPU (device). As for what the parameters inside the triple angle brackets mean, see the next chapter.
#include <stdio.h>
#include <cuda.h>

__global__ void add( int a, int b, int *c )
{
    *c = a + b;
}

int main(void)
{
    int c;
    int *dev_c;
    cudaMalloc( (void**)&dev_c, sizeof(int) );
    add<<<1,1>>>( 2, 7, dev_c );
    cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
    printf( "2 + 7 = %d\n", c );
    cudaFree( dev_c );
    return 0;
}

cudaMalloc() allocates memory on the GPU. cudaMemcpy() copies the result from the GPU back to the CPU with cudaMemcpyDeviceToHost, or copies input data from the CPU to the GPU with cudaMemcpyHostToDevice. cudaFree() releases GPU memory, the same idea as free() on the CPU, just for a different address space.
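The example above only copies the result back to the host. For symmetry, here is a minimal sketch of the opposite direction, copying input data up to the device before a kernel would use it; the names host_in and dev_in are mine, not the book's.

#include <stdio.h>
#include <cuda.h>

int main(void)
{
    int host_in[4] = { 1, 2, 3, 4 };   // data prepared on the CPU
    int *dev_in;
    cudaMalloc( (void**)&dev_in, sizeof(host_in) );
    // Same cudaMemcpy() call, opposite direction flag:
    cudaMemcpy( dev_in, host_in, sizeof(host_in), cudaMemcpyHostToDevice );
    /* ... launch a kernel that reads dev_in here ... */
    cudaFree( dev_in );
    return 0;
}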
The key part of this chapter, for me, is 3.3, querying the GPU (device). The point is this: if you do not have the datasheet for your GPU, or cannot be bothered to pull the card out and look, or simply want your program to adapt to a wider range of hardware, you can obtain the GPU's parameters programmatically. I will leave the book's long-winded passages to you and stick to what matters. Many machines today contain more than one GPU, especially environments that use graphics cards for computation, so the first step is to count them:
int count;
cudaGetDeviceCount(&count);

This returns the number of CUDA-capable devices in the machine.
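A side note: every CUDA runtime call, cudaGetDeviceCount() included, returns a cudaError_t. The book checks it with a HANDLE_ERROR macro from its book.h; since I am not using that header, a minimal hand-rolled check looks like this (my own sketch, not the book's macro):

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

int main(void)
{
    int count;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {   // report and bail out on failure
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    printf("The count of CUDA devices:%d\n", count);
    return 0;
}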
The cudaDeviceProp struct then exposes each card's capabilities. The listing below is the CUDA 3.0 version. The struct is predefined by the runtime, so you can use it in your own program without declaring it yourself.
struct cudaDeviceProp {
    char   name[256];                // device name
    size_t totalGlobalMem;           // total global memory, in bytes
    size_t sharedMemPerBlock;        // maximum shared memory usable by a thread block, in bytes; all blocks resident on a multiprocessor share it
    int    regsPerBlock;             // maximum number of 32-bit registers per block; all blocks resident on a multiprocessor share them
    int    warpSize;                 // warp size, in threads
    size_t memPitch;                 // maximum pitch, in bytes, allowed for memory copies of regions allocated through cudaMallocPitch()
    int    maxThreadsPerBlock;       // maximum number of threads per block
    int    maxThreadsDim[3];         // maximum size of each block dimension
    int    maxGridSize[3];           // maximum size of each grid dimension
    size_t totalConstMem;            // size of constant memory
    int    major;                    // major compute capability number
    int    minor;                    // minor compute capability number
    int    clockRate;                // clock frequency
    size_t textureAlignment;         // alignment requirement for textures
    int    deviceOverlap;            // whether the device can run a cudaMemcpy() and a kernel at the same time
    int    multiProcessorCount;      // number of multiprocessors on the device
    int    kernelExecTimeoutEnabled; // whether kernel execution time is limited at run time
    int    integrated;               // whether this GPU is integrated
    int    canMapHostMemory;         // whether host memory can be mapped into the CUDA device address space
    int    computeMode;              // compute mode
    int    maxTexture1D;             // maximum dimension of 1D textures
    int    maxTexture2D[2];          // maximum dimensions of 2D textures
    int    maxTexture3D[3];          // maximum dimensions of 3D textures
    int    maxTexture2DArray[3];     // maximum dimensions of 2D texture arrays
    int    concurrentKernels;        // whether the GPU supports executing multiple kernels concurrently
};

Example program:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

int main()
{
    int i;
    int count;
    cudaGetDeviceCount(&count);
    printf("The count of CUDA devices:%d\n", count);

    cudaDeviceProp prop;
    for (i = 0; i < count; i++) {
        cudaGetDeviceProperties(&prop, i);
        printf("\n---General Information for device %d---\n", i);
        printf("Name of the cuda device: %s\n", prop.name);
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("Clock rate: %d\n", prop.clockRate);
        printf("Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution): ");
        if (prop.deviceOverlap)
            printf("Enabled\n");
        else
            printf("Disabled\n");
        printf("Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): ");
        if (prop.kernelExecTimeoutEnabled)
            printf("Enabled\n");
        else
            printf("Disabled\n");

        printf("\n---Memory Information for device %d ---\n", i);
        printf("Total global mem in bytes: %ld\n", prop.totalGlobalMem);
        printf("Total constant Mem: %ld\n", prop.totalConstMem);
        printf("Max mem pitch for memory copies in bytes: %ld\n", prop.memPitch);
        printf("Texture Alignment: %ld\n", prop.textureAlignment);

        printf("\n---MP Information for device %d---\n", i);
        printf("Multiprocessor count: %d\n", prop.multiProcessorCount);
        printf("Shared mem per mp(block): %ld\n", prop.sharedMemPerBlock);
        printf("Registers per mp(block):%d\n", prop.regsPerBlock);
        printf("Threads in warp:%d\n", prop.warpSize);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max thread dimensions in a block:(%d,%d,%d)\n", prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max blocks dimensions in a grid:(%d,%d,%d)\n", prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("\n");

        printf("\nIs the device an integrated GPU:");
        if (prop.integrated) printf("Yes!\n"); else printf("No!\n");
        printf("Whether the device can map host memory into CUDA device address space:");
        if (prop.canMapHostMemory) printf("Yes!\n"); else printf("No!\n");
        printf("Device's computing mode:%d\n", prop.computeMode);
        printf("\n The maximum size for 1D textures:%d\n", prop.maxTexture1D);
        printf("The maximum dimensions for 2D textures:(%d,%d)\n", prop.maxTexture2D[0], prop.maxTexture2D[1]);
        printf("The maximum dimensions for 3D textures:(%d,%d,%d)\n", prop.maxTexture3D[0], prop.maxTexture3D[1], prop.maxTexture3D[2]);
        // printf("The maximum dimensions for 2D texture arrays:(%d,%d,%d)\n", prop.maxTexture2DArray[0], prop.maxTexture2DArray[1], prop.maxTexture2DArray[2]);
        printf("Whether the device supports executing multiple kernels within the same context simultaneously:");
        if (prop.concurrentKernels) printf("Yes!\n"); else printf("No!\n");
    }
    return 0;
}

Output:
The count of CUDA devices:1
---General Information for device 0---
Name of the cuda device: GeForce GTX 470
Compute capability: 2.0
Clock rate: 1215000
Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution): Enabled
Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): Enabled
---Memory Information for device 0 ---
Total global mem in bytes: 1341325312
Total constant Mem: 65536
Max mem pitch for memory copies in bytes: 2147483647
Texture Alignment: 512
---MP Information for device 0---
Multiprocessor count: 14
Shared mem per mp(block): 49152
Registers per mp(block):32768
Threads in warp:32
Max threads per block: 1024
Max thread dimensions in a block:(1024,1024,64)
Max blocks dimensions in a grid:(65535,65535,65535)
Is the device an integrated GPU:No!
Whether the device can map host memory into CUDA device address space:Yes!
Device's computing mode:0
The maximum size for 1D textures:65536
The maximum dimensions for 2D textures:(65536,65535)
The maximum dimensions for 3D textures:(2048,2048,2048)
Whether the device supports executing multiple kernels within the same context simultaneously:Yes!
Example machine 2:
The count of CUDA devices:2
---General Information for device 0---
Name of the cuda device: Tesla K20c
Compute capability: 3.5
Clock rate: 705500
Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution): Enabled
Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): Disabled
---Memory Information for device 0 ---
Total global mem in bytes: 5032706048
Total constant Mem: 65536
Max mem pitch for memory copies in bytes: 2147483647
Texture Alignment: 512
---MP Information for device 0---
Multiprocessor count: 13
Shared mem per mp(block): 49152
Registers per mp(block):65536
Threads in warp:32
Max threads per block: 1024
Max thread dimensions in a block:(1024,1024,64)
Max blocks dimensions in a grid:(2147483647,65535,65535)
Is the device an integrated GPU:No!
Whether the device can map host memory into CUDA device address space:Yes!
Device's computing mode:0
The maximum size for 1D textures:65536
The maximum dimensions for 2D textures:(65536,65536)
The maximum dimensions for 3D textures:(4096,4096,4096)
Whether the device supports executing multiple kernels within the same context simultaneously:Yes!
---General Information for device 1---
Name of the cuda device: GeForce GTX 480
Compute capability: 2.0
Clock rate: 1401000
Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution): Enabled
Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): Enabled
---Memory Information for device 1 ---
Total global mem in bytes: 1610153984
Total constant Mem: 65536
Max mem pitch for memory copies in bytes: 2147483647
Texture Alignment: 512
---MP Information for device 1---
Multiprocessor count: 15
Shared mem per mp(block): 49152
Registers per mp(block):32768
Threads in warp:32
Max threads per block: 1024
Max thread dimensions in a block:(1024,1024,64)
Max blocks dimensions in a grid:(65535,65535,65535)
Is the device an integrated GPU:No!
Whether the device can map host memory into CUDA device address space:Yes!
Device's computing mode:0
The maximum size for 1D textures:65536
The maximum dimensions for 2D textures:(65536,65535)
The maximum dimensions for 3D textures:(2048,2048,2048)
Whether the device supports executing multiple kernels within the same context simultaneously:Yes!
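Once you can read these properties, you can also go the other way round: fill a cudaDeviceProp with the features you need and let the runtime pick the closest matching card. Section 3.4 of the book does this to find a device with compute capability at least 1.3 (needed for double precision); the following is a minimal sketch along those lines, not the book's verbatim code.

#include <stdio.h>
#include <string.h>
#include <cuda.h>

int main(void)
{
    cudaDeviceProp prop;
    int dev;
    memset(&prop, 0, sizeof(cudaDeviceProp));
    prop.major = 1;                  // request compute capability >= 1.3
    prop.minor = 3;
    cudaChooseDevice(&dev, &prop);   // dev = index of the closest match
    printf("ID of the closest matching device: %d\n", dev);
    cudaSetDevice(dev);              // subsequent CUDA calls use this device
    return 0;
}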
Reference: CUDA by Example
http://blog.csdn.net/fulva/article/details/8757089