
CUDA by Example, Chapter 3 (partial translation practice): Extracting GPU Device Parameters

2013-10-09 15:48
This book covers a great deal of material, much of which overlaps with other CUDA books, so I will only translate the highlights — time is money; let's learn CUDA together. Corrections are welcome. Since I haven't had time to read chapters 1 and 2 carefully, we start from Chapter 3. I prefer not to depend on the book's custom header files, so I rewrite every program without them; programs that are too trivial are skipped.
//hello.cu
#include <stdio.h>
#include <cuda.h>

int main( void ) {
    printf( "Hello, World!\n" );
    return 0;
}
This first program is not strictly a CUDA program; it merely includes the CUDA header. Compile with: nvcc hello.cu -o hello
Run with: ./hello — it performs no work on the GPU at all.
The second program:
#include <stdio.h>
#include <cuda.h>

__global__ void kernel( void ) {}

int main( void ) {
    kernel<<<1,1>>>();
    printf( "Hello, World!\n" );
    return 0;
}
This program calls a kernel function. __global__ means the function is called from the CPU (host) but executed on the GPU (device). As for what the parameters inside the triple angle brackets mean — see the next chapter.
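As a quick preview of what the book explains in the next chapter: the two launch parameters are the number of thread blocks and the number of threads per block. A minimal sketch of my own (not from the book) that makes the launch configuration visible — note that device-side printf requires compute capability 2.0 or later:

```
#include <stdio.h>
#include <cuda.h>

// Each thread prints its own block and thread index.
__global__ void kernel( void ) {
    printf( "block %d, thread %d\n", blockIdx.x, threadIdx.x );
}

int main( void ) {
    kernel<<<2,3>>>();          // 2 blocks x 3 threads each = 6 threads total
    cudaDeviceSynchronize();    // wait for the kernel (and its printf) to finish
    return 0;
}
```

Running this prints six lines, one per thread, in no guaranteed order.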
#include <stdio.h>
#include <cuda.h>
__global__ void add( int a, int b, int *c ) {
        *c = a + b;
}
int main( void ) 
{       
        int c;
        int *dev_c;
        cudaMalloc( (void**)&dev_c, sizeof(int) );
        add<<<1,1>>>( 2, 7, dev_c );
        cudaMemcpy( &c,dev_c,sizeof(int),cudaMemcpyDeviceToHost );
        printf( "2 + 7 = %d\n", c );
        cudaFree( dev_c );
        return 0; 
}
cudaMalloc() allocates memory on the GPU. cudaMemcpy() copies data between host and device: with cudaMemcpyDeviceToHost it copies results from the GPU back to the CPU, and with cudaMemcpyHostToDevice it copies input data from the CPU to the GPU. cudaFree() releases GPU memory — the same idea as free() on the CPU, just operating on device memory.
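Every CUDA runtime call above returns a cudaError_t, which the example silently ignores. A common defensive pattern (this macro is my own sketch, not from the book) is to wrap each call:

```
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Abort with file/line information if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf( stderr, "CUDA error %s at %s:%d\n",            \
                     cudaGetErrorString(err), __FILE__, __LINE__ ); \
            exit( EXIT_FAILURE );                                   \
        }                                                           \
    } while (0)

__global__ void add( int a, int b, int *c ) { *c = a + b; }

int main( void ) {
    int c;
    int *dev_c;
    CUDA_CHECK( cudaMalloc( (void**)&dev_c, sizeof(int) ) );
    add<<<1,1>>>( 2, 7, dev_c );
    CUDA_CHECK( cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost ) );
    printf( "2 + 7 = %d\n", c );
    CUDA_CHECK( cudaFree( dev_c ) );
    return 0;
}
```

With the macro in place, a bad allocation or copy fails loudly instead of producing garbage output.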
The key part of this chapter (for me) is Section 3.3, querying the GPU (device). The idea: if you don't have your GPU's spec sheet, or don't want to pull the card out to look at it, or you want your program to adapt to different hardware environments, you can obtain the GPU's parameters programmatically. I'll skip the filler and cover only the useful part. Many machines today contain more than one GPU — especially systems built for GPU computing — so we can call:
int count;
cudaGetDeviceCount(&count);
to obtain the number of CUDA-capable devices in the system.
The cudaDeviceProp struct then exposes each device's properties. The listing below corresponds to CUDA 3.0; the struct is declared by the CUDA runtime headers, so you can use it directly in your program without defining it yourself.
struct cudaDeviceProp {
    char name[256];           // device name
    size_t totalGlobalMem;    // total global memory, in bytes
    size_t sharedMemPerBlock; // max shared memory available to a thread block, in bytes; shared by all blocks resident on a multiprocessor
    int regsPerBlock;         // max number of 32-bit registers available to a thread block; shared by all blocks resident on a multiprocessor
    int warpSize;             // warp size, in threads
    size_t memPitch;          // max pitch allowed for memory copies, in bytes, for copy functions involving regions allocated through cudaMallocPitch()
    int maxThreadsPerBlock;   // max number of threads per block
    int maxThreadsDim[3];     // max size of each block dimension
    int maxGridSize[3];       // max size of each grid dimension
    size_t totalConstMem;     // total constant memory, in bytes
    int major;                // major compute capability number
    int minor;                // minor compute capability number
    int clockRate;            // clock frequency, in kHz
    size_t textureAlignment;  // alignment requirement for textures
    int deviceOverlap;        // whether the device can execute a cudaMemcpy() and a kernel concurrently
    int multiProcessorCount;  // number of multiprocessors on the device
    int kernelExecTimeoutEnabled; // whether there is a runtime limit on kernel execution
    int integrated;           // whether the GPU is integrated
    int canMapHostMemory;     // whether the device can map host memory into the CUDA device address space
    int computeMode;          // compute mode
    int maxTexture1D;         // max size of 1D textures
    int maxTexture2D[2];      // max dimensions of 2D textures
    int maxTexture3D[3];      // max dimensions of 3D textures
    int maxTexture2DArray[3]; // max dimensions of 2D texture arrays
    int concurrentKernels;    // whether the GPU supports executing multiple kernels concurrently
};
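Beyond printing these properties, the struct can drive device selection. CUDA by Example also introduces cudaChooseDevice(), which takes a partially filled cudaDeviceProp describing what you need and returns the id of the best-matching device. A sketch that asks for compute capability 1.3 or better (the book uses this to find a double-precision-capable card):

```
#include <stdio.h>
#include <string.h>
#include <cuda.h>

int main( void ) {
    cudaDeviceProp prop;
    int dev;

    memset( &prop, 0, sizeof(cudaDeviceProp) );
    prop.major = 1;     // request compute capability >= 1.3
    prop.minor = 3;

    cudaChooseDevice( &dev, &prop );   // id of the closest match
    printf( "Closest matching device id: %d\n", dev );
    cudaSetDevice( dev );              // make it the current device
    return 0;
}
```

All subsequent runtime calls on this host thread then target the chosen device.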
Example program:
#include<stdio.h>
#include<stdlib.h>
#include<cuda.h>

int main()
{
    int i;
    /*cudaGetDeviceCount(&count)*/
    int count;
    cudaGetDeviceCount(&count);
    printf("The count of CUDA devices:%d\n",count);
        ////

    cudaDeviceProp prop;
    for(i=0;i<count;i++)
    {
        cudaGetDeviceProperties(&prop,i);
        printf("\n---General Information for device %d---\n",i);
        printf("Name of the cuda device: %s\n",prop.name);
        printf("Compute capability: %d.%d\n",prop.major,prop.minor);
        printf("Clock rate: %d\n",prop.clockRate);
        printf("Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution):  ");
        if(prop.deviceOverlap)
            printf("Enabled\n");
        else
            printf("Disabled\n");
        printf("Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): ");
        if(prop.kernelExecTimeoutEnabled)
            printf("Enabled\n");
        else
            printf("Disabled\n");

        printf("\n---Memory Information for device %d ---\n",i);
        printf("Total global mem in bytes: %ld\n",prop.totalGlobalMem);
        printf("Total constant Mem: %ld\n",prop.totalConstMem);
        printf("Max mem pitch for memory copies in bytes: %ld\n",prop.memPitch);
        printf("Texture Alignment: %ld\n",prop.textureAlignment);

        printf("\n---MP Information for device %d---\n",i);
        printf("Multiprocessor count: %d\n",prop.multiProcessorCount);
        printf("Shared mem per mp(block): %ld\n",prop.sharedMemPerBlock);
        printf("Registers per mp(block):%d\n",prop.regsPerBlock);
        printf("Threads in warp:%d\n",prop.warpSize);
        printf("Max threads per block: %d\n",prop.maxThreadsPerBlock);
        printf("Max thread dimensions in a block:(%d,%d,%d)\n",prop.maxThreadsDim[0],prop.maxThreadsDim[1],prop.maxThreadsDim[2]);
        printf("Max blocks dimensions in a grid:(%d,%d,%d)\n",prop.maxGridSize[0],prop.maxGridSize[1],prop.maxGridSize[2]);
        printf("\n");

        printf("\nIs the device an integrated GPU:");
        if(prop.integrated)
            printf("Yes!\n");
        else
            printf("No!\n");

        printf("Whether the device can map host memory into CUDA device address space:");
        if(prop.canMapHostMemory)
            printf("Yes!\n");
        else
            printf("No!\n");

        printf("Device's computing mode:%d\n",prop.computeMode);

        printf("\n The maximum size for 1D textures:%d\n",prop.maxTexture1D);
        printf("The maximum dimensions for 2D textures:(%d,%d)\n",prop.maxTexture2D[0],prop.maxTexture2D[1]);
        printf("The maximum dimensions for 3D textures:(%d,%d,%d)\n",prop.maxTexture3D[0],prop.maxTexture3D[1],prop.maxTexture3D[2]);
//      printf("The maximum dimensions for 2D texture arrays:(%d,%d,%d)\n",prop.maxTexture2DArray[0],prop.maxTexture2DArray[1],prop.maxTexture2DArray[2]);

        printf("Whether the device supports executing multiple kernels within the same context simultaneously:");
        if(prop.concurrentKernels)
            printf("Yes!\n");
        else
            printf("No!\n");
    }
    return 0;
}
Run results:
The count of CUDA devices:1

---General Information for device 0---
Name of the cuda device: GeForce GTX 470
Compute capability: 2.0
Clock rate: 1215000
Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution): Enabled
Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): Enabled

---Memory Information for device 0 ---
Total global mem in bytes: 1341325312
Total constant Mem: 65536
Max mem pitch for memory copies in bytes: 2147483647
Texture Alignment: 512

---MP Information for device 0---
Multiprocessor count: 14
Shared mem per mp(block): 49152
Registers per mp(block):32768
Threads in warp:32
Max threads per block: 1024
Max thread dimensions in a block:(1024,1024,64)
Max blocks dimensions in a grid:(65535,65535,65535)

Is the device an integrated GPU:No!
Whether the device can map host memory into CUDA device address space:Yes!
Device's computing mode:0

The maximum size for 1D textures:65536
The maximum dimensions for 2D textures:(65536,65535)
The maximum dimensions for 3D textures:(2048,2048,2048)
Whether the device supports executing multiple kernels within the same context simultaneously:Yes!
Example machine 2:
The count of CUDA devices:2

---General Information for device 0---
Name of the cuda device: Tesla K20c
Compute capability: 3.5
Clock rate: 705500
Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution): Enabled
Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): Disabled

---Memory Information for device 0 ---
Total global mem in bytes: 5032706048
Total constant Mem: 65536
Max mem pitch for memory copies in bytes: 2147483647
Texture Alignment: 512

---MP Information for device 0---
Multiprocessor count: 13
Shared mem per mp(block): 49152
Registers per mp(block):65536
Threads in warp:32
Max threads per block: 1024
Max thread dimensions in a block:(1024,1024,64)
Max blocks dimensions in a grid:(2147483647,65535,65535)

Is the device an integrated GPU:No!
Whether the device can map host memory into CUDA device address space:Yes!
Device's computing mode:0

The maximum size for 1D textures:65536
The maximum dimensions for 2D textures:(65536,65536)
The maximum dimensions for 3D textures:(4096,4096,4096)
Whether the device supports executing multiple kernels within the same context simultaneously:Yes!

---General Information for device 1---
Name of the cuda device: GeForce GTX 480
Compute capability: 2.0
Clock rate: 1401000
Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution): Enabled
Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): Enabled

---Memory Information for device 1 ---
Total global mem in bytes: 1610153984
Total constant Mem: 65536
Max mem pitch for memory copies in bytes: 2147483647
Texture Alignment: 512

---MP Information for device 1---
Multiprocessor count: 15
Shared mem per mp(block): 49152
Registers per mp(block):32768
Threads in warp:32
Max threads per block: 1024
Max thread dimensions in a block:(1024,1024,64)
Max blocks dimensions in a grid:(65535,65535,65535)

Is the device an integrated GPU:No!
Whether the device can map host memory into CUDA device address space:Yes!
Device's computing mode:0

The maximum size for 1D textures:65536
The maximum dimensions for 2D textures:(65536,65535)
The maximum dimensions for 3D textures:(2048,2048,2048)
Whether the device supports executing multiple kernels within the same context simultaneously:Yes!

Reference book: CUDA by Example

http://blog.csdn.net/fulva/article/details/8757089