CUDA ---- Memory Access
2015-06-13 15:21
Memory Access Patterns
Most kernels start by fetching data from global memory, and the performance of most GPU applications is limited by bandwidth. Maximizing the application's use of global memory bandwidth is therefore the first step toward high performance. Put differently, if global memory accesses are not tuned well, other optimizations will have little effect. The material below involves a fair amount about the L1 cache; a basic understanding is enough, since global loads are no longer cached in L1 by default from Maxwell on, but the general caching concepts still apply.

Aligned and Coalesced Access
Global memory loads and stores are serviced through caches. All data starts out in DRAM, i.e. physical device memory, while the global memory a kernel accesses is actually a logical address space. A kernel's memory requests are serviced between DRAM and the SM's on-chip memory in 128-byte or 32-byte transactions.

Example: Simple Math with the AoS Data Layout

int main(int argc, char **argv)
{
    // set up device
    int dev = 0;
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, dev);
    printf("%s test struct of array at ", argv[0]);
    printf("device %d: %s \n", dev, deviceProp.name);
    cudaSetDevice(dev);

    // allocate host memory
    int nElem = LEN;
    size_t nBytes = nElem * sizeof(innerStruct);
    innerStruct *h_A     = (innerStruct *)malloc(nBytes);
    innerStruct *hostRef = (innerStruct *)malloc(nBytes);
    innerStruct *gpuRef  = (innerStruct *)malloc(nBytes);

    // initialize host array
    initialInnerStruct(h_A, nElem);
    testInnerStructHost(h_A, hostRef, nElem);

    // allocate device memory
    innerStruct *d_A, *d_C;
    cudaMalloc((innerStruct **)&d_A, nBytes);
    cudaMalloc((innerStruct **)&d_C, nBytes);

    // copy data from host to device
    cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice);

    // set up block size
    int blocksize = 128;
    if (argc > 1) blocksize = atoi(argv[1]);

    // execution configuration
    dim3 block(blocksize, 1);
    dim3 grid((nElem + block.x - 1) / block.x, 1);

    // kernel 1: warmup
    double iStart = seconds();
    warmup<<<grid, block>>>(d_A, d_C, nElem);
    cudaDeviceSynchronize();
    double iElaps = seconds() - iStart;
    printf("warmup      <<< %3d, %3d >>> elapsed %f sec\n", grid.x, block.x, iElaps);
    cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost);
    checkInnerStruct(hostRef, gpuRef, nElem);

    // kernel 2: testInnerStruct
    iStart = seconds();
    testInnerStruct<<<grid, block>>>(d_A, d_C, nElem);
    cudaDeviceSynchronize();
    iElaps = seconds() - iStart;
    printf("innerstruct <<< %3d, %3d >>> elapsed %f sec\n", grid.x, block.x, iElaps);
    cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost);
    checkInnerStruct(hostRef, gpuRef, nElem);

    // free memory on both host and device
    cudaFree(d_A);
    cudaFree(d_C);
    free(h_A);
    free(hostRef);
    free(gpuRef);

    // reset device
    cudaDeviceReset();
    return EXIT_SUCCESS;
}
Compile and run (Fermi M2070):
$ nvcc -O3 -arch=sm_20 simpleMathAoS.cu -o simpleMathAoS
$ ./simpleMathAoS
innerStruct <<< 8192, 128 >>> elapsed 0.000286 sec
Check the load and store efficiency:
$ nvprof --devices 0 --metrics gld_efficiency,gst_efficiency ./simpleMathAoS
gld_efficiency 50.00%
gst_efficiency 50.00%
As expected, both reach only half: the extra portion of each transaction is spent loading/storing the other struct member, which the kernel does not need.
Example: Simple Math with the SoA Data Layout
__global__ void testInnerArray(InnerArray *data, InnerArray *result, const int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float tmpx = data->x[i];
        float tmpy = data->y[i];
        tmpx += 10.f;
        tmpy += 20.f;
        result->x[i] = tmpx;
        result->y[i] = tmpy;
    }
}
Allocate global memory:
int nElem = LEN;
size_t nBytes = sizeof(InnerArray);
InnerArray *d_A, *d_C;
cudaMalloc((InnerArray **)&d_A, nBytes);
cudaMalloc((InnerArray **)&d_C, nBytes);
Compile and run:
$ nvcc -O3 -arch=sm_20 simpleMathSoA.cu -o simpleSoA
$ ./simpleSoA
innerArray <<< 8192, 128 >>> elapsed 0.000200 sec
Check the load/store efficiency:
$ nvprof --devices 0 --metrics gld_efficiency,gst_efficiency ./simpleMathSoA
gld_efficiency 100.00%
gst_efficiency 100.00%
Performance Tuning
When tuning device memory bandwidth utilization, we mainly strive for two goals:

- Aligned and coalesced memory accesses that reduce wasted bandwidth
- Sufficient concurrent memory operations to hide memory latency
Unrolling Techniques
Unrolling a loop exposes more independent memory operations (loop unrolling was covered in detail in an earlier post). Consider the earlier readSegment example: we modify readOffset so that each thread performs four independent memory operations, like this:

__global__ void readOffsetUnroll4(float *A, float *B, float *C, const int n, int offset)
{
    unsigned int i = blockIdx.x * blockDim.x * 4 + threadIdx.x;
    unsigned int k = i + offset;
    if (k + 3 * blockDim.x < n)
    {
        C[i]                  = A[k]                  + B[k];
        C[i + blockDim.x]     = A[k + blockDim.x]     + B[k + blockDim.x];
        C[i + 2 * blockDim.x] = A[k + 2 * blockDim.x] + B[k + 2 * blockDim.x];
        C[i + 3 * blockDim.x] = A[k + 3 * blockDim.x] + B[k + 3 * blockDim.x];
    }
}
Compile and run (you may need -Xptxas -dlcm=ca to enable L1 caching of global loads):
$ ./readSegmentUnroll 0
warmup  <<< 32768, 512 >>> offset   0 elapsed 0.001990 sec
unroll4 <<<  8192, 512 >>> offset   0 elapsed 0.000599 sec
$ ./readSegmentUnroll 11
warmup  <<< 32768, 512 >>> offset  11 elapsed 0.002114 sec
unroll4 <<<  8192, 512 >>> offset  11 elapsed 0.000615 sec
$ ./readSegmentUnroll 128
warmup  <<< 32768, 512 >>> offset 128 elapsed 0.001989 sec
unroll4 <<<  8192, 512 >>> offset 128 elapsed 0.000598 sec
As we can see, unrolling has a large impact on performance, larger even than address alignment. For I/O-bound kernels like this one, increasing memory-level parallelism takes priority over alignment as an optimization. Still, note that the aligned runs continue to outperform the misaligned ones.
Unrolling does not change the total amount of memory traffic needed; it only changes how many operations are in flight concurrently. We can check the relevant metrics:
$ nvprof --devices 0 --metrics gld_efficiency,gst_efficiency ./readSegmentUnroll 11
readOffset        gld_efficiency  49.69%
readOffset        gst_efficiency 100.00%
readOffsetUnroll4 gld_efficiency  50.79%
readOffsetUnroll4 gst_efficiency 100.00%
$ nvprof --devices 0 --metrics gld_transactions,gst_transactions ./readSegmentUnroll 11
readOffset        gld_transactions 132384
readOffset        gst_transactions  32928
readOffsetUnroll4 gld_transactions  33152
readOffsetUnroll4 gst_transactions   8064
Exposing More Parallelism
Here the tuning consists of adjusting the grid and block configuration. Results with unrolling applied:

$ ./readSegmentUnroll 0 1024 22
unroll4 <<< 1024, 1024 >>> offset 0 elapsed 0.000169 sec
$ ./readSegmentUnroll 0 512 22
unroll4 <<< 2048,  512 >>> offset 0 elapsed 0.000159 sec
$ ./readSegmentUnroll 0 256 22
unroll4 <<< 4096,  256 >>> offset 0 elapsed 0.000157 sec
$ ./readSegmentUnroll 0 128 22
unroll4 <<< 8192,  128 >>> offset 0 elapsed 0.000158 sec
The best result comes from the configuration with 256 threads per block. Although 128 threads per block exposes more block-level parallelism, it performs slightly worse than 256, and the reason lies in the resource limits of the compute capability. Taking this code as the example: on Fermi, each SM can host at most 8 resident blocks and run at most 48 concurrent warps. With 128 threads per block, each block contains 4 warps; since at most 8 blocks can be resident on an SM at once, this kernel can keep only 32 warps per SM in flight, leaving 16 warps' worth of capacity unused and hence losing some performance. This can be verified with the achieved occupancy metric.
Reference: Professional CUDA C Programming