GPU Matrix Multiplication (reposted)
2015-10-12 14:25
Matrix multiplication in CUDA

Matrix multiplication is a fundamental building block for scientific computing. Moreover, the algorithmic patterns of matrix multiplication are representative: many other algorithms share similar optimization techniques. Therefore, matrix multiplication is one of the most important examples in learning parallel programming. This page provides a step-by-step example of optimizing matrix multiplication. You can download the source code and run it following the instructions below:

cd ~/examples/cuda/
wget http://www.es.ele.tue.nl/~mwijtvliet/5KK73/downloads/cuda.zip -O cuda.zip
unzip ./cuda.zip
cd cuda
make
./matrixmul
The matrixMul Problem

Given an MxK matrix A and a KxN matrix B, multiply A with B and store the result into an MxN matrix C. The matrixMul example on this page will show several techniques to optimize matrix multiplication on GPU. Most of them are generic and can be applied to other applications. These techniques are:

- Tiling
- Memory coalescing
- Avoiding memory bank conflicts
- Computation optimization
- Loop unrolling
- Prefetching

The performance of these optimization techniques is shown in the figures below (click a figure to enlarge). Note: these optimizations are tuned for an NVIDIA 8800GT GPU at a matrix size of 4096x4096 and could be sub-optimal for other GPUs and other matrix sizes.

Naive Implementation On CPUs
void main() {
    define A, B, C
    for i = 0 to M do
        for j = 0 to N do
            /* compute element C(i,j) */
            for k = 0 to K do
                C(i,j) <= C(i,j) + A(i,k) * B(k,j)
            end
        end
    end
}

To simplify the explanation, square matrices (M == N == K) are used in the illustrations. The figure below shows the memory footprint for computing an element C(3,11) (in red). This can be viewed as the inner product of one row of A (in blue) and one column of B (in green).
Naive Implementation On GPUs

/* Codes running on CPU */
void main() {
    define A_cpu, B_cpu, C_cpu in the CPU memory
    define A_gpu, B_gpu, C_gpu in the GPU memory
    memcopy A_cpu to A_gpu
    memcopy B_cpu to B_gpu
    dim3 dimBlock(16, 16)
    dim3 dimGrid(N / dimBlock.x, M / dimBlock.y)
    matrixMul<<<dimGrid, dimBlock>>>(A_gpu, B_gpu, C_gpu, K)
    memcopy C_gpu to C_cpu
}

/* Codes running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
    accu <= 0
    // Row i of matrix C
    i <= blockIdx.y * blockDim.y + threadIdx.y
    // Column j of matrix C
    j <= blockIdx.x * blockDim.x + threadIdx.x
    for k = 0 to K-1 do
        accu <= accu + A_gpu(i,k) * B_gpu(k,j)
    end
    C_gpu(i,j) <= accu
}

A naive implementation on GPUs assigns one thread to compute one element of matrix C. Each thread loads one row of matrix A and one column of matrix B from global memory, does the inner product, and stores the result back to matrix C in global memory. The figure shows the memory footprint of one thread on global memory, where matrices A, B, and C are stored.
Increasing "Computation-to-Memory Ratio" by Tiling

To increase the "computation-to-memory ratio", tiled matrix multiplication can be applied. One thread block computes one tile of matrix C. One thread in the thread block computes one element of the tile. The figure shows a 32x32 matrix divided into four 16x16 tiles. To compute this, four thread blocks, each with 16x16 threads, can be created.

/* Codes running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
    __shared__ float A_tile(blockDim.y, blockDim.x)
    __shared__ float B_tile(blockDim.x, blockDim.y)
    accu <= 0
    /* Accumulate C tile by tile. */
    for tileIdx = 0 to (K/blockDim.x - 1) do
        /* Load one tile of A and one tile of B into shared mem */
        // Row i of matrix A
        i <= blockIdx.y * blockDim.y + threadIdx.y
        // Column j of matrix A
        j <= tileIdx * blockDim.x + threadIdx.x
        // Load A(i,j) to shared mem
        A_tile(threadIdx.y, threadIdx.x) <= A_gpu(i,j)
        // Row i of matrix B
        i <= tileIdx * blockDim.y + threadIdx.x
        // Column j of matrix B
        j <= blockIdx.x * blockDim.x + threadIdx.y
        // Load B(i,j) to shared mem
        B_tile(threadIdx.x, threadIdx.y) <= B_gpu(i,j)   // Global mem access not coalesced
        // Synchronize before computation
        __sync()
        /* Accumulate one tile of C from the tiles of A and B in shared mem */
        for k = 0 to blockDim.x - 1 do
            // Accumulate for matrix C
            accu <= accu + A_tile(threadIdx.y, k) * B_tile(k, threadIdx.x)
        end
        // Synchronize
        __sync()
    end
    // Row i of matrix C
    i <= blockIdx.y * blockDim.y + threadIdx.y
    // Column j of matrix C
    j <= blockIdx.x * blockDim.x + threadIdx.x
    // Store accumulated value to C(i,j)
    C_gpu(i,j) <= accu
}

In the tiled implementation, the amount of computation is still 2 x M x N x K flop. However, using a tile size of B, the amount of global memory access is 2 x M x N x K / B words. The "computation-to-memory ratio" is approximately B/4 (flop/byte). We can now tune the "computation-to-memory" ratio by changing the tile size B.
Global Memory Coalescing

Two-dimensional arrays in C/C++ are row-major. In the tiled implementation above, neighbouring threads have coalesced access to matrix A, but do not have coalesced access to matrix B. In column-major languages, such as Fortran, the problem is the other way around. An obvious solution is to transpose matrix B on the CPU before offloading it to GPU memory.

/* Codes running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
    __shared__ float A_tile(blockDim.y, blockDim.x)
    __shared__ float B_tile(blockDim.x, blockDim.y)
    accu <= 0
    /* Accumulate C tile by tile. */
    for tileIdx = 0 to (K/blockDim.x - 1) do
        /* Load one tile of A and one tile of B into shared mem */
        // Row i of matrix A
        i <= blockIdx.y * blockDim.y + threadIdx.y
        // Column j of matrix A
        j <= tileIdx * blockDim.x + threadIdx.x
        // Load A(i,j) to shared mem
        A_tile(threadIdx.y, threadIdx.x) <= A_gpu(i,j)
        // Row i of matrix B
        i <= tileIdx * blockDim.y + threadIdx.y
        // Column j of matrix B
        j <= blockIdx.x * blockDim.x + threadIdx.x
        // Load B(i,j) to shared mem, transposed in shared mem
        B_tile(threadIdx.x, threadIdx.y) <= B_gpu(i,j)   // Global mem access coalesced
        // Synchronize before computation
        __sync()
        /* Accumulate one tile of C from the tiles of A and B in shared mem */
        for k = 0 to blockDim.x - 1 do
            // Accumulate for matrix C
            // Shared mem bank conflict
            accu <= accu + A_tile(threadIdx.y, k) * B_tile(threadIdx.x, k)
        end
        // Synchronize
        __sync()
    end
    // Row i of matrix C
    i <= blockIdx.y * blockDim.y + threadIdx.y
    // Column j of matrix C
    j <= blockIdx.x * blockDim.x + threadIdx.x
    // Store accumulated value to C(i,j)
    C_gpu(i,j) <= accu
}
Avoiding Shared Memory Bank Conflict

/* Codes running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
    __shared__ float A_tile(blockDim.y, blockDim.x)
    __shared__ float B_tile(blockDim.x, blockDim.y)
    accu <= 0
    /* Accumulate C tile by tile. */
    for tileIdx = 0 to (K/blockDim.x - 1) do
        /* Load one tile of A and one tile of B into shared mem */
        // Row i of matrix A
        i <= blockIdx.y * blockDim.y + threadIdx.y
        // Column j of matrix A
        j <= tileIdx * blockDim.x + threadIdx.x
        // Load A(i,j) to shared mem
        A_tile(threadIdx.y, threadIdx.x) <= A_gpu(i,j)
        // Row i of matrix B
        i <= tileIdx * blockDim.y + threadIdx.y
        // Column j of matrix B
        j <= blockIdx.x * blockDim.x + threadIdx.x
        // Load B(i,j) to shared mem
        B_tile(threadIdx.y, threadIdx.x) <= B_gpu(i,j)   // No shared mem bank conflict
        // Synchronize before computation
        __sync()
        /* Accumulate one tile of C from the tiles of A and B in shared mem */
        for k = 0 to blockDim.x - 1 do
            // Accumulate for matrix C
            // No shared mem bank conflict
            accu <= accu + A_tile(threadIdx.y, k) * B_tile(k, threadIdx.x)
        end
        // Synchronize
        __sync()
    end
    // Row i of matrix C
    i <= blockIdx.y * blockDim.y + threadIdx.y
    // Column j of matrix C
    j <= blockIdx.x * blockDim.x + threadIdx.x
    // Store accumulated value to C(i,j)
    C_gpu(i,j) <= accu
}
Computation Optimization

The kernel is computation bound. Therefore, we need to increase the portion of useful floating point operations in the total instructions. Because the inner product consumes most of the time, it is important to make sure this part is efficient. If we check the binary code for the inner product, we will discover that one line of code in CUDA takes two instructions in the binary.

/* CUDA code for inner product */
accu <= accu + A_tile(threadIdx.y, k) * B_tile(k, threadIdx.x)

/* Disassembled from cubin binary */
mov.b32 $r0, s[$ofs4+0x0000]
mad.rn.f32 $r9, s[$ofs1+0x002c], $r0, $r9

The current architecture of the Streaming Multiprocessor (SM) only allows one source operand from the shared memory. However, computing the inner product requires two source operands from the shared memory. One solution is to store matrix A or matrix B in the register file, but then the matrix in the register file cannot be shared by different threads, which decreases the "computation-to-memory ratio". A better solution is to perform an outer product instead of an inner product. In this case, matrix A is stored in shared memory, but matrices B and C are stored in registers. The outer product does not require sharing of matrix B and matrix C; therefore, each thread only stores one element of B and one column of the tile of C in its registers. The "computation-to-memory ratio" of the outer product is the same as the inner product.
/* CUDA code for outer product */
/* accu[i] and b are stored in the register file */
accu[i] <= accu[i] + A_tile(i) * b

/* Disassembled from cubin binary */
mad.rn.f32 $r9, s[$ofs2+0x0010], $r29, $r9

Here is an example of multiplying tile A(0,0) and tile B(0,0) to compute C(0,0) using the outer product. In this example, A(0,0) is 16x16, B(0,0) is 16x64, and C(0,0) is 16x64. A thread block of 64 threads is computing C(0,0).

Step 1: load A(0,0) to shared memory.

Step 2: use 16 iterations to update C(0,0). Each thread stores one element of B(0,0) in its register. Each thread also stores one column of C(0,0) in its registers. Iteration 1: perform the outer product between the first column of A(0,0) and the first row of B(0,0), and update C(0,0).

Step 3: each thread stores one column of C(0,0) from its registers to global memory.
Loop Unrolling

Use the pragma to tell the compiler to unroll the loops:

#pragma unroll

nvcc will unroll the inner loops by default, but it will not unroll the outer loop unless told to by the pragma. Loop unrolling sometimes has side effects on register usage, which may limit the number of concurrent threads. However, loop unrolling does not increase register usage in the matrixMul example.
Prefetching

Prefetching uses double buffering: while the tile in one shared-memory buffer is being used for computation, the next tile is fetched into the other buffer, overlapping memory latency with computation.

/* Codes running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
    __shared__ float A_tile0(blockDim.y, blockDim.x)
    __shared__ float A_tile1(blockDim.x, blockDim.y)
    float *pointer0 = A_tile0
    float *pointer1 = A_tile1
    fetch one tile of matrix A_gpu to pointer0
    __sync()
    /* Accumulate C tile by tile. */
    for tileIdx = 0 to (K/blockDim.x - 1) do
        prefetch one tile of matrix A_gpu to pointer1
        accumulate C using pointer0
        __sync()
        swap pointer0 and pointer1
    end
    store tile C to global memory
}

Example written by Zhenyu Ye (z.ye@tue.nl)