
GPU Matrix Multiplication (reposted)

2015-10-12 14:25

Matrix multiplication in CUDA

Matrix multiplication is a fundamental building block for scientific computing. Moreover, the algorithmic patterns of matrix multiplication are representative: many other algorithms share similar optimization techniques. Therefore, matrix multiplication is one of the most important examples in learning parallel programming. This page provides a step-by-step example of optimizing matrix multiplication.

The explanation is given in the remaining part of this page. You can download the source code "cuda.zip" and run it following the instructions below. Assume you copy the code samples to a local directory, under the folder ~/examples/cuda/. Unzip the downloaded package there, then compile the example using the Makefile in that folder, i.e., go to the folder and type the command "make". The commands are also shown below. Please first make sure you have set up the environment of the tools properly and can compile the official code examples.
cd ~/examples/cuda/
wget http://www.es.ele.tue.nl/~mwijtvliet/5KK73/downloads/cuda.zip -O cuda.zip
unzip ./cuda.zip
cd cuda
make
./matrixmul

Contents

1 The matrixMul Problem
2 Naive Implementation On CPUs
3 Naive Implementation On GPUs
4 Increasing "Computation-to-Memory Ratio" by Tiling
5 Global Memory Coalescing
6 Avoiding Shared Memory Bank Conflict
7 Computation Optimization
7.1 Step 1: load A0,0 to shared memory.
7.2 Step 2: use 16 iterations to update C0,0.
7.3 Step 3: each thread stores one column of C0,0 from its register to global memory.
8 Loop Unrolling
9 Prefetching

The matrixMul Problem

Given an MxK matrix A and a KxN matrix B, multiply A with B and store the result into an MxN matrix C. The matrixMul example on this page will show several techniques to optimize matrix multiplication on GPU. Most of them are generic and can be applied to other applications. These techniques are:

Tiling
Memory coalescing
Avoiding memory bank conflicts
Computation optimization
Loop unrolling
Prefetching

The performance of these optimization techniques is shown in the figures below. Note: these optimizations, which are tuned for the NVIDIA 8800GT GPU at a matrix size of 4096x4096, could be sub-optimal for other GPUs and other matrix sizes. We will start with a simple serial code running on the CPU, and then go through these optimizations step by step.

Naive Implementation On CPUs

void main() {
    define A, B, C
    for i = 0 to M do
        for j = 0 to N do
            /* compute element C(i,j) */
            for k = 0 to K do
                C(i,j) <= C(i,j) + A(i,k) * B(k,j)
            end
        end
    end
}
To simplify the explanation, square matrices (M == N == K) are used in the illustrations. The figure below shows the memory footprint to compute an element C3,11 (in red). This can be viewed as the inner product of one row of A (in blue) and one column of B (in green).

Naive Implementation On GPUs

/* Codes running on CPU */
void main() {
    define A_cpu, B_cpu, C_cpu in the CPU memory
    define A_gpu, B_gpu, C_gpu in the GPU memory
    memcopy A_cpu to A_gpu
    memcopy B_cpu to B_gpu
    dim3 dimBlock(16, 16)
    dim3 dimGrid(N / dimBlock.x, M / dimBlock.y)
    matrixMul<<<dimGrid, dimBlock>>>(A_gpu, B_gpu, C_gpu, K)
    memcopy C_gpu to C_cpu
}

/* Codes running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
    accu <= 0
    i <= blockIdx.y * blockDim.y + threadIdx.y    // Row i of matrix C
    j <= blockIdx.x * blockDim.x + threadIdx.x    // Column j of matrix C
    for k = 0 to K-1 do
        accu <= accu + A_gpu(i,k) * B_gpu(k,j)
    end
    C_gpu(i,j) <= accu
}
A naive implementation on GPUs assigns one thread to compute one element of matrix C. Each thread loads one row of matrix A and one column of matrix B from global memory, performs the inner product, and stores the result back to matrix C in global memory. The figure shows the memory footprint of one thread on global memory, where matrices A, B, and C are stored. In the naive implementation, the amount of computation is 2 x M x N x K flop, while the amount of global memory access is 2 x M x N x K words. The "computation-to-memory ratio" is approximately 1/4 (flop/byte). Therefore, the naive implementation is bandwidth bounded.

Increasing "Computation-to-Memory Ratio" by Tiling

To increase the "computation-to-memory ratio", tiled matrix multiplication can be applied. One thread block computes one tile of matrix C. One thread in the thread block computes one element of the tile. The figure shows a 32x32 matrix divided into four 16x16 tiles. To compute this, four thread blocks, each with 16x16 threads, can be created. The GPU kernel computes C in multiple iterations. In each iteration, one thread block loads one tile of A and one tile of B from global memory to shared memory, performs computation, and stores the partial result of C in registers. After all the iterations are done, the thread block stores one tile of C into global memory. For example, a thread block can compute C0,0 in two iterations: C0,0 = A0,0 B0,0 + A0,1 B1,0. In the first iteration, the thread block loads tile A0,0 and tile B0,0 from global memory into shared memory. Each thread performs the inner product to produce one element of C. This element of C is stored in a register, which will be accumulated in the next iteration. In the second iteration, the thread block loads tile A0,1 and tile B1,0 from global memory into shared memory. Each thread performs the inner product to produce one element of C, which is accumulated with the previous value. If this is the final iteration, then the element of C in the register file will be stored back into global memory. The CPU code remains the same; here only the GPU kernel is shown.
/* Codes running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
    __shared__ float A_tile(blockDim.y, blockDim.x)
    __shared__ float B_tile(blockDim.x, blockDim.y)
    accu <= 0
    /* Accumulate C tile by tile. */
    for tileIdx = 0 to (K/blockDim.x - 1) do
        /* Load one tile of A and one tile of B into shared mem */
        i <= blockIdx.y * blockDim.y + threadIdx.y    // Row i of matrix A
        j <= tileIdx * blockDim.x + threadIdx.x       // Column j of matrix A
        A_tile(threadIdx.y, threadIdx.x) <= A_gpu(i,j)    // Load A(i,j) to shared mem
        B_tile(threadIdx.x, threadIdx.y) <= B_gpu(j,i)    // Load B(j,i) to shared mem (global mem NOT coalesced)
        __sync()    // Synchronize before computation
        /* Accumulate one tile of C from tiles of A and B in shared mem */
        for k = 0 to blockDim.x do
            accu <= accu + A_tile(threadIdx.y, k) * B_tile(k, threadIdx.x)    // Accumulate for matrix C
        end
        __sync()    // Synchronize
    end
    i <= blockIdx.y * blockDim.y + threadIdx.y    // Row i of matrix C
    j <= blockIdx.x * blockDim.x + threadIdx.x    // Column j of matrix C
    C_gpu(i,j) <= accu    // Store accumulated value to C(i,j)
}
In the tiled implementation, the amount of computation is still 2 x M x N x K flop. However, using a tile size of B, the amount of global memory access is 2 x M x N x K / B words. The "computation-to-memory ratio" is approximately B/4 (flop/byte). We can now tune the "computation-to-memory" ratio by changing the tile size B.

Global Memory Coalescing

Two-dimensional arrays in C/C++ are row-major. In the tiled implementation above, neighbouring threads have coalesced access to matrix A, but do not have coalesced access to matrix B. In column-major languages, such as Fortran, the problem is the other way around. An obvious solution is to transpose matrix B on the CPU before offloading it to GPU memory.
/* Codes running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
    __shared__ float A_tile(blockDim.y, blockDim.x)
    __shared__ float B_tile(blockDim.x, blockDim.y)
    accu <= 0
    /* Accumulate C tile by tile. */
    for tileIdx = 0 to (K/blockDim.x - 1) do
        /* Load one tile of A and one tile of B into shared mem */
        i <= blockIdx.y * blockDim.y + threadIdx.y    // Row i of matrix A
        j <= tileIdx * blockDim.x + threadIdx.x       // Column j of matrix A
        A_tile(threadIdx.y, threadIdx.x) <= A_gpu(i,j)    // Load A(i,j) to shared mem
        B_tile(threadIdx.x, threadIdx.y) <= B_gpu(i,j)    // Load B(i,j) to shared mem (global mem coalesced)
        __sync()    // Synchronize before computation
        /* Accumulate one tile of C from tiles of A and B in shared mem */
        for k = 0 to blockDim.x do
            accu <= accu + A_tile(threadIdx.y, k) * B_tile(threadIdx.x, k)    // Shared mem bank conflict
        end
        __sync()    // Synchronize
    end
    i <= blockIdx.y * blockDim.y + threadIdx.y    // Row i of matrix C
    j <= blockIdx.x * blockDim.x + threadIdx.x    // Column j of matrix C
    C_gpu(i,j) <= accu    // Store accumulated value to C(i,j)
}

Avoiding Shared Memory Bank Conflict

/* Codes running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
    __shared__ float A_tile(blockDim.y, blockDim.x)
    __shared__ float B_tile(blockDim.x, blockDim.y)
    accu <= 0
    /* Accumulate C tile by tile. */
    for tileIdx = 0 to (K/blockDim.x - 1) do
        /* Load one tile of A and one tile of B into shared mem */
        i <= blockIdx.y * blockDim.y + threadIdx.y    // Row i of matrix A
        j <= tileIdx * blockDim.x + threadIdx.x       // Column j of matrix A
        A_tile(threadIdx.y, threadIdx.x) <= A_gpu(i,j)    // Load A(i,j) to shared mem
        B_tile(threadIdx.y, threadIdx.x) <= B_gpu(i,j)    // Load B(i,j) to shared mem (no shared mem bank conflict)
        __sync()    // Synchronize before computation
        /* Accumulate one tile of C from tiles of A and B in shared mem */
        for k = 0 to blockDim.x do
            accu <= accu + A_tile(threadIdx.y, k) * B_tile(k, threadIdx.x)    // No shared mem bank conflict
        end
        __sync()    // Synchronize
    end
    i <= blockIdx.y * blockDim.y + threadIdx.y    // Row i of matrix C
    j <= blockIdx.x * blockDim.x + threadIdx.x    // Column j of matrix C
    C_gpu(i,j) <= accu    // Store accumulated value to C(i,j)
}

Computation Optimization

The kernel is now computation bound. Therefore, we need to increase the portion of useful floating-point operations in the total instructions. Because the inner product consumes most of the time, it is important to make sure this part is efficient. If we check the binary code for the inner product, we will discover that one line of code in CUDA takes two instructions in the binary.
/* CUDA code for inner product */
accu <= accu + A_tile(threadIdx.y, k) * B_tile(k, threadIdx.x)

/* Disassembled from cubin binary */
mov.b32 $r0, s[$ofs4+0x0000]
mad.rn.f32 $r9, s[$ofs1+0x002c], $r0, $r9
The current architecture of the Streaming Multiprocessor (SM) only allows one source operand from shared memory. However, computing the inner product requires two source operands from shared memory. One solution is to store matrix A or matrix B in the register file, but then the matrix in the register file cannot be shared by different threads, which decreases the "computation-to-memory ratio". A better solution is to perform an outer product instead of the inner product. In this case, matrix A is stored in shared memory, but matrices B and C are stored in registers. The outer product does not require sharing of matrix B and matrix C; therefore, each thread only stores one element of B and one column of the tile of C in registers. The "computation-to-memory ratio" of the outer product is the same as that of the inner product.
/* CUDA code for outer product */
/* accu[i] and b are stored in the register file */
accu[i] <= accu[i] + A_tile(i) * b

/* Disassembled from cubin binary */
mad.rn.f32 $r9, s[$ofs2+0x0010], $r29, $r9
Here is an example of multiplying tile A0,0 and tile B0,0 to compute C0,0 using the outer product. In this example, A0,0 is 16x16, B0,0 is 16x64, and C0,0 is 16x64. A thread block of 64 threads computes C0,0.

Step 1: load A0,0 to shared memory.

Step 2: use 16 iterations to update C0,0.

Each thread stores one element of B0,0 in its register. Each thread also stores one column of C0,0 in its registers. Iteration 1: outer product between the first column of A0,0 and the first row of B0,0, and update C0,0. Iteration 2: outer product between the second column of A0,0 and the second row of B0,0, and update C0,0. Continue iterations 3, 4, ..., 15 in a similar way. Iteration 16: outer product between the 16th column of A0,0 and the 16th row of B0,0, and update C0,0.

Step 3: each thread stores one column of C0,0 from its register to global memory.

Loop Unrolling

Use the pragma to tell the compiler to unroll the loops. nvcc will unroll the inner loops by default, but it will not unroll the outer loop unless told to by the pragma.

#pragma unroll

Loop unrolling sometimes has side effects on register usage, which may limit the number of concurrent threads. However, loop unrolling does not increase register usage in the matrixMul example.

Prefetching

/* Codes running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
    __shared__ float A_tile0(blockDim.y, blockDim.x)
    __shared__ float A_tile1(blockDim.x, blockDim.y)
    float *pointer0 = A_tile0
    float *pointer1 = A_tile1
    fetch one tile of matrix A_gpu to pointer0
    __sync()
    /* Accumulate C tile by tile. */
    for tileIdx = 0 to (K/blockDim.x - 1) do
        prefetch one tile of matrix A_gpu to pointer1
        accumulate C using pointer0
        __sync()
        swap pointer0 and pointer1
    end
    store tile C to global memory
}
Example written by Zhenyu Ye (z.ye@tue.nl)