
GPU Matrix Multiplication (reposted)

2015-10-12 14:25

Matrix multiplication in CUDA

Matrix multiplication is a fundamental building block for scientific computing. Moreover, the algorithmic patterns of matrix multiplication are representative: many other algorithms share similar optimization techniques. Therefore, matrix multiplication is one of the most important examples in learning parallel programming. This page provides a step-by-step example of optimizing matrix multiplication.

The explanation is given in the remaining part of this page. You can download the source code "cuda.zip" and run it following the instructions below. Assume you copy the code samples to a local directory, under the folder ~/examples/cuda/. Unzip the downloaded package there, then compile the example using the Makefile in that folder, i.e., go to the folder and type the command "make". The commands are also shown below. Please first make sure you have set up the environment of the tools properly and can compile the official code examples.
cd ~/examples/cuda/
wget http://www.es.ele.tue.nl/~mwijtvliet/5KK73/downloads/cuda.zip -O cuda.zip
unzip ./cuda.zip
cd cuda
make
./matrixmul

Contents

1 The matrixMul Problem
2 Naive Implementation On CPUs
3 Naive Implementation On GPUs
4 Increasing "Computation-to-Memory Ratio" by Tiling
5 Global Memory Coalescing
6 Avoiding Shared Memory Bank Conflict
7 Computation Optimization
7.1 Step 1: load A0,0 to shared memory.
7.2 Step 2: use 16 iterations to update C0,0.
7.3 Step 3: each thread stores one column of C0,0 from its register to global memory.
8 Loop Unrolling
9 Prefetching

The matrixMul Problem

Given an MxK matrix A and a KxN matrix B, multiply A with B and store the result into an MxN matrix C. The matrixMul example on this page will show several techniques to optimize matrix multiplication on GPU. Most of them are generic and can be applied to other applications. These techniques are:

Tiling
Memory coalescing
Avoiding memory bank conflicts
Computation optimization
Loop unrolling
Prefetching

The performance of these optimization techniques is shown in the figures below. Note: these optimizations, which are tuned for the NVIDIA 8800GT GPU at a matrix size of 4096x4096, could be sub-optimal for other GPUs and other matrix sizes. We will start with a simple serial code running on the CPU, and then go through these optimizations step by step.

Naive Implementation On CPUs

void main() {
    define A, B, C
    for i = 0 to M do
        for j = 0 to N do
            /* compute element C(i,j) */
            for k = 0 to K do
                C(i,j) <= C(i,j) + A(i,k) * B(k,j)
            end
        end
    end
}
To simplify the explanation, square matrices (M == N == K) are used in the illustrations. The figure below shows the memory footprint to compute an element C3,11 (in red). This can be viewed as the inner product of one row of A (in blue) and one column of B (in green).

Naive Implementation On GPUs

/* Codes running on CPU */
void main() {
    define A_cpu, B_cpu, C_cpu in the CPU memory
    define A_gpu, B_gpu, C_gpu in the GPU memory
    memcopy A_cpu to A_gpu
    memcopy B_cpu to B_gpu
    dim3 dimBlock(16, 16)
    dim3 dimGrid(N / dimBlock.x, M / dimBlock.y)
    matrixMul<<<dimGrid, dimBlock>>>(A_gpu, B_gpu, C_gpu, K)
    memcopy C_gpu to C_cpu
}

/* Codes running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
    accu <= 0
    i <= blockIdx.y * blockDim.y + threadIdx.y    // Row i of matrix C
    j <= blockIdx.x * blockDim.x + threadIdx.x    // Column j of matrix C
    for k = 0 to K-1 do
        accu <= accu + A_gpu(i,k) * B_gpu(k,j)
    end
    C_gpu(i,j) <= accu
}
A naive implementation on GPUs assigns one thread to compute one element of matrix C. Each thread loads one row of matrix A and one column of matrix B from global memory, performs the inner product, and stores the result back to matrix C in global memory. The figure shows the memory footprint of one thread on global memory, where matrices A, B, and C are stored. In the naive implementation, the amount of computation is 2 x M x N x K flop, while the amount of global memory access is 2 x M x N x K words. The "computation-to-memory ratio" is approximately 1/4 (flop/byte). Therefore, the naive implementation is bandwidth bounded.

Increasing "Computation-to-Memory Ratio" by Tiling

To increase the "computation-to-memory ratio", tiled matrix multiplication can be applied. One thread block computes one tile of matrix C. One thread in the thread block computes one element of the tile. The figure shows a 32x32 matrix divided into four 16x16 tiles. To compute this, four thread blocks, each with 16x16 threads, can be created. The GPU kernel computes C in multiple iterations. In each iteration, one thread block loads one tile of A and one tile of B from global memory to shared memory, performs computation, and stores the partial result of C in registers. After all the iterations are done, the thread block stores one tile of C into global memory. For example, a thread block can compute C0,0 in two iterations: C0,0 = A0,0 B0,0 + A0,1 B1,0. In the first iteration, the thread block loads tile A0,0 and tile B0,0 from global memory into shared memory. Each thread performs the inner product to produce one element of C. This element of C is stored in a register, which will be accumulated in the next iteration. In the second iteration, the thread block loads tile A0,1 and tile B1,0 from global memory into shared memory. Each thread performs the inner product to produce one element of C, which is accumulated with the previous value. If this is the final iteration, then the element of C in the register file will be stored back into global memory. The CPU code remains the same; here only the GPU kernel is shown.
/* Codes running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
    __shared__ float A_tile(blockDim.y, blockDim.x)
    __shared__ float B_tile(blockDim.x, blockDim.y)
    accu <= 0
    /* Accumulate C tile by tile. */
    for tileIdx = 0 to (K/blockDim.x - 1) do
        /* Load one tile of A and one tile of B into shared mem */
        i <= blockIdx.y * blockDim.y + threadIdx.y    // Row i of matrix A
        j <= tileIdx * blockDim.x + threadIdx.x       // Column j of matrix A
        A_tile(threadIdx.y, threadIdx.x) <= A_gpu(i,j)    // Load A(i,j) to shared mem
        B_tile(threadIdx.x, threadIdx.y) <= B_gpu(j,i)    // Load B(j,i) to shared mem (global mem NOT coalesced)
        __sync()    // Synchronize before computation
        /* Accumulate one tile of C from tiles of A and B in shared mem */
        for k = 0 to blockDim.x do
            accu <= accu + A_tile(threadIdx.y, k) * B_tile(k, threadIdx.x)    // Accumulate for matrix C
        end
        __sync()    // Synchronize
    end
    i <= blockIdx.y * blockDim.y + threadIdx.y    // Row i of matrix C
    j <= blockIdx.x * blockDim.x + threadIdx.x    // Column j of matrix C
    C_gpu(i,j) <= accu    // Store accumulated value to C(i,j)
}
In the tiled implementation, the amount of computation is still 2 x M x N x K flop. However, using a tile size of B, the amount of global memory access is 2 x M x N x K / B words. The "computation-to-memory ratio" is approximately B/4 (flop/byte). We can now tune the "computation-to-memory" ratio by changing the tile size B.

Global Memory Coalescing

Two-dimensional arrays in C/C++ are row-major. In the tiled implementation above, neighbouring threads have coalesced access to matrix A, but do not have coalesced access to matrix B. In column-major languages, such as Fortran, the problem is the other way around. An obvious solution is to transpose matrix B on the CPU before offloading it to GPU memory.
/* Codes running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
    __shared__ float A_tile(blockDim.y, blockDim.x)
    __shared__ float B_tile(blockDim.x, blockDim.y)
    accu <= 0
    /* Accumulate C tile by tile. */
    for tileIdx = 0 to (K/blockDim.x - 1) do
        /* Load one tile of A and one tile of B into shared mem */
        i <= blockIdx.y * blockDim.y + threadIdx.y    // Row i of matrix A
        j <= tileIdx * blockDim.x + threadIdx.x       // Column j of matrix A
        A_tile(threadIdx.y, threadIdx.x) <= A_gpu(i,j)    // Load A(i,j) to shared mem
        B_tile(threadIdx.x, threadIdx.y) <= B_gpu(i,j)    // Load B(i,j) to shared mem (global mem coalesced)
        __sync()    // Synchronize before computation
        /* Accumulate one tile of C from tiles of A and B in shared mem */
        for k = 0 to blockDim.x do
            accu <= accu + A_tile(threadIdx.y, k) * B_tile(threadIdx.x, k)    // Shared mem bank conflict
        end
        __sync()    // Synchronize
    end
    i <= blockIdx.y * blockDim.y + threadIdx.y    // Row i of matrix C
    j <= blockIdx.x * blockDim.x + threadIdx.x    // Column j of matrix C
    C_gpu(i,j) <= accu    // Store accumulated value to C(i,j)
}

Avoiding Shared Memory Bank Conflict

/* Codes running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
    __shared__ float A_tile(blockDim.y, blockDim.x)
    __shared__ float B_tile(blockDim.x, blockDim.y)
    accu <= 0
    /* Accumulate C tile by tile. */
    for tileIdx = 0 to (K/blockDim.x - 1) do
        /* Load one tile of A and one tile of B into shared mem */
        i <= blockIdx.y * blockDim.y + threadIdx.y    // Row i of matrix A
        j <= tileIdx * blockDim.x + threadIdx.x       // Column j of matrix A
        A_tile(threadIdx.y, threadIdx.x) <= A_gpu(i,j)    // Load A(i,j) to shared mem
        B_tile(threadIdx.y, threadIdx.x) <= B_gpu(i,j)    // Load B(i,j) to shared mem (no shared mem bank conflict)
        __sync()    // Synchronize before computation
        /* Accumulate one tile of C from tiles of A and B in shared mem */
        for k = 0 to blockDim.x do
            accu <= accu + A_tile(threadIdx.y, k) * B_tile(k, threadIdx.x)    // No shared mem bank conflict
        end
        __sync()    // Synchronize
    end
    i <= blockIdx.y * blockDim.y + threadIdx.y    // Row i of matrix C
    j <= blockIdx.x * blockDim.x + threadIdx.x    // Column j of matrix C
    C_gpu(i,j) <= accu    // Store accumulated value to C(i,j)
}

Computation Optimization

The kernel is now computation bound. Therefore, we need to increase the portion of useful floating-point operations in the total instructions. Because the inner product consumes most of the time, it is important to make sure this part is efficient. If we check the binary code for the inner product, we will discover that one line of code in CUDA takes two instructions in the binary.
/* CUDA code for inner product */
accu <= accu + A_tile(threadIdx.y, k) * B_tile(k, threadIdx.x)

/* Disassembled from cubin binary */
mov.b32 $r0, s[$ofs4+0x0000]
mad.rn.f32 $r9, s[$ofs1+0x002c], $r0, $r9
The current architecture of the Streaming Multiprocessor (SM) only allows one source operand from shared memory. However, computing the inner product requires two source operands from shared memory. One solution is to store matrix A or matrix B in the register file, but then the matrix in the register file cannot be shared by different threads, which decreases the "computation-to-memory ratio". A better solution is to perform an outer product instead of the inner product. In this case, matrix A is stored in shared memory, but matrices B and C are stored in registers. The outer product does not require sharing of matrix B and matrix C; therefore, each thread only stores one element of B and one column of the tile of C in registers. The "computation-to-memory ratio" of the outer product is the same as that of the inner product.
/* CUDA code for outer product */
/* accu[i] and b are stored in the register file */
accu[i] <= accu[i] + A_tile(i) * b

/* Disassembled from cubin binary */
mad.rn.f32 $r9, s[$ofs2+0x0010], $r29, $r9
Here is an example of multiplying tile A0,0 and tile B0,0 to compute C0,0 using the outer product. In this example, A0,0 is 16x16, B0,0 is 16x64, and C0,0 is 16x64. A thread block of 64 threads computes C0,0.

Step 1: load A0,0 to shared memory.

Step 2: use 16 iterations to update C0,0.

Each thread stores one element of B0,0 in its register. Each thread also stores one column of C0,0 in its registers. Iteration 1: outer product between the first column of A0,0 and the first row of B0,0, and update C0,0. Iteration 2: outer product between the second column of A0,0 and the second row of B0,0, and update C0,0. Continue iterations 3, 4, ..., 15 in a similar way. Iteration 16: outer product between the 16th column of A0,0 and the 16th row of B0,0, and update C0,0.

Step 3: each thread stores one column of C0,0 from its register to global memory.

Loop Unrolling

Use the pragma to tell the compiler to unroll the loops. nvcc will unroll the inner loops by default, but it will not unroll the outer loop unless told to by the pragma.

#pragma unroll

Loop unrolling sometimes has side effects on register usage, which may limit the number of concurrent threads. However, loop unrolling does not increase register usage in the matrixMul example.

Prefetching

/* Codes running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
    __shared__ float A_tile0(blockDim.y, blockDim.x)
    __shared__ float A_tile1(blockDim.x, blockDim.y)
    float *pointer0 = A_tile0
    float *pointer1 = A_tile1
    fetch one tile of matrix A_gpu to pointer0
    __sync()
    /* Accumulate C tile by tile. */
    for tileIdx = 0 to (K/blockDim.x - 1) do
        prefetch one tile of matrix A_gpu to pointer1
        accumulate C using pointer0
        __sync()
        swap pointer0 and pointer1
    end
    store tile C to global memory
}
Example written by Zhenyu Ye (z.ye@tue.nl)