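Memory alignment explained from the memory's point of view
2012-03-16 22:08
Contents
Preface
1. Memory access granularity
Alignment fundamentals
Lazy processors
2. Speed (the basic principle of memory alignment)
Code explanation
Code from the Chinese sources and its memory layout
3. Possible consequences of not understanding memory alignment
4. Planning memory alignment
4.1 Reasons for memory alignment
4.2 Alignment rules
4.3 Experiments
5. Authors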
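Preface
This article is a compilation of four blog posts, not original work. It explains why memory alignment exists, what it does (from both Chinese and English sources), and how to plan for it. It is especially useful for working out sizeof on a struct.
It first explains the principle of memory alignment, then its effects, and ends with examples. The Chinese authors explain alignment in detail and walk through examples. One point deserves attention. Given
struct
{
    char a;
    int b;
    char c;
} A;
a occupies one byte of a four-byte unit and leaves three free. b needs four bytes, which the remaining three cannot hold, so b is placed in the next four-byte unit.
I think the most important point made by the second Chinese author (in the order of the links listed at the end) is that drawing the layout is a very good method.
The fourth and last post I cite uses detailed code to explain how alignment is planned.
In system-level or driver-level development, and in hard real-time or high-security programs, memory layout is still the foundation of a stable, secure and efficient program. Memory alignment is therefore worth mastering, at least well enough to hold your own in a discussion on CSDN.
The article may feel disorganized. If the material taken from the English article is hard to follow, skip straight to the parts taken from the Chinese posts, though few people read my blog anyway.
--QQ124045670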
1. Memory access granularity
Programmers are conditioned to think of memory as a simple array of bytes. Among C and its descendants, char* is ubiquitous as meaning "a block of memory", and even Java has its byte[] type to represent raw memory.
Figure 1. How programmers see memory
However, your computer's processor does not read from and write to memory in byte-sized chunks. Instead, it accesses memory in two-, four-, eight-, 16- or even 32-byte chunks. We'll call the size in which a processor accesses memory its memory access granularity.
Figure 2. How processors see memory
The difference between how high-level programmers think of memory and how modern processors actually work with memory raises interesting issues that this article explores.
If you don't understand and address alignment issues in your software, the following scenarios, in increasing order of severity, are all possible:
Your software will run slower.
Your application will lock up.
Your operating system will crash.
Your software will silently fail, yielding incorrect results.
Alignment fundamentals
To illustrate the principles behind alignment, examine a constant task, and how it's affected by a processor's memory access granularity. The task is simple: first read four bytes from address 0 into the processor's register. Then read four bytes from address 1 into the same register.
First examine what would happen on a processor with a one-byte memory access granularity:
Figure 3. Single-byte memory access granularity
This fits in with the naive programmer's model of how memory works: it takes the same four memory accesses to read from address 0 as it does from address 1. Now see what would happen on a processor with two-byte granularity, like the original 68000:
Figure 4. Double-byte memory access granularity
When reading from address 0, a processor with two-byte granularity takes half the number of memory accesses as a processor with one-byte granularity. Because each memory access entails a fixed amount of overhead, minimizing the number of accesses can really help performance.
However, notice what happens when reading from address 1. Because the address doesn't fall evenly on the processor's memory access boundary, the processor has extra work to do. Such an address is known as an unaligned address. Because address 1 is unaligned, a processor with two-byte granularity must perform an extra memory access, slowing down the operation.
Finally, examine what would happen on a processor with four-byte memory access granularity, like the 68030 or PowerPC 601:
Figure 5. Quad-byte memory access granularity
A processor with four-byte granularity can slurp up four bytes from an aligned address with one read. Also note that reading from an unaligned address doubles the access count.
Now that you understand the fundamentals behind aligned data access, you can explore some of the issues related to alignment.
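The access counts described above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not code from the original article: the function name access_count is made up for illustration, and it assumes an idealized processor that always fetches whole granularity-sized chunks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of granularity-sized memory accesses needed to read `size` bytes
 * starting at byte address `addr`. `granularity` must be a power of two.
 * Hypothetical helper for illustration only. */
size_t access_count(uintptr_t addr, size_t size, size_t granularity)
{
    assert(granularity != 0 && (granularity & (granularity - 1)) == 0);
    if (size == 0)
        return 0;
    uintptr_t first_chunk = addr / granularity;              /* chunk holding the first byte */
    uintptr_t last_chunk = (addr + size - 1) / granularity;  /* chunk holding the last byte */
    return (size_t)(last_chunk - first_chunk + 1);
}
```

Reading four bytes with four-byte granularity costs one access from address 0 and two from address 1, which is the doubling the text mentions.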
Lazy processors
A processor has to perform some tricks when instructed to access an unaligned address. Going back to the example of reading four bytes from address 1 on a processor with four-byte granularity, you can work out exactly what needs to be done:
Figure 6. How processors handle unaligned memory access
The processor needs to read the first chunk of the unaligned address and shift out the "unwanted" bytes from the first chunk. Then it needs to read the second chunk of the unaligned address and shift out some of its information. Finally, the two are merged together for placement in the register. It's a lot of work.
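The read, shift and merge steps can be imitated in software. This is only a sketch under stated assumptions: memory is modeled as a byte array, values are assembled in big-endian order (as on the 68K and PowerPC machines the article discusses), and the names read_chunk and load32_any are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one aligned four-byte chunk as a big-endian value.
 * `chunk_index` counts four-byte chunks from the start of `mem`. */
static uint32_t read_chunk(const uint8_t *mem, size_t chunk_index)
{
    const uint8_t *p = mem + chunk_index * 4;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Load four bytes from any byte address using only aligned chunk reads.
 * For an unaligned address the caller must make sure the following
 * chunk is also inside `mem`. */
uint32_t load32_any(const uint8_t *mem, size_t addr)
{
    assert(mem != NULL);
    size_t chunk = addr / 4;
    unsigned shift = (unsigned)(addr % 4) * 8;  /* bits to drop from the first chunk */
    uint32_t first = read_chunk(mem, chunk);
    if (shift == 0)
        return first;                           /* aligned: one access is enough */
    uint32_t second = read_chunk(mem, chunk + 1);
    /* Shift the unwanted leading bytes out of the first chunk and the
     * unwanted trailing bytes out of the second, then merge the two. */
    return (first << shift) | (second >> (32 - shift));
}
```

The unaligned path performs two chunk reads, two shifts and an OR, which is the extra work a processor either does in hardware or refuses to do.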
Some processors just aren't willing to do all of that work for you.
The original 68000 was a processor with two-byte granularity and lacked the circuitry to cope with unaligned addresses. When presented with such an address, the processor would throw an exception. The original Mac OS didn't take very kindly to this exception, and would usually demand the user restart the machine. Ouch.
Later processors in the 680x0 series, such as the 68020, lifted this restriction and performed the necessary work for you. This explains why some old software that works on the 68020 crashes on the 68000. It also explains why, way back when, some old Mac coders initialized pointers with odd addresses. On the original Mac, if the pointer was accessed without being reassigned to a valid address, the Mac would immediately drop into the debugger. Often they could then examine the calling chain stack and figure out where the mistake was.
All processors have a finite number of transistors to get work done. Adding unaligned address access support cuts into this "transistor budget." These transistors could otherwise be used to make other portions of the processor work faster, or add new functionality altogether.
An example of a processor that sacrifices unaligned address access support in the name of speed is MIPS. MIPS is a great example of a processor that does away with almost all frivolity in the name of getting real work done faster.
The PowerPC takes a hybrid approach. Every PowerPC processor to date has hardware support for unaligned 32-bit integer access. While you still pay a performance penalty for unaligned access, it tends to be small.
On the other hand, modern PowerPC processors lack hardware support for unaligned 64-bit floating-point access. When asked to load an unaligned floating-point number from memory, modern PowerPC processors will throw an exception and have the operating system perform the alignment chores in software. Performing alignment in software is much slower than performing it in hardware.
2. Speed (the basic principle of memory alignment)
One benefit of memory alignment is faster memory access. Every data structure occupies memory, and many systems require allocations to be aligned. The tests below use code to show why alignment makes access faster.
Code explanation
Writing some tests illustrates the performance penalties of unaligned memory access. The test is simple: you read, negate, and write back the numbers in a ten-megabyte buffer. These tests have two variables:
The size, in bytes, in which you process the buffer. First you'll process the buffer one byte at a time. Then you'll move on to two, four and eight bytes at a time.
The alignment of the buffer. You'll stagger the alignment of the buffer by incrementing the pointer to the buffer and running each test again.
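Staggering the buffer comes down to pointer arithmetic plus a way to confirm which boundary a pointer sits on. The sketch below is an assumption about how such a harness could be set up, not the article's test code; is_aligned and stagger are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero when `p` is a multiple of `alignment` (a power of two). */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Return a pointer `offset` bytes past an 8-byte aligned base, so the
 * same test can be rerun at each misalignment. The buffer behind
 * `aligned_base` must have at least `offset` spare bytes. */
void *stagger(void *aligned_base, size_t offset)
{
    assert(is_aligned(aligned_base, 8));
    return (uint8_t *)aligned_base + offset;
}
```

Running a test at offsets 0 through 7 covers every alignment that matters for accesses of up to eight bytes.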
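These tests were performed on an 800 MHz PowerBook G4. To help normalize performance fluctuations from interrupt processing, each test was run ten times, keeping the average of the runs. First up is the test that operates on a single byte at a time:
Listing 1. Munging data one byte at a time
void Munge8(void *data, uint32_t size) {
    uint8_t *data8 = (uint8_t *)data;
    uint8_t *data8End = data8 + size;

    while (data8 != data8End) {
        *data8 = -*data8;   /* Negate in place, then advance. */
        ++data8;
    }
}
It took an average of 67,364 microseconds to execute this function. Now modify it to work on two bytes at a time instead of one byte at a time, which will halve the number of memory accesses:
Listing 2. Munging data two bytes at a time
void Munge16(void *data, uint32_t size) {
    uint16_t *data16 = (uint16_t *)data;
    uint16_t *data16End = data16 + (size >> 1);           /* Divide size by 2. */
    uint8_t *data8 = (uint8_t *)data16End;
    uint8_t *data8End = data8 + (size & 0x00000001);      /* Strip upper 31 bits. */

    while (data16 != data16End) {
        *data16 = -*data16;
        ++data16;
    }
    while (data8 != data8End) {                           /* Leftover tail byte, if any. */
        *data8 = -*data8;
        ++data8;
    }
}
This function took 48,765 microseconds to process the same ten-megabyte buffer, 38% faster than Munge8. However, that buffer was aligned. If the buffer is unaligned, the time required increases to 66,385 microseconds, about a 27% speed penalty. The following chart illustrates the performance pattern of aligned memory accesses versus unaligned accesses:
Figure 7. Single-byte access versus double-byte access
The first thing you notice is that accessing memory one byte at a time is uniformly slow. The second item of interest is that when accessing memory two bytes at a time, whenever the address is not evenly divisible by two, that 27% speed penalty rears its ugly head.
Now up the ante, and process the buffer four bytes at a time:
Listing 3. Munging data four bytes at a time
void Munge32(void *data, uint32_t size) {
    uint32_t *data32 = (uint32_t *)data;
    uint32_t *data32End = data32 + (size >> 2);           /* Divide size by 4. */
    uint8_t *data8 = (uint8_t *)data32End;
    uint8_t *data8End = data8 + (size & 0x00000003);      /* Strip upper 30 bits. */

    while (data32 != data32End) {
        *data32 = -*data32;
        ++data32;
    }
    while (data8 != data8End) {                           /* Leftover tail bytes, if any. */
        *data8 = -*data8;
        ++data8;
    }
}
This function processes an aligned buffer in 43,043 microseconds and an unaligned buffer in 55,775 microseconds. Thus, on this test machine, accessing unaligned memory four bytes at a time is slower than accessing aligned memory two bytes at a time:
Figure 8. Single- versus double- versus quad-byte access
Now for the horror story: processing the buffer eight bytes at a time.
Listing 4. Munging data eight bytes at a time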
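Only the signature of Munge64 survives in this text, so the body below is a reconstruction rather than the original listing. It assumes the function follows the same pattern as the smaller variants and walks the buffer as an array of double, which is what the floating-point discussion around it suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Negate a buffer eight bytes at a time, then finish the tail bytewise.
 * Treating the data as doubles is an assumption based on the text's
 * remarks about unaligned floating-point access. */
void Munge64(void *data, uint32_t size)
{
    assert(data != NULL || size == 0);

    double *data64 = (double *)data;
    double *data64End = data64 + (size >> 3);             /* Divide size by 8. */
    uint8_t *data8 = (uint8_t *)data64End;
    uint8_t *data8End = data8 + (size & 0x00000007);      /* Keep the low 3 bits. */

    while (data64 != data64End) {
        *data64 = -*data64;
        ++data64;
    }
    while (data8 != data8End) {                           /* Leftover tail bytes, if any. */
        *data8 = (uint8_t)-*data8;
        ++data8;
    }
}
```

Passing a pointer that is not eight-byte aligned makes every double access unaligned, which is the case the timings that follow measure.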
Munge64 processes an aligned buffer in 39,085 microseconds, about 10% faster than processing the buffer four bytes at a time. However, processing an unaligned buffer takes an amazing 1,841,155 microseconds, two orders of magnitude slower than aligned access, an outstanding 4,610% performance penalty!
What happened? Because modern PowerPC processors lack hardware support for unaligned floating-point access, the processor throws an exception for each unaligned access. The operating system catches this exception and performs the alignment in software. Here's a chart illustrating the penalty, and when it occurs:
Figure 9. Multiple-byte access comparison
The penalties for one-, two- and four-byte unaligned access are dwarfed by the horrendous unaligned eight-byte penalty. Maybe this chart, removing the top (and thus the tremendous gulf between the two numbers), will be clearer:
Figure 10. Multiple-byte access comparison #2
There's another subtle insight hidden in this data. Compare eight-byte access speeds on four-byte boundaries:
Figure 11. Multiple-byte access comparison #3
Notice accessing memory eight bytes at a time on four- and twelve-byte boundaries is slower than reading the same memory four or even two bytes at a time. While PowerPCs have hardware support for four-byte aligned eight-byte doubles, you still pay a performance penalty if you use that support. Granted, it's nowhere near the 4,610% penalty, but it's certainly noticeable. Moral of the story: accessing memory in large chunks can be slower than accessing memory in small chunks, if that access is not aligned.
Atomicity
All modern processors offer atomic instructions. These special instructions are crucial for synchronizing two or more concurrent tasks. As the name implies, atomic instructions must be indivisible. That's why they're so handy for synchronization: they can't be preempted.
It turns out that in order for atomic instructions to perform correctly, the addresses you pass them must be at least four-byte aligned. This is because of a subtle interaction between atomic instructions and virtual memory.
If an address is unaligned, it requires at least two memory accesses. But what happens if the desired data spans two pages of virtual memory? This could lead to a situation where the first page is resident while the last page is not. Upon access, in the middle of the instruction, a page fault would be generated, executing the virtual memory management swap-in code, destroying the atomicity of the instruction. To keep things simple and correct, both the 68K and PowerPC require that atomically manipulated addresses always be at least four-byte aligned.
Unfortunately, the PowerPC does not throw an exception when atomically storing to an unaligned address. Instead, the store simply always fails. This is bad because most atomic functions are written to retry upon a failed store, under the assumption they were preempted. These two circumstances combine to where your program will go into an infinite loop if you attempt to atomically store to an unaligned address. Oops.
Altivec
Altivec is all about speed. Unaligned memory access slows down the processor and costs precious transistors. Thus, the Altivec engineers took a page from the MIPS playbook and simply don't support unaligned memory access. Because Altivec works with sixteen-byte chunks at a time, all addresses passed to Altivec must be sixteen-byte aligned. What's scary is what happens if your address is not aligned.
Altivec won't throw an exception to warn you about the unaligned address. Instead, Altivec simply ignores the lower four bits of the address and charges ahead, operating on the wrong address. This means your program may silently corrupt memory or return incorrect results if you don't explicitly make sure all your data is aligned.
There is an advantage to Altivec's bit-stripping ways. Because you don't need to explicitly truncate (align-down) an address, this behavior can save you an instruction or two when handing addresses to the processor.
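Ignoring the lower four bits is the same operation as aligning an address down to a sixteen-byte boundary. A minimal sketch in plain C follows; it uses no Altivec intrinsics, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Clear the low four bits of an address, giving the sixteen-byte
 * boundary at or below it. */
uintptr_t align_down_16(uintptr_t addr)
{
    return addr & ~(uintptr_t)0xF;
}

/* Nonzero when the address is already sixteen-byte aligned, i.e. when
 * truncation would not change which bytes are touched. */
int is_16_byte_aligned(uintptr_t addr)
{
    return align_down_16(addr) == addr;
}

/* Guard to call before handing a pointer to vector code that silently
 * truncates addresses. */
void require_aligned_16(const void *p)
{
    assert(is_16_byte_aligned((uintptr_t)p));
}
```

An address such as 0x1237 becomes 0x1230 after truncation, so a vector load would start seven bytes earlier than the caller intended.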
This is not to say Altivec can't process unaligned memory. You can find detailed instructions on how to do so in the Altivec Programming Environments Manual. It requires more work, but because memory is so slow compared to the processor, the overhead for such shenanigans is surprisingly low.
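Structure alignment
Examine the following structure:
Listing 5. An innocent structure
typedef struct {
    char a;
    long b;
    char c;
} Struct;
What is the size of this structure in bytes? Many programmers will answer "6 bytes." It makes sense: one byte for a, four bytes for b and another byte for c. 1 + 4 + 1 equals 6. Here's how it would lay out in memory:
Field Type | Field Name | Field Offset | Field Size | Field End
char | a | 0 | 1 | 1
long | b | 1 | 4 | 5
char | c | 5 | 1 | 6
Total Size in Bytes: | 6
However, if you were to ask your compiler for sizeof(Struct), chances are the answer you'd get back would be greater than six, perhaps eight or even twenty-four. There are two reasons for this: backwards compatibility and efficiency.
First, backwards compatibility. Remember the 68000 was a processor with two-byte memory access granularity, and would throw an exception upon encountering an odd address. If you were to read from or write to field b, you'd attempt to access an odd address. If a debugger weren't installed, the old Mac OS would throw up a System Error dialog box with one button: Restart. Yikes!
So, instead of laying out your fields just the way you wrote them, the compiler padded the structure so that b and c would reside at even addresses:
Field Type | Field Name | Field Offset | Field Size | Field End
char | a | 0 | 1 | 1
padding | | 1 | 1 | 2
long | b | 2 | 4 | 6
char | c | 6 | 1 | 7
padding | | 7 | 1 | 8
Total Size in Bytes: | 8
Padding is the act of adding otherwise unused space to a structure to make fields line up in a desired way. Now, when the 68020 came out with built-in hardware support for unaligned memory access, this padding was unnecessary. However, it didn't hurt anything, and it even helped a little in performance.
The second reason is efficiency. Nowadays, on PowerPC machines, two-byte alignment is nice, but four-byte or eight-byte is better. You probably don't care anymore that the original 68000 choked on unaligned structures, but you probably care about potential 4,610% performance penalties, which can happen if a double field doesn't sit aligned in a structure of your devising.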
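Rather than guessing at a structure's layout, you can ask the compiler where it placed each field. The sketch below uses the standard offsetof macro on the structure from Listing 5. The function name print_struct_layout is invented, and no particular output is claimed because the numbers depend on the compiler and target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    char a;
    long b;
    char c;
} Struct;

/* Print where the compiler actually placed each field. */
void print_struct_layout(void)
{
    /* Whatever padding is used, b must land on a boundary suitable for long. */
    assert(offsetof(Struct, b) % _Alignof(long) == 0);

    printf("a: offset %zu, size %zu\n", offsetof(Struct, a), sizeof(char));
    printf("b: offset %zu, size %zu\n", offsetof(Struct, b), sizeof(long));
    printf("c: offset %zu, size %zu\n", offsetof(Struct, c), sizeof(char));
    printf("sizeof(Struct) = %zu\n", sizeof(Struct));
}
```

The gap between the end of one field and the offset of the next is the padding the compiler inserted.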
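Code from the Chinese sources and its memory layout
The key to understanding memory alignment is to draw the layout, as the examples below do. The first example comes from http://www.cppblog.com/snailcong/archive/2009/03/16/76705.html.
A short program introduces the topic:
// Environment: VC6 + Windows SP2
// Program 1
#include <iostream>

using namespace std;

struct st1
{
    char a;
    int b;
    short c;
};

struct st2
{
    short c;
    char a;
    int b;
};

int main()
{
    cout << "sizeof(st1) is " << sizeof(st1) << endl;
    cout << "sizeof(st2) is " << sizeof(st2) << endl;
    return 0;
}

The program prints:
sizeof(st1) is 12
sizeof(st2) is 8
Here is the puzzle: the two structures hold the same members, so why does sizeof report different sizes?
The purpose of this section is to answer that question.
The answer is memory alignment: alignment is what makes the two results differ.
For most programmers alignment is essentially transparent. It is the compiler's job to place every data item at a suitable position, and as a result structures with the same members declared in a different order can have different sizes.
Why does the compiler align data at all? Intuitively, sizeof(st1) and sizeof(st2) in Program 1 should both be 7: 4 (int) + 2 (short) + 1 (char) = 7. After alignment the structures take up more space, not less.
Before looking at what alignment is for, here are the rules:
Rule 1: The first member of a structure sits at offset 0. The offset of every later member must be a multiple of min(the value given by #pragma pack(), the size of that member).
Rule 2: After the members are aligned, the structure (or union) itself is aligned to the smaller of the #pragma pack value and the size of its largest member.
#pragma pack(n) sets n-byte alignment. VC6 defaults to 8-byte alignment.
Applying the rules to Program 1:
st1: char takes one byte and starts at offset 0. int takes 4 bytes, and min(#pragma pack value, member size) = 4 (the VC6 default is 8), so int is aligned to 4 bytes and its offset must be a multiple of 4. It therefore starts at offset 4, and the compiler adds 3 padding bytes after the char that hold no data. short takes 2 bytes and is aligned to 2 bytes; its offset is 8, already a multiple of 2, so no padding is needed. That completes Rule 1, and memory now looks like this:
oxxx|oooo|oo
0123456789 (address)
(x marks a padding byte)
That is 10 bytes. The structure itself must still be aligned, to the smaller of the #pragma pack value and the size of its largest member. The largest member of st1 is the int at 4 bytes and the default #pragma pack value is 8, so the structure is aligned to 4 bytes. Its total size must be a multiple of 4, so 2 more padding bytes bring the total to 12. Memory now looks like this:
oxxx|oooo|ooxx
0123456789ab (address)
Alignment is complete. st1 occupies 12 bytes rather than 7.
st2 is aligned in the same way; working it out is left to the reader.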
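The two rules are mechanical enough to put into a small calculator, which also settles the st2 case. This is a sketch and not any compiler's actual algorithm: it assumes every member is a scalar whose natural alignment equals its size, and the name packed_struct_size is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Round `value` up to the next multiple of `align` (align > 0). */
static size_t round_up(size_t value, size_t align)
{
    return (value + align - 1) / align * align;
}

/* Size of a struct whose members have the given sizes, in declaration
 * order, under #pragma pack(pack). Assumes scalar members whose natural
 * alignment equals their size. */
size_t packed_struct_size(const size_t *member_sizes, size_t count, size_t pack)
{
    size_t offset = 0;
    size_t largest = 1;

    assert(pack > 0);
    for (size_t i = 0; i < count; ++i) {
        assert(member_sizes[i] > 0);
        size_t align = min_size(pack, member_sizes[i]);   /* Rule 1 */
        offset = round_up(offset, align);                 /* insert padding */
        offset += member_sizes[i];
        if (member_sizes[i] > largest)
            largest = member_sizes[i];
    }
    return round_up(offset, min_size(pack, largest));     /* Rule 2 */
}
```

With the sizes of st2 (2, 1, 4) and a pack value of 8, the members end at offset 8, which is already a multiple of 4, so the result is 8. For st1 (1, 4, 2) the members end at 10 and Rule 2 rounds that up to 12.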
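Here is another example, from http://www.cppblog.com/cc/archive/2006/08/01/10765.html
Memory alignment
Data structures and variables in a program all occupy memory. Many systems require memory to be aligned when it is allocated, and the benefit is faster memory access.
Start with a simple program:
Program 1
#include <iostream>
using namespace std;

struct X1
{
    int i;      // 4 bytes
    char c1;    // 1 byte
    char c2;    // 1 byte
};

struct X2
{
    char c1;    // 1 byte
    int i;      // 4 bytes
    char c2;    // 1 byte
};

struct X3
{
    char c1;    // 1 byte
    char c2;    // 1 byte
    int i;      // 4 bytes
};

int main()
{
    cout << "long " << sizeof(long) << "\n";
    cout << "float " << sizeof(float) << "\n";
    cout << "int " << sizeof(int) << "\n";
    cout << "char " << sizeof(char) << "\n";

    X1 x1;
    X2 x2;
    X3 x3;
    cout << "size of x1 " << sizeof(x1) << "\n";
    cout << "size of x2 " << sizeof(x2) << "\n";
    cout << "size of x3 " << sizeof(x3) << "\n";
    return 0;
}
The program is simple. It defines three structures, X1, X2 and X3, which differ only in the order of their members. It prints the size of several basic types and the size of each of the three structures.
The output is:
long 4
float 4
int 4
char 1
size of x1 8
size of x2 12
size of x3 8
The first four lines are unsurprising, but the last three show that the three structures occupy different amounts of space. The cause is the order of the members. How can that be?
This is where memory alignment comes in.
Memory is one contiguous block. Picture it divided into alignment units of 4 bytes, each unit being a row of four one-byte cells (Diagram 1).
Now look at how each of the three structures is laid out in memory.
First, X1. Its first member is an int, which takes 4 bytes and fills the first unit. The second member is a char, which takes one byte and occupies the first cell of the second unit. The third member is also a char and occupies the second cell of the second unit. Together the two chars fit inside one unit, so that is the whole layout. Because memory is aligned in units, the result is 8 rather than 6: the last two cells count as used as well.
Next, X2. Its first member is a char, which takes one byte and occupies the first cell of the first unit. The second member is an int, which takes 4 bytes. The first unit has only 3 cells left, which cannot hold an int, so alignment forces the int into the second unit. The third member is a char and behaves like the first, taking the first cell of a third unit. Because of unit alignment the structure occupies 12 cells rather than 8.
Finally, X3. The explanation is much the same as for X1, except that the two one-byte members come first. After the two cases above it should be easy to follow.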
3. Possible consequences of not understanding memory alignment
Your software may hit performance-killing unaligned memory access exceptions, which invoke very expensive alignment exception handlers.
Your application may attempt to atomically store to an unaligned address, causing your application to lock up.
Your application may attempt to pass an unaligned address to Altivec, resulting in Altivec reading from and/or writing to the wrong part of memory, silently corrupting data or yielding incorrect results.
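4. Planning memory alignment
4.1 Reasons for memory alignment
Most references give the following two reasons:
Platform (portability): Not every hardware platform can access arbitrary data at arbitrary addresses. Some platforms can fetch certain types of data only at certain addresses and raise a hardware exception otherwise.
Performance: Data structures (the stack in particular) should be aligned on natural boundaries wherever possible. Accessing unaligned memory takes the processor two memory accesses, while an aligned access takes only one.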
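When code has to read a value from an address that may not be aligned (a field inside a byte stream, for example), a common portable idiom is to copy the bytes with memcpy instead of dereferencing a cast pointer. The following is a minimal sketch of that idiom; the function name load_u32_unaligned is illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from an address with no alignment guarantee.
 * memcpy lets the compiler choose a safe instruction sequence for the
 * target instead of issuing a load that might fault on strict hardware. */
uint32_t load_u32_unaligned(const void *p)
{
    uint32_t value;
    assert(p != NULL);
    memcpy(&value, p, sizeof value);
    return value;
}
```

The value comes back in the machine's native byte order, the same as a direct load would give on a platform that tolerates the unaligned access.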
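4.2 Alignment rules
The compiler on each platform has its own default "alignment factor" (also called the alignment modulus). The programmer can change it with the directive #pragma pack(n), n = 1, 2, 4, 8, 16, where n is the alignment factor to use.
The rules:
Rule 1, member alignment: The first data member of a structure (struct) or union is placed at offset 0. Each later member is aligned to the smaller of the #pragma pack value and the member's own size.
Rule 2, overall alignment: After the members are aligned, the structure (or union) itself is aligned to the smaller of the #pragma pack value and the size of its largest member.
Rule 3, which follows from 1 and 2: When the #pragma pack value n is equal to or larger than the size of every member, n has no effect.
4.3 Experiments
The following examples work through the rules in detail.
Compilers: GCC 3.4.2, VC6.0
Platform: Windows XP
A typical struct alignment
The struct definition:
#pragma pack(n) /* n = 1, 2, 4, 8, 16 */
struct test_t {
    int a;
    char b;
    short c;
    char d;
};
#pragma pack()
First confirm the size of each type on the test platform. Both compilers report:
sizeof(char) = 1
sizeof(short) = 2
sizeof(int) = 4
The procedure: change the alignment factor with #pragma pack(n), then look at the value of sizeof(struct test_t).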
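Experiment 1: 1-byte alignment (#pragma pack(1))
Output: sizeof(struct test_t) = 8 [both compilers agree]
Analysis:
1) Member alignment
#pragma pack(1)
struct test_t {
    int a;      /* size 4 > 1, align to 1; start offset = 0, 0 % 1 = 0; occupies [0,3] */
    char b;     /* size 1 = 1, align to 1; start offset = 4, 4 % 1 = 0; occupies [4] */
    short c;    /* size 2 > 1, align to 1; start offset = 5, 5 % 1 = 0; occupies [5,6] */
    char d;     /* size 1 = 1, align to 1; start offset = 7, 7 % 1 = 0; occupies [7] */
};
#pragma pack()
Total member size = 8
2) Overall alignment
Overall alignment factor = min(max(int, short, char), 1) = 1
Overall size = total member size rounded up to a multiple of the overall alignment factor = 8 /* 8 % 1 = 0 */ [Note 1]
Experiment 2: 2-byte alignment (#pragma pack(2))
Output: sizeof(struct test_t) = 10 [both compilers agree]
Analysis:
1) Member alignment
#pragma pack(2)
struct test_t {
    int a;      /* size 4 > 2, align to 2; start offset = 0, 0 % 2 = 0; occupies [0,3] */
    char b;     /* size 1 < 2, align to 1; start offset = 4, 4 % 1 = 0; occupies [4] */
    short c;    /* size 2 = 2, align to 2; start offset = 6, 6 % 2 = 0; occupies [6,7] */
    char d;     /* size 1 < 2, align to 1; start offset = 8, 8 % 1 = 0; occupies [8] */
};
#pragma pack()
Total member size = 9
2) Overall alignment
Overall alignment factor = min(max(int, short, char), 2) = 2
Overall size = total member size rounded up to a multiple of the overall alignment factor = 10 /* 10 % 2 = 0 */
Experiment 3: 4-byte alignment (#pragma pack(4))
Output: sizeof(struct test_t) = 12 [both compilers agree]
Analysis:
1) Member alignment
#pragma pack(4)
struct test_t {
    int a;      /* size 4 = 4, align to 4; start offset = 0, 0 % 4 = 0; occupies [0,3] */
    char b;     /* size 1 < 4, align to 1; start offset = 4, 4 % 1 = 0; occupies [4] */
    short c;    /* size 2 < 4, align to 2; start offset = 6, 6 % 2 = 0; occupies [6,7] */
    char d;     /* size 1 < 4, align to 1; start offset = 8, 8 % 1 = 0; occupies [8] */
};
#pragma pack()
Total member size = 9
2) Overall alignment
Overall alignment factor = min(max(int, short, char), 4) = 4
Overall size = total member size rounded up to a multiple of the overall alignment factor = 12 /* 12 % 4 = 0 */
Experiment 4: 8-byte alignment (#pragma pack(8))
Output: sizeof(struct test_t) = 12 [both compilers agree]
Analysis:
1) Member alignment
#pragma pack(8)
struct test_t {
    int a;      /* size 4 < 8, align to 4; start offset = 0, 0 % 4 = 0; occupies [0,3] */
    char b;     /* size 1 < 8, align to 1; start offset = 4, 4 % 1 = 0; occupies [4] */
    short c;    /* size 2 < 8, align to 2; start offset = 6, 6 % 2 = 0; occupies [6,7] */
    char d;     /* size 1 < 8, align to 1; start offset = 8, 8 % 1 = 0; occupies [8] */
};
#pragma pack()
Total member size = 9
2) Overall alignment
Overall alignment factor = min(max(int, short, char), 8) = 4
Overall size = total member size rounded up to a multiple of the overall alignment factor = 12 /* 12 % 4 = 0 */
Experiment 5: 16-byte alignment (#pragma pack(16))
Output: sizeof(struct test_t) = 12 [both compilers agree]
Analysis:
1) Member alignment
#pragma pack(16)
struct test_t {
    int a;      /* size 4 < 16, align to 4; start offset = 0, 0 % 4 = 0; occupies [0,3] */
    char b;     /* size 1 < 16, align to 1; start offset = 4, 4 % 1 = 0; occupies [4] */
    short c;    /* size 2 < 16, align to 2; start offset = 6, 6 % 2 = 0; occupies [6,7] */
    char d;     /* size 1 < 16, align to 1; start offset = 8, 8 % 1 = 0; occupies [8] */
};
#pragma pack()
Total member size = 9
2) Overall alignment
Overall alignment factor = min(max(int, short, char), 16) = 4
Overall size = total member size rounded up to a multiple of the overall alignment factor = 12 /* 12 % 4 = 0 */
The 8-byte and 16-byte experiments confirm Rule 3: when the #pragma pack value n is equal to or larger than the size of every member, n has no effect.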
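The first three experiments can be repeated in a single translation unit by giving each packing its own copy of the structure. This sketch uses #pragma pack(push, n) and #pragma pack(pop), which GCC, Clang and MSVC accept. The struct names are invented for the example, and the expected sizes hold only where int is 4 bytes and short is 2 bytes, as on the test platform above.

```c
#include <assert.h>
#include <stddef.h>

#pragma pack(push, 1)
struct test_pack1 { int a; char b; short c; char d; };
#pragma pack(pop)

#pragma pack(push, 2)
struct test_pack2 { int a; char b; short c; char d; };
#pragma pack(pop)

#pragma pack(push, 4)
struct test_pack4 { int a; char b; short c; char d; };
#pragma pack(pop)

/* Checks the sizes reported in experiments 1 to 3.
 * Valid only on targets where int is 4 bytes and short is 2 bytes. */
void check_pack_sizes(void)
{
    assert(sizeof(struct test_pack1) == 8);
    assert(sizeof(struct test_pack2) == 10);
    assert(offsetof(struct test_pack2, c) == 6);   /* one padding byte after b */
    assert(sizeof(struct test_pack4) == 12);
}
```

Using push and pop rather than a bare #pragma pack(n) restores the previous packing afterwards, so the setting does not leak into headers included later.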
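Memory allocation and alignment are complicated. They depend closely on the implementation, and the rules differ between operating systems, compilers and hardware platforms. Most systems and languages today manage and allocate memory automatically and hide the low-level details, which makes applications much simpler to write, and programmers no longer need to think about allocation in detail. In system-level or driver-level development, however, and in hard real-time or high-security programs, memory layout is still the foundation of a stable, secure and efficient program.
[Note 1]
What does "rounding up" mean here?
An example: in the overall alignment step of the 8-byte experiment above, the total member size is 9, and rounding up to a multiple of 4 gives 12.
The procedure: start from 9 and add one at a time, checking whether the result is divisible by 4. 9, 10 and 11 are not divisible by 4 and 12 is, so rounding stops at 12.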
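The rounding step translates directly into code. The sketch below gives the add-one-until-divisible procedure as described, plus the usual closed form that yields the same result. Both function names are chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Step-by-step version, mirroring the description: add one until the
 * size is divisible by the alignment. */
size_t round_up_loop(size_t size, size_t align)
{
    assert(align > 0);
    while (size % align != 0)
        ++size;
    return size;
}

/* Closed-form equivalent for any align > 0. */
size_t round_up(size_t size, size_t align)
{
    assert(align > 0);
    return (size + align - 1) / align * align;
}
```

For a size of 9 and an alignment of 4, both functions return 12, matching the worked example.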
5. Authors
Jonathan Rentzsch, http://www.ibm.com/developerworks/library/pa-dalign/
http://www.cppblog.com/cc/archive/2006/08/01/10765.html (an excellent explanation in Chinese)
http://www.cppblog.com/snailcong/archive/2009/03/16/76705.html (a digest of the English article; see that blog)
http://blogold.chinaunix.net/u3/118340/showart_2615855.html
一内存读取粒度
Memoryaccessgranularity
从内存的角度解释内存对齐的原理
Programmersareconditionedtothinkofmemoryasasimplearrayofbytes.AmongCanditsdescendants,
char*isubiquitousasmeaning"ablockofmemory",andevenJava?hasits
byte[]typetorepresentrawmemory.
Figure1.Howprogrammers
seememory
However,yourcomputer'sprocessordoesnotreadfromandwritetomemoryinbyte-sizedchunks.Instead,itaccessesmemoryintwo-,four-,eight-16-oreven32-bytechunks.We'llcallthesizeinwhichaprocessoraccessesmemory
itsmemoryaccessgranularity.
Figure2.Howprocessors
seememory
Thedifferencebetweenhowhigh-levelprogrammersthinkofmemoryandhowmodernprocessorsactuallyworkwithmemoryraisesinterestingissuesthatthisarticleexplores.
Ifyoudon'tunderstandandaddressalignmentissuesinyoursoftware,thefollowingscenarios,inincreasingorderofseverity,areallpossible:
Yoursoftwarewillrunslower.
Yourapplicationwilllockup.
Youroperatingsystemwillcrash.
Yoursoftwarewillsilentlyfail,yieldingincorrectresults.
队列原理
Alignmentfundamentals
Toillustratetheprinciplesbehindalignment,examineaconstanttask,andhowit'saffectedbyaprocessor'smemoryaccessgranularity.Thetaskissimple:firstreadfourbytesfromaddress0intotheprocessor'sregister.Then
readfourbytesfromaddress1intothesameregister.
Firstexaminewhatwouldhappenonaprocessorwithaone-bytememoryaccessgranularity:
Figure3.Single-byte
memoryaccessgranularity
Thisfitsinwiththenaiveprogrammer'smodelofhowmemoryworks:ittakesthesamefourmemoryaccessestoreadfromaddress0asitdoesfromaddress1.Nowseewhatwouldhappenonaprocessorwithtwo-bytegranularity,like
theoriginal68000:
Figure4.Double-byte
memoryaccessgranularity
Whenreadingfromaddress0,aprocessorwithtwo-bytegranularitytakeshalfthenumberofmemoryaccessesasaprocessorwithone-bytegranularity.Becauseeachmemoryaccessentailsafixedamountoverhead,minimizingthenumber
ofaccessescanreallyhelpperformance.
However,noticewhathappenswhenreadingfromaddress1.Becausetheaddressdoesn'tfallevenlyontheprocessor'smemoryaccessboundary,theprocessorhasextraworktodo.Suchanaddressisknownasan
unalignedaddress.Becauseaddress1isunaligned,aprocessorwithtwo-bytegranularitymustperformanextramemoryaccess,slowingdowntheoperation.
Finally,examinewhatwouldhappenonaprocessorwithfour-bytememoryaccessgranularity,likethe68030orPowerPC?601:
Figure5.Quad-bytememory
accessgranularity
Aprocessorwithfour-bytegranularitycanslurpupfourbytesfromanalignedaddresswithoneread.Alsonotethatreadingfromanunalignedaddressdoublestheaccesscount.
Nowthatyouunderstandthefundamentalsbehindaligneddataaccess,youcanexploresomeoftheissuesrelatedtoalignment.
Lazyprocessors
Aprocessorhastoperformsometrickswheninstructedtoaccessanunalignedaddress.Goingbacktotheexampleofreadingfourbytesfromaddress1onaprocessorwithfour-bytegranularity,youcanworkoutexactlywhatneeds
tobedone:
Figure6.Howprocessors
handleunalignedmemoryaccess
Theprocessorneedstoreadthefirstchunkoftheunalignedaddressandshiftoutthe"unwanted"bytesfromthefirstchunk.Thenitneedstoreadthesecondchunkoftheunalignedaddressandshiftoutsomeofitsinformation.
Finally,thetwoaremergedtogetherforplacementintheregister.It'salotofwork.
Someprocessorsjustaren'twillingtodoallofthatworkforyou.
Theoriginal68000wasaprocessorwithtwo-bytegranularityandlackedthecircuitrytocopewithunalignedaddresses.Whenpresentedwithsuchanaddress,theprocessorwouldthrowanexception.TheoriginalMacOSdidn'ttake
verykindlytothisexception,andwouldusuallydemandtheuserrestartthemachine.Ouch.
Laterprocessorsinthe680x0series,suchasthe68020,liftedthisrestrictionandperformedthenecessaryworkforyou.Thisexplainswhysomeoldsoftwarethatworksonthe68020crashesonthe68000.Italsoexplainswhy,way
backwhen,someoldMaccodersinitializedpointerswithoddaddresses.OntheoriginalMac,ifthepointerwasaccessedwithoutbeingreassignedtoavalidaddress,theMacwouldimmediatelydropintothedebugger.Oftentheycouldthenexaminethecalling
chainstackandfigureoutwherethemistakewas.
Allprocessorshaveafinitenumberoftransistorstogetworkdone.Addingunalignedaddressaccesssupportcutsintothis"transistorbudget."Thesetransistorscouldotherwisebeusedtomakeotherportionsoftheprocessorwork
faster,oraddnewfunctionalityaltogether.
AnexampleofaprocessorthatsacrificesunalignedaddressaccesssupportinthenameofspeedisMIPS.MIPSisagreatexampleofaprocessorthatdoesawaywithalmostallfrivolityinthenameofgettingrealworkdonefaster.
ThePowerPCtakesahybridapproach.EveryPowerPCprocessortodatehashardwaresupportforunaligned32-bitintegeraccess.Whileyoustillpayaperformancepenaltyforunalignedaccess,ittendstobesmall.
Ontheotherhand,modernPowerPCprocessorslackhardwaresupportforunaligned64-bitfloating-pointaccess.Whenaskedtoloadanunalignedfloating-pointnumberfrommemory,modernPowerPCprocessorswillthrowanexception
andhavetheoperatingsystemperformthealignmentchores
insoftware.Performingalignmentinsoftwareis
muchslowerthanperformingitinhardware.
二
速度Speed(内存对齐的基本原理)
内存对齐有一个好处是提高访问内存的速度,因为在许多数据结构中都需要占用内存,在很多系统中,要求内存分配的时候要对齐.下面是对为什么可以提高内存速度通过代码做了解释!
代码解释
Writingsometestsillustratestheperformancepenaltiesofunalignedmemoryaccess.Thetestissimple:youread,negate,andwritebackthenumbersinaten-megabytebuffer.Thesetestshavetwovariables:
Thesize,inbytes,inwhichyouprocessthebuffer.Firstyou'llprocessthebufferonebyteatatime.Thenyou'llmoveontotwo-,four-andeight-bytesatatime.
Thealignmentofthebuffer.You'llstaggerthealignmentofthebufferbyincrementingthepointertothebufferandrunningeachtestagain.
Thesetestswereperformedona800MHzPowerBookG4.Tohelpnormalizeperformancefluctuationsfrominterruptprocessing,eachtestwasruntentimes,keepingtheaverageoftheruns.Firstupisthetestthatoperatesonasingle
byteatatime:
Listing1.Mungingdataonebyteatatime
voidMunge8(void*data,uint32_tsize){ uint8_t*data8=(uint8_t*)data; uint8_t*data8End=data8+size; while(data8!=data8End){ *data8++=-*data8; } }
Ittookanaverageof67,364microsecondstoexecutethisfunction.Nowmodifyittoworkontwobytesatatimeinsteadofonebyteatatime--whichwillhalvethenumberofmemoryaccesses:
Listing2.Mungingdata
twobytesatatime
voidMunge16(void*data,uint32_tsize){ uint16_t*data16=(uint16_t*)data; uint16_t*data16End=data16+(size>>1);/*Dividesizeby2.*/ uint8_t*data8=(uint8_t*)data16End; uint8_t*data8End=data8+(size&0x00000001);/*Stripupper31bits.*/ while(data16!=data16End){ *data16++=-*data16; } while(data8!=data8End){ *data8++=-*data8; } }
Thisfunctiontook48,765microsecondstoprocessthesameten-megabytebuffer--38%fasterthanMunge8.However,thatbufferwasaligned.Ifthebufferisunaligned,thetimerequiredincreasesto66,385microseconds--about
a27%speedpenalty.Thefollowingchartillustratestheperformancepatternofalignedmemoryaccessesversusunalignedaccesses:
Figure7.Single-byte
accessversusdouble-byteaccess
Thefirstthingyounoticeisthataccessingmemoryonebyteatatimeisuniformlyslow.Theseconditemofinterestisthatwhenaccessingmemorytwobytesatatime,whenevertheaddressisnotevenlydivisiblebytwo,that27%
speedpenaltyrearsitsuglyhead.
Nowuptheante,andprocessthebufferfourbytesatatime:
Listing3.Mungingdata
fourbytesatatime
voidMunge16(void*data,uint32_tsize){ uint16_t*data16=(uint16_t*)data; uint16_t*data16End=data16+(size>>1);/*Dividesizeby2.*/ uint8_t*data8=(uint8_t*)data16End; uint8_t*data8End=data8+(size&0x00000001);/*Stripupper31bits.*/ while(data16!=data16End){ *data16++=-*data16; } while(data8!=data8End){ *data8++=-*data8; } }
Thisfunctionprocessesanalignedbufferin43,043microsecondsandanunalignedbufferin55,775microseconds,respectively.Thus,onthistestmachine,accessingunalignedmemoryfourbytesatatimeis
slowerthanaccessingalignedmemorytwobytesatatime:
Figure8.Single-versus
double-versusquad-byteaccess
Nowforthehorrorstory:processingthebuffereightbytesatatime.
Listing4.Mungingdata
eightbytesatatime
voidMunge32(void*data,uint32_tsize){ uint32_t*data32=(uint32_t*)data; uint32_t*data32End=data32+(size>>2);/*Dividesizeby4.*/ uint8_t*data8=(uint8_t*)data32End; uint8_t*data8End=data8+(size&0x00000003);/*Stripupper30bits.*/ while(data32!=data32End){ *data32++=-*data32; } while(data8!=data8End){ *data8++=-*data8; } }
Munge64processesanalignedbufferin39,085microseconds--about10%fasterthanprocessingthebufferfourbytesatatime.However,processinganunalignedbuffer
takesanamazing1,841,155microseconds--twoordersofmagnitudeslowerthanalignedaccess,anoutstanding4,610%performancepenalty!
Whathappened?BecausemodernPowerPCprocessorslackhardwaresupportforunalignedfloating-pointaccess,theprocessorthrowsanexception
foreachunalignedaccess.Theoperatingsystemcatchesthisexceptionandperformsthealignmentinsoftware.Here'sachartillustratingthepenalty,andwhenitoccurs:
Figure9.Multiple-byte
accesscomparison
Thepenaltiesforone-,two-andfour-byteunalignedaccessaredwarfedbythehorrendousunalignedeight-bytepenalty.Maybethischart,removingthetop(andthusthetremendousgulfbetweenthetwonumbers),willbeclearer:
Figure10.Multiple-byte
accesscomparison#2
There'sanothersubtleinsighthiddeninthisdata.Compareeight-byteaccessspeedsonfour-byteboundaries:
Figure11.Multiple-byte
accesscomparison#3
Noticeaccessingmemoryeightbytesatatimeonfour-andtwelve-byteboundaries
isslowerthanreadingthesamememoryfouroreventwobytesatatime.WhilePowerPCshavehardwaresupportforfour-bytealignedeight-bytedoubles,youstillpayaperformancepenaltyifyouusethatsupport.Granted,it's
nowherenearthe4,610%penalty,butit'scertainlynoticeable.Moralofthestory:accessingmemoryinlargechunkscanbeslowerthanaccessingmemoryinsmallchunks,ifthataccessisnotaligned.
Atomicity
Allmodernprocessorsofferatomicinstructions.Thesespecialinstructionsarecrucialforsynchronizingtwoormoreconcurrenttasks.Asthenameimplies,atomicinstructionsmustbe
indivisible--that'swhythey'resohandyforsynchronization:theycan'tbepreempted.
Itturnsoutthatinorderforatomicinstructionstoperformcorrectly,theaddressesyoupassthemmustbeatleastfour-bytealigned.Thisisbecauseofasubtleinteractionbetweenatomicinstructionsandvirtualmemory.
Ifanaddressisunaligned,itrequiresatleasttwomemoryaccesses.Butwhathappensifthedesireddataspanstwopagesofvirtualmemory?Thiscouldleadtoasituationwherethefirstpageisresidentwhilethelastpageis
not.Uponaccess,inthemiddleoftheinstruction,apagefaultwouldbegenerated,executingthevirtualmemorymanagementswap-incode,destroyingtheatomicityoftheinstruction.Tokeepthingssimpleandcorrect,boththe68KandPowerPCrequirethat
atomicallymanipulatedaddressesalwaysbeatleastfour-bytealigned.
Unfortunately,thePowerPCdoesnotthrowanexceptionwhenatomicallystoringtoanunalignedaddress.Instead,thestoresimplyalwaysfails.Thisisbadbecausemostatomicfunctionsarewrittentoretryuponafailedstore,
undertheassumptiontheywerepreempted.Thesetwocircumstancescombinetowhereyourprogramwillgointoaninfiniteloopifyouattempttoatomicallystoretoanunalignedaddress.Oops.
Altivec
Altivecisallaboutspeed.Unalignedmemoryaccessslowsdowntheprocessorandcostsprecioustransistors.Thus,theAltivecengineerstookapagefromtheMIPSplaybookandsimplydon'tsupportunalignedmemoryaccess.Because
Altivecworkswithsixteen-bytechunksatatime,alladdressespassedtoAltivecmustbesixteen-bytealigned.What'sscaryiswhathappensifyouraddressisnotaligned.
Altivecwon'tthrowanexceptiontowarnyouabouttheunalignedaddress.Instead,Altivecsimplyignoresthelowerfourbitsoftheaddressandchargesahead,
operatingonthewrongaddress.Thismeansyourprogrammaysilentlycorruptmemoryorreturnincorrectresultsifyoudon'texplicitlymakesureallyourdataisaligned.
ThereisanadvantagetoAltivec'sbit-strippingways.Becauseyoudon'tneedtoexplicitlytruncate(align-down)anaddress,thisbehaviorcansaveyouaninstructionortwowhenhandingaddressestotheprocessor.
ThisisnottosayAltiveccan'tprocessunalignedmemory.Youcanfinddetailedinstructionshowtodosoonthe
AltivecProgrammingEnvironmentsManual(see
theprocessor,theoverheadforsuchshenanigansissurprisinglylow.
Structure alignment
Examine the following structure:
Listing 5. An innocent structure
typedef struct {
    char a;
    long b;
    char c;
} Struct;
What is the size of this structure in bytes? Many programmers will answer "6 bytes." It makes sense: one byte for a, four bytes for b and another byte for c. 1 + 4 + 1 equals 6. Here's how it would lay out in memory:
Field Type | Field Name | Field Offset | Field Size | Field End |
char | a | 0 | 1 | 1 |
long | b | 1 | 4 | 5 |
char | c | 5 | 1 | 6 |
Total Size in Bytes: | 6 |
If you ask your compiler for sizeof(Struct), chances are the answer you'd get back would be greater than six, perhaps eight or even twenty-four. There are two reasons for this: backwards compatibility and efficiency.
First, backwards compatibility. Remember the 68000 was a processor with two-byte memory access granularity, and would throw an exception upon encountering an odd address. If you were to read from or write to field b, you'd attempt to access an odd address. If a debugger weren't installed, the old Mac OS would throw up a System Error dialog box with one button: Restart. Yikes!
So, instead of laying out your fields just the way you wrote them, the compiler padded the structure so that b and c would reside at even addresses:
Field Type | Field Name | Field Offset | Field Size | Field End |
char | a | 0 | 1 | 1 |
padding | - | 1 | 1 | 2 |
long | b | 2 | 4 | 6 |
char | c | 6 | 1 | 7 |
padding | - | 7 | 1 | 8 |
Total Size in Bytes: | 8 |
… this padding was unnecessary. However, it didn't hurt anything, and it even helped a little in performance.
The second reason is efficiency. Nowadays, on PowerPC machines, two-byte alignment is nice, but four-byte or eight-byte is better. You probably don't care anymore that the original 68000 choked on unaligned structures, but you probably care about potential 4,610% performance penalties, which can happen if a double field doesn't sit aligned in a structure of your devising.
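Code examples and their memory layout (from the Chinese sources)
The key to understanding memory alignment is to draw a picture! The examples below show how.
If the English text is hard to follow, the Chinese explanation covers the same ground (for example http://www.cppblog.com/snailcong/archive/2009/03/16/76705.html).
We start with a small program: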
// Environment: VC6 + Windows SP2
// Program 1
#include <iostream>

using namespace std;

struct st1
{
    char a;
    int b;
    short c;
};

struct st2
{
    short c;
    char a;
    int b;
};

int main()
{
    cout << "sizeof(st1) is " << sizeof(st1) << endl;
    cout << "sizeof(st2) is " << sizeof(st2) << endl;
    return 0;
}
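The program prints:
sizeof(st1) is 12
sizeof(st2) is 8
Here is the puzzle: the two structs contain exactly the same members, so why does sizeof report different sizes?
The main purpose of this article is to explain that.
The answer is memory alignment: it is alignment that makes the two results differ.
For most programmers memory alignment is largely transparent. It is the compiler's job. The compiler places every data item in the program at a suitable position, and as a result structs with the same members declared in a different order can have different sizes.
So why does the compiler align memory at all? Naively, sizeof(st1) and sizeof(st2) in Program 1 should both be 7: 4 (int) + 2 (short) + 1 (char) = 7. After alignment the structs actually take more space.
Before explaining what alignment is good for, let's look at the rules:
Rule 1: The first member of a struct sits at offset 0. The offset of every later member must be a multiple of min(the value given to #pragma pack(), the size of that member).
Rule 2: After all members have been aligned, the struct (or union) itself must also be aligned. Its total size is aligned to the smaller of the #pragma pack value and the size of its largest member.
#pragma pack(n) sets n-byte alignment. VC6 defaults to 8-byte alignment.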
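Let's apply the rules to Program 1:
st1: the char takes one byte and starts at offset 0. The int takes 4 bytes, and min(the #pragma pack value, the size of the member) = 4 (VC6 defaults to 8-byte alignment), so the int is aligned to 4 bytes. Its starting offset must be a multiple of 4, so it starts at offset 4, and the compiler inserts 3 extra bytes after the char that hold no data. The short takes 2 bytes and is aligned to 2 bytes. Its starting offset is 8, which is already a multiple of 2, so no extra bytes are needed. That completes member alignment under Rule 1. Memory now looks like this:
oxxx|oooo|oo
0123456789 (address)
(x marks an extra byte added as padding)
That is 10 bytes in total. Next the struct itself must be aligned to the smaller of the #pragma pack value and the size of its largest member. The largest member of st1 is the int at 4 bytes, and the default #pragma pack value is 8, so the struct itself is aligned to 4 bytes. Its total size must be a multiple of 4, so 2 extra bytes are added to bring it to 12. Memory now looks like this:
oxxx|oooo|ooxx
0123456789ab (address)
Alignment is now complete. st1 occupies 12 bytes, not 7.
st2 is aligned in the same way, and the reader can work it out as an exercise.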
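Here is another example.
Memory alignment
In our programs, data structures and variables all occupy memory. Many systems require memory to be aligned when it is allocated, and the benefit is faster memory access.
Again we start with a short program:
Program 1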
#include <iostream>
using namespace std;

struct X1
{
    int i;    // 4 bytes
    char c1;  // 1 byte
    char c2;  // 1 byte
};

struct X2
{
    char c1;  // 1 byte
    int i;    // 4 bytes
    char c2;  // 1 byte
};

struct X3
{
    char c1;  // 1 byte
    char c2;  // 1 byte
    int i;    // 4 bytes
};

int main()
{
    cout << "long " << sizeof(long) << "\n";
    cout << "float " << sizeof(float) << "\n";
    cout << "int " << sizeof(int) << "\n";
    cout << "char " << sizeof(char) << "\n";

    X1 x1;
    X2 x2;
    X3 x3;
    cout << "size of x1 " << sizeof(x1) << "\n";
    cout << "size of x2 " << sizeof(x2) << "\n";
    cout << "size of x3 " << sizeof(x3) << "\n";
    return 0;
}
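The program is simple. It defines three structs, X1, X2 and X3, which differ only in the order of their members. It prints the size in bytes of several basic types and then the size of each of the three structs.
The output is:
long 4
float 4
int 4
char 1
size of x1 8
size of x2 12
size of x3 8
The first four lines are unsurprising, but the last three show that the three structs take different amounts of space. The cause is the order of the members. How can that be?
This is where memory alignment comes in.
Memory is one contiguous block. We can picture it as in the figure below, divided into alignment units of 4 bytes:
Figure 1
Now look at how the three structs are laid out in memory.
First X1, as shown in the figure:
The first member of X1 is an int, which takes 4 bytes, so the first four cells are full. The second member is a char, which takes only one byte, so it occupies the first cell of the second 4-byte block. The third member is also a char, so it takes one byte and goes into the second cell of the second block. The two chars together do not exceed one block, so that is the complete layout. Because memory is aligned in blocks, the result is 8 rather than 6: the last two cells count as used too.
Next X2, as shown in the figure:
The first member of X2 is a char, which takes one byte, so it goes into the first cell of the first block. The second member is an int, which takes 4 bytes. The first block has one cell used and three left, which cannot hold the int, so because of alignment it has to go into the second block. The third member is a char and is handled like the first, in a block of its own. Because of block alignment the struct occupies 12 cells instead of 8.
Finally X3, as shown in the figure:
X3 works much like X1, except that the two one-byte members come first. After the two cases above it should be easy to follow.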
Many people know that the size difference is caused by memory alignment, but few explain the underlying principle. The author quoted above does exactly that.
Part 3. Possible consequences of not understanding memory alignment
Your software may hit performance-killing unaligned memory access exceptions, which invoke very expensive alignment exception handlers.
Your application may attempt to atomically store to an unaligned address, causing your application to lock up.
Your application may attempt to pass an unaligned address to Altivec, resulting in Altivec reading from and/or writing to the wrong part of memory, silently corrupting data or yielding incorrect results.
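Part 4. Memory alignment rules
1. Reasons for memory alignment
Most references give the following two reasons:
(1) Platform (portability) reasons: not every hardware platform can access arbitrary data at arbitrary addresses. Some platforms can only fetch certain types of data at certain addresses and raise a hardware exception otherwise.
(2) Performance reasons: data structures (especially the stack) should be aligned on natural boundaries wherever possible. To access unaligned memory the processor has to make two memory accesses, whereas an aligned access needs only one.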
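2. Alignment rules
The compiler on each platform has its own default "alignment factor" (also called the alignment modulus). The programmer can change it with the directive #pragma pack(n), n = 1, 2, 4, 8, 16, where n is the alignment factor you want.
Rules:
Rule 1 (member alignment): for the data members of a struct (or union), the first member is placed at offset 0. Each later member is aligned to the smaller of the #pragma pack value and the size of that member.
Rule 2 (overall alignment): after the members have been aligned, the struct (or union) itself must also be aligned, to the smaller of the #pragma pack value and the size of its largest member.
Rule 3 (follows from 1 and 2): when the n given to #pragma pack equals or exceeds the size of every data member, the value of n has no effect.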
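3. Experiment
The examples below walk through the rules in detail to confirm them.
Compilers: GCC 3.4.2, VC 6.0
Platform: Windows XP
A typical struct alignment
The struct definition:
#pragma pack(n) /* n = 1, 2, 4, 8, 16 */
struct test_t {
    int a;
    char b;
    short c;
    char d;
};
#pragma pack()
First we confirm the size of each type on the test platform. Both compilers report:
sizeof(char) = 1
sizeof(short) = 2
sizeof(int) = 4
The procedure: change the alignment factor with #pragma pack(n), then look at the value of sizeof(struct test_t).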
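Experiment 1: 1-byte alignment (#pragma pack(1))
Output: sizeof(struct test_t) = 8 [both compilers agree]
Analysis:
1) Member alignment
#pragma pack(1)
struct test_t {
    int a;   /* size 4 > 1, align to 1; start offset = 0, 0 % 1 = 0; occupies [0,3] */
    char b;  /* size 1 = 1, align to 1; start offset = 4, 4 % 1 = 0; occupies [4] */
    short c; /* size 2 > 1, align to 1; start offset = 5, 5 % 1 = 0; occupies [5,6] */
    char d;  /* size 1 = 1, align to 1; start offset = 7, 7 % 1 = 0; occupies [7] */
};
#pragma pack()
Total member size = 8
2) Overall alignment
Overall alignment factor = min(max(int, short, char), 1) = 1
Overall size = (total member size) rounded up to a multiple of (overall alignment factor) = 8 /* 8 % 1 = 0 */ [Note 1]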
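Experiment 2: 2-byte alignment (#pragma pack(2))
Output: sizeof(struct test_t) = 10 [both compilers agree]
Analysis:
1) Member alignment
#pragma pack(2)
struct test_t {
    int a;   /* size 4 > 2, align to 2; start offset = 0, 0 % 2 = 0; occupies [0,3] */
    char b;  /* size 1 < 2, align to 1; start offset = 4, 4 % 1 = 0; occupies [4] */
    short c; /* size 2 = 2, align to 2; start offset = 6, 6 % 2 = 0; occupies [6,7] */
    char d;  /* size 1 < 2, align to 1; start offset = 8, 8 % 1 = 0; occupies [8] */
};
#pragma pack()
Total member size = 9
2) Overall alignment
Overall alignment factor = min(max(int, short, char), 2) = 2
Overall size = (total member size) rounded up to a multiple of (overall alignment factor) = 10 /* 10 % 2 = 0 */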
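Experiment 3: 4-byte alignment (#pragma pack(4))
Output: sizeof(struct test_t) = 12 [both compilers agree]
Analysis:
1) Member alignment
#pragma pack(4)
struct test_t {
    int a;   /* size 4 = 4, align to 4; start offset = 0, 0 % 4 = 0; occupies [0,3] */
    char b;  /* size 1 < 4, align to 1; start offset = 4, 4 % 1 = 0; occupies [4] */
    short c; /* size 2 < 4, align to 2; start offset = 6, 6 % 2 = 0; occupies [6,7] */
    char d;  /* size 1 < 4, align to 1; start offset = 8, 8 % 1 = 0; occupies [8] */
};
#pragma pack()
Total member size = 9
2) Overall alignment
Overall alignment factor = min(max(int, short, char), 4) = 4
Overall size = (total member size) rounded up to a multiple of (overall alignment factor) = 12 /* 12 % 4 = 0 */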
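Experiment 4: 8-byte alignment (#pragma pack(8))
Output: sizeof(struct test_t) = 12 [both compilers agree]
Analysis:
1) Member alignment
#pragma pack(8)
struct test_t {
    int a;   /* size 4 < 8, align to 4; start offset = 0, 0 % 4 = 0; occupies [0,3] */
    char b;  /* size 1 < 8, align to 1; start offset = 4, 4 % 1 = 0; occupies [4] */
    short c; /* size 2 < 8, align to 2; start offset = 6, 6 % 2 = 0; occupies [6,7] */
    char d;  /* size 1 < 8, align to 1; start offset = 8, 8 % 1 = 0; occupies [8] */
};
#pragma pack()
Total member size = 9
2) Overall alignment
Overall alignment factor = min(max(int, short, char), 8) = 4
Overall size = (total member size) rounded up to a multiple of (overall alignment factor) = 12 /* 12 % 4 = 0 */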
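Experiment 5: 16-byte alignment (#pragma pack(16))
Output: sizeof(struct test_t) = 12 [both compilers agree]
Analysis:
1) Member alignment
#pragma pack(16)
struct test_t {
    int a;   /* size 4 < 16, align to 4; start offset = 0, 0 % 4 = 0; occupies [0,3] */
    char b;  /* size 1 < 16, align to 1; start offset = 4, 4 % 1 = 0; occupies [4] */
    short c; /* size 2 < 16, align to 2; start offset = 6, 6 % 2 = 0; occupies [6,7] */
    char d;  /* size 1 < 16, align to 1; start offset = 8, 8 % 1 = 0; occupies [8] */
};
#pragma pack()
Total member size = 9
2) Overall alignment
Overall alignment factor = min(max(int, short, char), 16) = 4
Overall size = (total member size) rounded up to a multiple of (overall alignment factor) = 12 /* 12 % 4 = 0 */
The 8-byte and 16-byte experiments confirm Rule 3: when the n given to #pragma pack equals or exceeds the size of every data member, the value of n has no effect.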
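Memory allocation and memory alignment are complicated topics. They depend heavily on the implementation, and the rules differ between operating systems, compilers and hardware platforms. Most systems and languages today manage and allocate memory automatically and hide the low-level details, which makes application code much simpler to write, so programmers no longer need to think about memory allocation in detail. Even so, in system-level or driver-level work, and in hard real-time or high-security software, memory allocation remains the foundation of a stable, safe and efficient program.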
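[Note 1]
What does "rounding up" mean?
For example, in the overall alignment step of the 8-byte experiment above, the total member size is 9, and 9 rounded up to a multiple of 4 is 12.
The procedure: starting from 9, add one at a time and check whether the result is divisible by 4. Here 9, 10 and 11 are not divisible by 4 but 12 is, so the rounding stops at 12.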
Part 5. Author
Jonathan Rentzsch