
Explaining the Principles of Memory Alignment from the Memory's Point of View

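2012-03-16 22:08

Contents

Preface
1. Memory access granularity
   Alignment fundamentals
   Lazy processors
2. Speed (the basic principle behind memory alignment)
   The code, explained
   Code with Chinese-language commentary and memory diagrams
3. What can go wrong if you don't understand memory alignment
4. Memory alignment rules
   Why memory is aligned
   Alignment rules
   Experiments
5. Authors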

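Preface
This article is a compilation of four blog posts, not original work. It explains why memory alignment exists, what it does (with explanations in both English and Chinese), and the rules that govern it. It is especially useful for understanding what sizeof reports for a struct.
It first explains the principle behind memory alignment, then describes its effects, and finally works through examples. The original Chinese author explains alignment in detail, with examples. One thing to note is a struct such as: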

struct
{
    char a;
    int b;
    char c;
} A;

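When a is allocated it takes one byte of a four-byte unit, leaving three bytes free. But b needs four bytes, which the remaining three clearly cannot hold, so b has to be placed in the next aligned four-byte unit.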
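Personally, I think the most important tip from the second Chinese author (in the order of the blog addresses listed at the end) is that drawing a diagram is a very good method.
The fourth and last blog I quote uses detailed code to explain the alignment rules!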

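In system-level or driver-level development, and in hard real-time or high-security programs, how a program lays out memory is still the foundation of its stability, safety and efficiency. So memory alignment is worth mastering, at least well enough to hold your own on CSDN!

This article may be something of a jumble. If parts are hard to follow, skip straight to the sections that were originally written in Chinese, even though few people read my blog.
--QQ124045670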

1. Memory access granularity

Programmers are conditioned to think of memory as a simple array of bytes. Among C and its descendants, char* is ubiquitous as meaning "a block of memory", and even Java has its byte[] type to represent raw memory.

Figure 1. How programmers see memory

However, your computer's processor does not read from and write to memory in byte-sized chunks. Instead, it accesses memory in two-, four-, eight-, 16- or even 32-byte chunks. We'll call the size in which a processor accesses memory its memory access granularity.

Figure 2. How processors see memory

The difference between how high-level programmers think of memory and how modern processors actually work with memory raises interesting issues that this article explores.

If you don't understand and address alignment issues in your software, the following scenarios, in increasing order of severity, are all possible:

Your software will run slower.

Your application will lock up.

Your operating system will crash.

Your software will silently fail, yielding incorrect results.

Alignment fundamentals

To illustrate the principles behind alignment, examine a constant task, and how it's affected by a processor's memory access granularity. The task is simple: first read four bytes from address 0 into the processor's register. Then read four bytes from address 1 into the same register.

First examine what would happen on a processor with a one-byte memory access granularity:

Figure 3. Single-byte memory access granularity

This fits in with the naive programmer's model of how memory works: it takes the same four memory accesses to read from address 0 as it does from address 1. Now see what would happen on a processor with two-byte granularity, like the original 68000:

Figure 4. Double-byte memory access granularity

When reading from address 0, a processor with two-byte granularity takes half the number of memory accesses as a processor with one-byte granularity. Because each memory access entails a fixed amount of overhead, minimizing the number of accesses can really help performance.

However, notice what happens when reading from address 1. Because the address doesn't fall evenly on the processor's memory access boundary, the processor has extra work to do. Such an address is known as an unaligned address. Because address 1 is unaligned, a processor with two-byte granularity must perform an extra memory access, slowing down the operation.
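The definition of an unaligned address can be written down as a one-line check. The following is a minimal sketch in C; the function names address_is_aligned and pointer_is_aligned are made up for this illustration and do not come from the original article.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when addr is a multiple of boundary.
   boundary is assumed to be nonzero. */
static bool address_is_aligned(uintptr_t addr, size_t boundary) {
    return addr % boundary == 0;
}

/* Same check for a pointer. The pointer-to-integer conversion is
   implementation-defined, but on mainstream flat-memory platforms it
   yields the numeric address. */
static bool pointer_is_aligned(const void *ptr, size_t boundary) {
    return address_is_aligned((uintptr_t)ptr, boundary);
}
```

With two-byte granularity, address_is_aligned(0, 2) is true and address_is_aligned(1, 2) is false, which is the situation described for the 68000 above.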

Finally, examine what would happen on a processor with four-byte memory access granularity, like the 68030 or PowerPC 601:

Figure 5. Quad-byte memory access granularity

A processor with four-byte granularity can slurp up four bytes from an aligned address with one read. Also note that reading from an unaligned address doubles the access count.
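The access counts described for Figures 3 to 5 follow from simple arithmetic: count how many granularity-sized chunks the requested bytes overlap. Here is a small sketch of that calculation; the helper name chunks_touched is an assumption made for illustration, and it models only the counting, not any real processor.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of granularity-sized chunks a processor must touch to read
   `size` bytes starting at `addr`. Assumes size > 0 and granularity > 0. */
static size_t chunks_touched(uintptr_t addr, size_t size, size_t granularity) {
    uintptr_t first_chunk = addr / granularity;
    uintptr_t last_chunk = (addr + size - 1) / granularity;
    return (size_t)(last_chunk - first_chunk + 1);
}
```

For the four-byte read used in the text, chunks_touched(0, 4, 2) is 2 and chunks_touched(1, 4, 2) is 3, while chunks_touched(0, 4, 4) is 1 and chunks_touched(1, 4, 4) is 2.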

Now that you understand the fundamentals behind aligned data access, you can explore some of the issues related to alignment.

Lazy processors

A processor has to perform some tricks when instructed to access an unaligned address. Going back to the example of reading four bytes from address 1 on a processor with four-byte granularity, you can work out exactly what needs to be done:

Figure 6. How processors handle unaligned memory access

The processor needs to read the first chunk of the unaligned address and shift out the "unwanted" bytes from the first chunk. Then it needs to read the second chunk of the unaligned address and shift out some of its information. Finally, the two are merged together for placement in the register. It's a lot of work.
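The read, shift and merge steps can be imitated in software. The sketch below is only a model under stated assumptions: memory is an array of aligned 32-bit words, and byte 0 of each word is its most significant byte (big-endian numbering, as on PowerPC). The function name load32_from_aligned_words is invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Emulates a 4-byte read at byte address `addr` from a memory that can
   only be read as aligned 32-bit words. The caller must make sure
   words[addr / 4 + 1] exists when addr is not a multiple of 4. */
static uint32_t load32_from_aligned_words(const uint32_t *words, size_t addr) {
    size_t index = addr / 4;
    unsigned shift = (unsigned)(addr % 4) * 8;

    if (shift == 0) {
        return words[index];                /* aligned: a single read */
    }
    /* Read the first chunk and shift out the unwanted leading bytes. */
    uint32_t first = words[index] << shift;
    /* Read the second chunk and keep only its leading bytes. */
    uint32_t second = words[index + 1] >> (32 - shift);
    /* Merge the two pieces into one register-sized value. */
    return first | second;
}
```

With words holding 0x00112233 and 0x44556677, a read at address 1 returns 0x11223344: two word reads, two shifts and a merge instead of one read.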

Some processors just aren't willing to do all of that work for you.

The original 68000 was a processor with two-byte granularity and lacked the circuitry to cope with unaligned addresses. When presented with such an address, the processor would throw an exception. The original Mac OS didn't take very kindly to this exception, and would usually demand the user restart the machine. Ouch.

Later processors in the 680x0 series, such as the 68020, lifted this restriction and performed the necessary work for you. This explains why some old software that works on the 68020 crashes on the 68000. It also explains why, way back when, some old Mac coders initialized pointers with odd addresses. On the original Mac, if the pointer was accessed without being reassigned to a valid address, the Mac would immediately drop into the debugger. Often they could then examine the calling chain stack and figure out where the mistake was.

All processors have a finite number of transistors to get work done. Adding unaligned address access support cuts into this "transistor budget." These transistors could otherwise be used to make other portions of the processor work faster, or add new functionality altogether.

An example of a processor that sacrifices unaligned address access support in the name of speed is MIPS. MIPS is a great example of a processor that does away with almost all frivolity in the name of getting real work done faster.

The PowerPC takes a hybrid approach. Every PowerPC processor to date has hardware support for unaligned 32-bit integer access. While you still pay a performance penalty for unaligned access, it tends to be small.

On the other hand, modern PowerPC processors lack hardware support for unaligned 64-bit floating-point access. When asked to load an unaligned floating-point number from memory, modern PowerPC processors will throw an exception and have the operating system perform the alignment chores in software. Performing alignment in software is much slower than performing it in hardware.


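2. Speed (the basic principle behind memory alignment)

One benefit of memory alignment is faster memory access. Many data structures occupy memory, and many systems require memory allocations to be aligned. The code below explains why alignment speeds up memory access.
The code, explained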

Writing some tests illustrates the performance penalties of unaligned memory access. The test is simple: you read, negate, and write back the numbers in a ten-megabyte buffer. These tests have two variables:

The size, in bytes, in which you process the buffer. First you'll process the buffer one byte at a time. Then you'll move on to two, four and eight bytes at a time.

The alignment of the buffer. You'll stagger the alignment of the buffer by incrementing the pointer to the buffer and running each test again.

These tests were performed on an 800MHz PowerBook G4. To help normalize performance fluctuations from interrupt processing, each test was run ten times, keeping the average of the runs. First up is the test that operates on a single byte at a time:
Listing 1. Munging data one byte at a time

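#include <stdint.h>

void Munge8(void *data, uint32_t size) {
    uint8_t *data8 = (uint8_t *)data;
    uint8_t *data8End = data8 + size;

    /* Negate one byte per iteration. The increment is a separate
       statement so the read and the write are properly sequenced. */
    while (data8 != data8End) {
        *data8 = -*data8;
        ++data8;
    }
}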



It took an average of 67,364 microseconds to execute this function. Now modify it to work on two bytes at a time instead of one byte at a time, which will halve the number of memory accesses:

Listing 2. Munging data two bytes at a time


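void Munge16(void *data, uint32_t size) {
    uint16_t *data16 = (uint16_t *)data;
    uint16_t *data16End = data16 + (size >> 1);         /* Divide size by 2. */
    uint8_t *data8 = (uint8_t *)data16End;
    uint8_t *data8End = data8 + (size & 0x00000001);    /* Strip upper 31 bits. */

    /* Process the bulk of the buffer two bytes at a time. */
    while (data16 != data16End) {
        *data16 = -*data16;
        ++data16;
    }
    /* Process the leftover byte, if any. */
    while (data8 != data8End) {
        *data8 = -*data8;
        ++data8;
    }
}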


This function took 48,765 microseconds to process the same ten-megabyte buffer, 38% faster than Munge8. However, that buffer was aligned. If the buffer is unaligned, the time required increases to 66,385 microseconds, about a 27% speed penalty. The following chart illustrates the performance pattern of aligned memory accesses versus unaligned accesses:

Figure 7. Single-byte access versus double-byte access

The first thing you notice is that accessing memory one byte at a time is uniformly slow. The second item of interest is that when accessing memory two bytes at a time, whenever the address is not evenly divisible by two, that 27% speed penalty rears its ugly head.

Now up the ante, and process the buffer four bytes at a time:

Listing 3. Munging data four bytes at a time


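void Munge32(void *data, uint32_t size) {
    uint32_t *data32 = (uint32_t *)data;
    uint32_t *data32End = data32 + (size >> 2);         /* Divide size by 4. */
    uint8_t *data8 = (uint8_t *)data32End;
    uint8_t *data8End = data8 + (size & 0x00000003);    /* Strip upper 30 bits. */

    /* Process the bulk of the buffer four bytes at a time. */
    while (data32 != data32End) {
        *data32 = -*data32;
        ++data32;
    }
    /* Process the leftover bytes, if any. */
    while (data8 != data8End) {
        *data8 = -*data8;
        ++data8;
    }
}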


This function processes an aligned buffer in 43,043 microseconds and an unaligned buffer in 55,775 microseconds. Thus, on this test machine, accessing unaligned memory four bytes at a time is slower than accessing aligned memory two bytes at a time:

Figure 8. Single- versus double- versus quad-byte access

Now for the horror story: processing the buffer eight bytes at a time.

Listing 4. Munging data eight bytes at a time


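void Munge64(void *data, uint32_t size) {
    /* double is used for the eight-byte accesses, matching the
       floating-point behavior discussed after this listing. */
    double *data64 = (double *)data;
    double *data64End = data64 + (size >> 3);           /* Divide size by 8. */
    uint8_t *data8 = (uint8_t *)data64End;
    uint8_t *data8End = data8 + (size & 0x00000007);    /* Strip upper 29 bits. */

    /* Process the bulk of the buffer eight bytes at a time. */
    while (data64 != data64End) {
        *data64 = -*data64;
        ++data64;
    }
    /* Process the leftover bytes, if any. */
    while (data8 != data8End) {
        *data8 = -*data8;
        ++data8;
    }
}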


Munge64 processes an aligned buffer in 39,085 microseconds, about 10% faster than processing the buffer four bytes at a time. However, processing an unaligned buffer takes an amazing 1,841,155 microseconds, two orders of magnitude slower than aligned access, an outstanding 4,610% performance penalty!

What happened? Because modern PowerPC processors lack hardware support for unaligned floating-point access, the processor throws an exception for each unaligned access. The operating system catches this exception and performs the alignment in software. Here's a chart illustrating the penalty, and when it occurs:

Figure 9. Multiple-byte access comparison

The penalties for one-, two- and four-byte unaligned access are dwarfed by the horrendous unaligned eight-byte penalty. Maybe this chart, removing the top (and thus the tremendous gulf between the two numbers), will be clearer:

Figure 10. Multiple-byte access comparison #2

There's another subtle insight hidden in this data. Compare eight-byte access speeds on four-byte boundaries:

Figure 11. Multiple-byte access comparison #3

Notice accessing memory eight bytes at a time on four- and twelve-byte boundaries is slower than reading the same memory four or even two bytes at a time. While PowerPCs have hardware support for four-byte aligned eight-byte doubles, you still pay a performance penalty if you use that support. Granted, it's nowhere near the 4,610% penalty, but it's certainly noticeable. Moral of the story: accessing memory in large chunks can be slower than accessing memory in small chunks, if that access is not aligned.
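When code has to read a wide value from a location whose alignment it cannot control, one common C idiom is to copy the bytes with memcpy rather than dereference a cast pointer, since memcpy places no alignment requirement on its arguments. The sketch below shows the idiom; it is not part of the original article, the helper names are invented here, and whether it avoids the penalty measured above depends on the compiler and target.

```c
#include <stdint.h>
#include <string.h>

/* Reads a 64-bit value from a location that may not be 8-byte aligned. */
static uint64_t load_u64_any_alignment(const void *src) {
    uint64_t value;
    memcpy(&value, src, sizeof value);
    return value;
}

/* Writes a 64-bit value to a location that may not be 8-byte aligned. */
static void store_u64_any_alignment(void *dst, uint64_t value) {
    memcpy(dst, &value, sizeof value);
}
```

A value stored at buffer + 1 with store_u64_any_alignment can be read back unchanged with load_u64_any_alignment, regardless of where buffer itself sits.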

Atomicity

All modern processors offer atomic instructions. These special instructions are crucial for synchronizing two or more concurrent tasks. As the name implies, atomic instructions must be indivisible. That's why they're so handy for synchronization: they can't be preempted.

It turns out that in order for atomic instructions to perform correctly, the addresses you pass them must be at least four-byte aligned. This is because of a subtle interaction between atomic instructions and virtual memory.

If an address is unaligned, it requires at least two memory accesses. But what happens if the desired data spans two pages of virtual memory? This could lead to a situation where the first page is resident while the last page is not. Upon access, in the middle of the instruction, a page fault would be generated, executing the virtual memory management swap-in code, destroying the atomicity of the instruction. To keep things simple and correct, both the 68K and PowerPC require that atomically manipulated addresses always be at least four-byte aligned.

Unfortunately, the PowerPC does not throw an exception when atomically storing to an unaligned address. Instead, the store simply always fails. This is bad because most atomic functions are written to retry upon a failed store, under the assumption they were preempted. These two circumstances combine to where your program will go into an infinite loop if you attempt to atomically store to an unaligned address. Oops.
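One defensive habit is to check the alignment precondition before an atomic operation, so that a bad address fails loudly in a debug build instead of looping forever. The sketch below uses C11 <stdatomic.h> as a stand-in; the 68K and PowerPC atomic instructions discussed above are not shown, and checked_atomic_increment is a hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Atomically adds 1 and returns the new value. The assert documents the
   precondition described above: the target must be 4-byte aligned. */
static uint32_t checked_atomic_increment(_Atomic uint32_t *counter) {
    assert((uintptr_t)counter % 4 == 0);
    return atomic_fetch_add(counter, 1u) + 1u;
}
```

A counter declared as an ordinary _Atomic uint32_t variable is placed on a suitable boundary by the compiler; the check matters mainly for pointers computed by hand, for example into a byte buffer.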

Altivec

Altivec is all about speed. Unaligned memory access slows down the processor and costs precious transistors. Thus, the Altivec engineers took a page from the MIPS playbook and simply don't support unaligned memory access. Because Altivec works with sixteen-byte chunks at a time, all addresses passed to Altivec must be sixteen-byte aligned. What's scary is what happens if your address is not aligned.

Altivec won't throw an exception to warn you about the unaligned address. Instead, Altivec simply ignores the lower four bits of the address and charges ahead, operating on the wrong address. This means your program may silently corrupt memory or return incorrect results if you don't explicitly make sure all your data is aligned.

There is an advantage to Altivec's bit-stripping ways. Because you don't need to explicitly truncate (align-down) an address, this behavior can save you an instruction or two when handing addresses to the processor.
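Ignoring the lower four bits of an address is the same as rounding it down to a multiple of sixteen. The following one-function sketch shows that arithmetic in plain C; it does not use any Altivec intrinsics, and the function name is made up for illustration.

```c
#include <stdint.h>

/* Address a 16-byte vector load or store would actually use when the low
   four address bits are ignored, as the text describes for Altivec. */
static uintptr_t address_with_low_4_bits_cleared(uintptr_t addr) {
    return addr & ~(uintptr_t)0xF;
}
```

For example, an address of 0x1003 becomes 0x1000, so a sixteen-byte operation meant for 0x1003 through 0x1012 would act on 0x1000 through 0x100F instead.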

This is not to say Altivec can't process unaligned memory. You can find detailed instructions on how to do so in the Altivec Programming Environments Manual (see Resources). It requires more work, but because memory is so slow compared to the processor, the overhead for such shenanigans is surprisingly low.

Structure alignment

Examine the following structure:

Listing 5. An innocent structure

typedef struct {
    char a;
    long b;
    char c;
} Struct;

What is the size of this structure in bytes? Many programmers will answer "6 bytes." It makes sense: one byte for a, four bytes for b and another byte for c. 1 + 4 + 1 equals 6. Here's how it would lay out in memory:

Field Type   Field Name   Field Offset   Field Size   Field End
char         a            0              1            1
long         b            1              4            5
char         c            5              1            6
Total Size in Bytes: 6

However, if you were to ask your compiler to sizeof(Struct), chances are the answer you'd get back would be greater than six, perhaps eight or even twenty-four. There are two reasons for this: backwards compatibility and efficiency.

First, backwards compatibility. Remember the 68000 was a processor with two-byte memory access granularity, and would throw an exception upon encountering an odd address. If you were to read from or write to field b, you'd attempt to access an odd address. If a debugger weren't installed, the old Mac OS would throw up a System Error dialog box with one button: Restart. Yikes!

So, instead of laying out your fields just the way you wrote them, the compiler padded the structure so that b and c would reside at even addresses:

Field Type   Field Name   Field Offset   Field Size   Field End
char         a            0              1            1
padding                   1              1            2
long         b            2              4            6
char         c            6              1            7
padding                   7              1            8
Total Size in Bytes: 8

Padding is the act of adding otherwise unused space to a structure to make fields line up in a desired way. Now, when the 68020 came out with built-in hardware support for unaligned memory access, this padding was unnecessary. However, it didn't hurt anything, and it even helped a little in performance.

The second reason is efficiency. Nowadays, on PowerPC machines, two-byte alignment is nice, but four-byte or eight-byte is better. You probably don't care anymore that the original 68000 choked on unaligned structures, but you probably care about potential 4,610% performance penalties, which can happen if a double field doesn't sit aligned in a structure of your devising.
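Rather than guessing, you can ask the compiler where it put each field. The sketch below prints the layout of the same structure using the standard offsetof macro; print_struct_layout is a name chosen for this example. No values are hard-coded, because the offsets and total size depend on the compiler and target (long is 4 bytes on some platforms and 8 on others).

```c
#include <stddef.h>
#include <stdio.h>

typedef struct {
    char a;
    long b;
    char c;
} Struct;

/* Prints where the compiler actually placed each field. */
static void print_struct_layout(void) {
    printf("a: offset %zu, size %zu\n", offsetof(Struct, a), sizeof(char));
    printf("b: offset %zu, size %zu\n", offsetof(Struct, b), sizeof(long));
    printf("c: offset %zu, size %zu\n", offsetof(Struct, c), sizeof(char));
    printf("sizeof(Struct) = %zu\n", sizeof(Struct));
}
```

Any gap between one field's end and the next field's offset, and any space after c, is padding of the kind shown in the second table.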
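Code with Chinese-language commentary and memory diagrams

The key to understanding memory alignment is to draw a diagram! The examples below, originally written in Chinese, show how.

If the English material is hard to follow, you can go straight to an explanation originally written in Chinese, for example http://www.cppblog.com/snailcong/archive/2009/03/16/76705.html.

Let's start with a program: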

// Environment: VC6 + Windows SP2
// Program 1
#include <iostream>

using namespace std;

struct st1
{
    char a;
    int b;
    short c;
};

struct st2
{
    short c;
    char a;
    int b;
};

int main()
{
    cout << "sizeof(st1) is " << sizeof(st1) << endl;
    cout << "sizeof(st2) is " << sizeof(st2) << endl;
    return 0;
}

The program prints:

sizeof(st1) is 12
sizeof(st2) is 8

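Here is the puzzle: these two structs have the same members, so why does sizeof report different sizes?

The main purpose of this article is to explain exactly that.

The answer is memory alignment: it is the effect of alignment that makes the results differ.

For most programmers, memory alignment is essentially transparent. It is the compiler's job: the compiler places every data item in the program at a suitable position, and as a result structs that hold the same variables in a different declaration order can have different sizes.

So why does the compiler align memory at all? By common sense, sizeof(st1) and sizeof(st2) in Program 1 should both be 7: 4 (int) + 2 (short) + 1 (char) = 7. After alignment the struct actually takes more space.

Before explaining what alignment is for, look at the alignment rules:

1. For the members of a struct, the first member sits at offset 0, and the offset of every later data member must be a multiple of min(the value given to #pragma pack(), the size of that member).

2. After the data members have each been aligned, the struct (or union) itself must also be aligned, using the smaller of the #pragma pack value and the size of the largest data member of the struct (or union).

#pragma pack(n) sets n-byte alignment. VC6 defaults to 8-byte alignment.

Using Program 1 to walk through the rules:

st1: the char takes one byte and starts at offset 0. The int takes 4 bytes, and min(#pragma pack value, size of the member) = 4 (VC6 defaults to 8-byte alignment), so the int is aligned on 4 bytes and its starting offset must be a multiple of 4. It therefore starts at offset 4, and the compiler adds 3 extra bytes after the char that hold no data. The short takes 2 bytes and is aligned on 2 bytes; its starting offset is 8, already a multiple of 2, so no extra bytes are needed. That completes the member alignment of rule 1, and memory now looks like this:

oxxx|oooo|oo

0123|4567|89 (address)

(x marks an added padding byte)

That is 10 bytes in total. The struct itself still has to be aligned, using the smaller of the #pragma pack value and the size of the largest data member. The largest member of st1 is the int, at 4 bytes, and the default #pragma pack value is 8, so the struct itself is aligned on 4 bytes and its total size must be a multiple of 4. Two more bytes are added to bring the total size to 12. Memory now looks like this:

oxxx|oooo|ooxx

0123|4567|89ab (address)

Alignment is now complete. st1 occupies 12 bytes rather than 7.

st2 is aligned the same way as st1; readers can work it out for themselves.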
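The two rules are mechanical enough to turn into a small calculator, which is one way to check a hand-worked layout such as st2. The sketch below is an illustration only: it assumes each member's natural alignment equals its size (true for the char, short and int members used here, not for nested structs or arrays), and the names min_size, round_up and packed_struct_size are made up for this example.

```c
#include <stddef.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Rounds value up to the next multiple of alignment (alignment > 0). */
static size_t round_up(size_t value, size_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}

/* Applies the two rules above to members whose sizes are listed in
   declaration order. `pack` is the #pragma pack value.
   Returns the resulting struct size. */
static size_t packed_struct_size(const size_t *member_sizes, size_t count,
                                 size_t pack) {
    size_t offset = 0;
    size_t largest = 1;

    for (size_t i = 0; i < count; ++i) {
        /* Rule 1: align each member to min(pack, member size). */
        offset = round_up(offset, min_size(pack, member_sizes[i]));
        offset += member_sizes[i];
        if (member_sizes[i] > largest) {
            largest = member_sizes[i];
        }
    }
    /* Rule 2: align the whole struct to min(pack, largest member). */
    return round_up(offset, min_size(pack, largest));
}
```

With pack set to 8, member sizes {1, 4, 2} (st1) give 12 and member sizes {2, 1, 4} (st2) give 8, matching the program output above.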

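Here is another example, from http://www.cppblog.com/cc/archive/2006/08/01/10765.html
Memory alignment

In our programs, data structures, variables and so on all occupy memory. Many systems require memory allocations to be aligned, and the benefit is faster memory access.

Let's again start with a simple program:

Program 1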
#include <iostream>
using namespace std;

struct X1
{
    int i;    // 4 bytes
    char c1;  // 1 byte
    char c2;  // 1 byte
};

struct X2
{
    char c1;  // 1 byte
    int i;    // 4 bytes
    char c2;  // 1 byte
};

struct X3
{
    char c1;  // 1 byte
    char c2;  // 1 byte
    int i;    // 4 bytes
};

int main()
{
    cout << "long " << sizeof(long) << "\n";
    cout << "float " << sizeof(float) << "\n";
    cout << "int " << sizeof(int) << "\n";
    cout << "char " << sizeof(char) << "\n";

    X1 x1;
    X2 x2;
    X3 x3;
    cout << "size of x1 " << sizeof(x1) << "\n";
    cout << "size of x2 " << sizeof(x2) << "\n";
    cout << "size of x3 " << sizeof(x3) << "\n";
    return 0;
}

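The program is simple: it defines three structs, X1, X2 and X3, which differ mainly in the order in which their data is laid out in memory and are otherwise identical. It also prints the number of bytes taken by several basic types and by the three structs.

The program prints:
long 4
float 4
int 4
char 1
size of x1 8
size of x2 12
size of x3 8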

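The first four lines are unsurprising, but the last three show that the three structs take different amounts of space. The cause is the order of the members inside them. How can that be?

This is where memory alignment comes in.

Memory is one contiguous block. We can picture it as in the figure below, divided into alignment units of 4 bytes:

Figure 1

Now look at how the three structs are laid out in memory.

First X1, as shown in the figure below:

In X1 the first member is an int, which takes 4 bytes and fills the first four cells. The second member is a char, which takes only one byte, so it occupies the first cell of the second 4-byte unit. The third member is also a char, taking one byte, and goes in the second cell of the second unit. Together the two chars fit within one unit, so that is how the three variables sit in memory. Because memory is aligned in units, the result is 8 rather than 6: the last two cells count as used as well.

Next X2, as shown in the figure:

In X2 the first member is a char, which takes one byte and goes in the first cell of the first unit. The second is an int, which takes 4 bytes. One cell of the first unit is already used, leaving 3, which cannot hold the int, so because of alignment it has to go in the second unit. The third member is a char and is handled like the first. Because of unit alignment, the struct takes 12 cells rather than 8.

Finally X3, as shown in the figure below:

X3 is similar to X1, except that the two one-byte members come first. Having seen the previous two cases, this one should be easy to follow.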
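Since the original figures are not available, the padding in each struct can be measured instead of drawn. This sketch re-declares the three structs in C and reports how many bytes of each are padding; the macro PADDING_BYTES and the function print_padding are names invented for this illustration, and the printed numbers depend on the compiler and target.

```c
#include <stddef.h>
#include <stdio.h>

struct X1 { int i; char c1; char c2; };
struct X2 { char c1; int i; char c2; };
struct X3 { char c1; char c2; int i; };

/* Padding = total size minus the bytes the members themselves need. */
#define PADDING_BYTES(type) (sizeof(type) - (sizeof(int) + 2 * sizeof(char)))

static void print_padding(void) {
    printf("X1: size %zu, padding %zu\n",
           sizeof(struct X1), PADDING_BYTES(struct X1));
    printf("X2: size %zu, padding %zu, i at offset %zu\n",
           sizeof(struct X2), PADDING_BYTES(struct X2), offsetof(struct X2, i));
    printf("X3: size %zu, padding %zu, i at offset %zu\n",
           sizeof(struct X3), PADDING_BYTES(struct X3), offsetof(struct X3, i));
}
```

On a platform with a 4-byte int aligned to 4 bytes, this should agree with the program output above: X2 carries 6 bytes of padding, X1 and X3 only 2 each. Grouping the small members together is what saves the space.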


Many people know that these results are caused by memory alignment, but few explain the underlying principle. The author above does exactly that!





3. What can go wrong if you don't understand memory alignment

Your software may hit performance-killing unaligned memory access exceptions, which invoke very expensive alignment exception handlers.

Your application may attempt to atomically store to an unaligned address, causing your application to lock up.

Your application may attempt to pass an unaligned address to Altivec, resulting in Altivec reading from and/or writing to the wrong part of memory, silently corrupting data or yielding incorrect results.



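4. Memory alignment rules

I. Why memory is aligned

Most references put it this way:

1. Platform (portability) reasons: not every hardware platform can access arbitrary data at arbitrary addresses. Some platforms can only fetch certain types of data at certain addresses, and raise a hardware exception otherwise.

2. Performance reasons: data structures (especially the stack) should be aligned on natural boundaries wherever possible. The reason is that to access unaligned memory the processor needs two memory accesses, whereas an aligned access needs only one.

II. Alignment rules

The compiler on each platform has its own default "alignment factor" (also called the alignment modulus). Programmers can change it with the preprocessor directive #pragma pack(n), n = 1, 2, 4, 8, 16, where n is the alignment factor you want.

Rules:

1. Member alignment: for the data members of a struct (or union), the first member is placed at offset 0, and each later member is aligned to the smaller of the #pragma pack value and the member's own size.

2. Overall alignment of the struct (or union): after the data members have each been aligned, the struct (or union) itself is also aligned, to the smaller of the #pragma pack value and the size of the largest data member.

3. From 1 and 2 it follows that when the #pragma pack value n equals or exceeds the size of every data member, the value of n has no effect.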
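In practice the pack setting is usually changed for one declaration and then restored. The source text only uses #pragma pack(n) and #pragma pack(); the push/pop form below is a variant accepted by GCC, Clang and MSVC, and the struct names are hypothetical. This is a minimal sketch of the pattern, not a recommendation to pack structs by default.

```c
#include <stddef.h>

#pragma pack(push, 1)          /* save the current setting, then pack to 1 */
struct packed_record {
    char tag;
    int value;                 /* placed at offset 1: no padding after tag */
};
#pragma pack(pop)              /* restore the previous setting */

struct default_record {        /* same members, default alignment */
    char tag;
    int value;
};
```

Here sizeof(struct packed_record) is exactly sizeof(char) + sizeof(int), while the default version is padded so that value sits on an int boundary. The price of the packed layout is that every access to value is an unaligned access of the kind discussed earlier.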

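III. Experiments

The following examples walk through the rules in detail to confirm them.

Compilers: GCC 3.4.2, VC6.0

Platform: Windows XP

A typical struct alignment

The struct definition:

#pragma pack(n) /* n = 1, 2, 4, 8, 16 */
struct test_t {
    int a;
    char b;
    short c;
    char d;
};
#pragma pack()

First confirm the size of each type on the test platform. Both compilers report:

sizeof(char) = 1

sizeof(short) = 2

sizeof(int) = 4

The procedure: change the alignment factor with #pragma pack(n), then look at the value of sizeof(struct test_t).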

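1. 1-byte alignment (#pragma pack(1))

Result: sizeof(struct test_t) = 8 [both compilers agree]

Analysis:

1) Member alignment

#pragma pack(1)
struct test_t {
    int a;   /* size 4 > 1, align to 1; start offset = 0, 0 % 1 = 0; occupies [0,3] */
    char b;  /* size 1 = 1, align to 1; start offset = 4, 4 % 1 = 0; occupies [4] */
    short c; /* size 2 > 1, align to 1; start offset = 5, 5 % 1 = 0; occupies [5,6] */
    char d;  /* size 1 = 1, align to 1; start offset = 7, 7 % 1 = 0; occupies [7] */
};
#pragma pack()

Total size of members = 8

2) Overall alignment

Overall alignment factor = min(max(int, short, char), 1) = 1

Overall size = $(total size of members) rounded up to $(overall alignment factor) = 8 /* 8 % 1 = 0 */ [Note 1]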

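2. 2-byte alignment (#pragma pack(2))

Result: sizeof(struct test_t) = 10 [both compilers agree]

Analysis:

1) Member alignment

#pragma pack(2)
struct test_t {
    int a;   /* size 4 > 2, align to 2; start offset = 0, 0 % 2 = 0; occupies [0,3] */
    char b;  /* size 1 < 2, align to 1; start offset = 4, 4 % 1 = 0; occupies [4] */
    short c; /* size 2 = 2, align to 2; start offset = 6, 6 % 2 = 0; occupies [6,7] */
    char d;  /* size 1 < 2, align to 1; start offset = 8, 8 % 1 = 0; occupies [8] */
};
#pragma pack()

Total size of members = 9

2) Overall alignment

Overall alignment factor = min(max(int, short, char), 2) = 2

Overall size = $(total size of members) rounded up to $(overall alignment factor) = 10 /* 10 % 2 = 0 */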

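3. 4-byte alignment (#pragma pack(4))

Result: sizeof(struct test_t) = 12 [both compilers agree]

Analysis:

1) Member alignment

#pragma pack(4)
struct test_t {
    int a;   /* size 4 = 4, align to 4; start offset = 0, 0 % 4 = 0; occupies [0,3] */
    char b;  /* size 1 < 4, align to 1; start offset = 4, 4 % 1 = 0; occupies [4] */
    short c; /* size 2 < 4, align to 2; start offset = 6, 6 % 2 = 0; occupies [6,7] */
    char d;  /* size 1 < 4, align to 1; start offset = 8, 8 % 1 = 0; occupies [8] */
};
#pragma pack()

Total size of members = 9

2) Overall alignment

Overall alignment factor = min(max(int, short, char), 4) = 4

Overall size = $(total size of members) rounded up to $(overall alignment factor) = 12 /* 12 % 4 = 0 */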

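4. 8-byte alignment (#pragma pack(8))

Result: sizeof(struct test_t) = 12 [both compilers agree]

Analysis:

1) Member alignment

#pragma pack(8)
struct test_t {
    int a;   /* size 4 < 8, align to 4; start offset = 0, 0 % 4 = 0; occupies [0,3] */
    char b;  /* size 1 < 8, align to 1; start offset = 4, 4 % 1 = 0; occupies [4] */
    short c; /* size 2 < 8, align to 2; start offset = 6, 6 % 2 = 0; occupies [6,7] */
    char d;  /* size 1 < 8, align to 1; start offset = 8, 8 % 1 = 0; occupies [8] */
};
#pragma pack()

Total size of members = 9

2) Overall alignment

Overall alignment factor = min(max(int, short, char), 8) = 4

Overall size = $(total size of members) rounded up to $(overall alignment factor) = 12 /* 12 % 4 = 0 */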

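5. 16-byte alignment (#pragma pack(16))

Result: sizeof(struct test_t) = 12 [both compilers agree]

Analysis:

1) Member alignment

#pragma pack(16)
struct test_t {
    int a;   /* size 4 < 16, align to 4; start offset = 0, 0 % 4 = 0; occupies [0,3] */
    char b;  /* size 1 < 16, align to 1; start offset = 4, 4 % 1 = 0; occupies [4] */
    short c; /* size 2 < 16, align to 2; start offset = 6, 6 % 2 = 0; occupies [6,7] */
    char d;  /* size 1 < 16, align to 1; start offset = 8, 8 % 1 = 0; occupies [8] */
};
#pragma pack()

Total size of members = 9

2) Overall alignment

Overall alignment factor = min(max(int, short, char), 16) = 4

Overall size = $(total size of members) rounded up to $(overall alignment factor) = 12 /* 12 % 4 = 0 */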

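The 8-byte and 16-byte experiments confirm rule 3: "when the #pragma pack value n equals or exceeds the size of every data member, the value of n has no effect."

Memory allocation and memory alignment are complicated topics. They are closely tied to the specific implementation, and the rules differ between operating systems, compilers and hardware platforms. Most systems and languages today manage and allocate memory automatically and hide the low-level operations, which makes applications much easier to write, so programmers no longer need to think about the details of memory allocation. Even so, in system-level or driver-level development, and in hard real-time or high-security programs, how a program lays out memory is still the foundation of its stability, safety and efficiency.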

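[Note 1]
What does "rounding up" mean here?
For example, in the "overall alignment" step of the 8-byte alignment case above, the total size of the members is 9, and rounding 9 up to a multiple of 4 gives 12.
The process: start at 9 and add one each time, checking whether the result is divisible by 4. 9, 10 and 11 are not; 12 is, so the rounding stops there.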
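The counting procedure in the note translates directly into a loop, and the same result can be computed in one expression. Both versions are sketched below; the function names are made up for this illustration, and alignment is assumed to be greater than zero.

```c
#include <stddef.h>

/* Literal version of the note: count upward until divisible. */
static size_t round_up_by_counting(size_t value, size_t alignment) {
    while (value % alignment != 0) {
        ++value;
    }
    return value;
}

/* Same result without a loop: add (alignment - 1), then drop the
   remainder with integer division. */
static size_t round_up_arithmetic(size_t value, size_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}
```

Both return 12 for a value of 9 and an alignment of 4, and both leave a value that is already a multiple unchanged.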



5. Authors

Jonathan Rentzsch, http://www.ibm.com/developerworks/library/pa-dalign/
http://www.cppblog.com/cc/archive/2006/08/01/10765.html (an excellent explanation in Chinese)
http://www.cppblog.com/snailcong/archive/2009/03/16/76705.html (a digest of the English article; see that blog)
http://blogold.chinaunix.net/u3/118340/showart_2615855.html