
Explaining the Principles of Memory Alignment from the Memory's Point of View

2012-03-16 22:08

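Contents

Preface

1. Memory access granularity
    Alignment fundamentals
    Lazy processors

2. Speed (the basic principle behind memory alignment)
    Code walkthrough
    Chinese-language code and memory-layout explanation

3. What can go wrong if you don't understand memory alignment

4. Memory alignment rules
    Why memory is aligned
    Alignment rules
    Experiments

5. Authors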

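Preface

This article is a synthesis of four blog posts and is not original work. It explains why memory is aligned, what alignment does (with explanations in both English and Chinese), and the rules that govern it. It is especially useful for understanding sizeof applied to structs.
It first explains the principle behind memory alignment, then describes its effects, and finishes with examples. The Chinese authors give detailed explanations and worked examples. One point to note, for a struct such as

struct
{
    char a;
    int b;
    char c;
} A;

a takes one byte of a four-byte unit, which leaves three. b needs four bytes, which obviously do not fit in the remaining three, so b has to go into the next four-byte unit.
Personally, I think the most important point made by the second Chinese author (in the order of the blog addresses listed at the end) is that drawing a picture is a very good method.
The fourth blog I cite, the last in the list, explains the alignment rules through detailed code.

In system-level and driver-level development, and in hard real-time or high-security programs, memory layout remains the foundation of a stable, safe, and efficient program. Memory alignment is therefore worth mastering. At the very least it gives you something to talk about on CSDN!

The article is a bit of a patchwork. If you have trouble following it, you can skip straight to the parts taken from the Chinese blogs, though few people visit my blog anyway.
--QQ124045670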

1. Memory access granularity

Programmers are conditioned to think of memory as a simple array of bytes. Among C and its descendants, char* is ubiquitous as meaning "a block of memory", and even Java has its byte[] type to represent raw memory.

Figure 1. How programmers see memory

However, your computer's processor does not read from and write to memory in byte-sized chunks. Instead, it accesses memory in two-, four-, eight-, 16- or even 32-byte chunks. We'll call the size in which a processor accesses memory its memory access granularity.

Figure 2. How processors see memory

The difference between how high-level programmers think of memory and how modern processors actually work with memory raises interesting issues that this article explores.

If you don't understand and address alignment issues in your software, the following scenarios, in increasing order of severity, are all possible:

Your software will run slower.

Your application will lock up.

Your operating system will crash.

Your software will silently fail, yielding incorrect results.

Alignment fundamentals

To illustrate the principles behind alignment, examine a constant task, and how it's affected by a processor's memory access granularity. The task is simple: first read four bytes from address 0 into the processor's register. Then read four bytes from address 1 into the same register.

First examine what would happen on a processor with a one-byte memory access granularity:

Figure 3. Single-byte memory access granularity

This fits in with the naive programmer's model of how memory works: it takes the same four memory accesses to read from address 0 as it does from address 1. Now see what would happen on a processor with two-byte granularity, like the original 68000:

Figure 4. Double-byte memory access granularity

When reading from address 0, a processor with two-byte granularity takes half the number of memory accesses as a processor with one-byte granularity. Because each memory access entails a fixed amount of overhead, minimizing the number of accesses can really help performance.

However, notice what happens when reading from address 1. Because the address doesn't fall evenly on the processor's memory access boundary, the processor has extra work to do. Such an address is known as an unaligned address. Because address 1 is unaligned, a processor with two-byte granularity must perform an extra memory access, slowing down the operation.

Finally, examine what would happen on a processor with four-byte memory access granularity, like the 68030 or PowerPC 601:

Figure 5. Quad-byte memory access granularity

A processor with four-byte granularity can slurp up four bytes from an aligned address with one read. Also note that reading from an unaligned address doubles the access count.
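The access counts in the three cases above can be reproduced with a little arithmetic. The following is a minimal sketch, not part of the original article: count_accesses is a hypothetical helper, and it assumes the processor only ever fetches whole chunks that start on a multiple of the granularity.

```c
#include <stdint.h>

/* Hypothetical helper: number of aligned chunk fetches needed to read
   `size` bytes starting at `addr`, assuming the processor only fetches
   whole chunks of `granularity` bytes on granularity-aligned boundaries. */
unsigned count_accesses(uint32_t addr, uint32_t size, uint32_t granularity) {
    if (size == 0) {
        return 0;
    }
    uint32_t first_chunk = addr / granularity;              /* chunk holding the first byte */
    uint32_t last_chunk  = (addr + size - 1) / granularity; /* chunk holding the last byte  */
    return last_chunk - first_chunk + 1;
}
```

Under these assumptions a four-byte read from address 0 costs 4, 2 and 1 accesses at granularities 1, 2 and 4, while the same read from address 1 costs 4, 3 and 2.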

Nowthatyouunderstandthefundamentalsbehindaligneddataaccess,youcanexploresomeoftheissuesrelatedtoalignment.

Lazy processors

A processor has to perform some tricks when instructed to access an unaligned address. Going back to the example of reading four bytes from address 1 on a processor with four-byte granularity, you can work out exactly what needs to be done:

Figure 6. How processors handle unaligned memory access

The processor needs to read the first chunk of the unaligned address and shift out the "unwanted" bytes from the first chunk. Then it needs to read the second chunk of the unaligned address and shift out some of its information. Finally, the two are merged together for placement in the register. It's a lot of work.
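The read, shift and merge steps can be imitated in software. This is only an illustrative sketch under stated assumptions: memory is modeled as a byte array, chunks are four bytes wide and big-endian (as on the 68K and PowerPC), and both function names are hypothetical rather than anything the hardware or the original article defines.

```c
#include <stdint.h>

/* Model of one aligned four-byte fetch, big-endian as on the 68K and
   PowerPC. `aligned_addr` must be a multiple of 4. */
static uint32_t fetch_aligned(const uint8_t *mem, uint32_t aligned_addr) {
    return ((uint32_t)mem[aligned_addr]     << 24) |
           ((uint32_t)mem[aligned_addr + 1] << 16) |
           ((uint32_t)mem[aligned_addr + 2] <<  8) |
            (uint32_t)mem[aligned_addr + 3];
}

/* Hypothetical emulation of a four-byte load from any address using
   only aligned fetches, shifts and a merge. */
uint32_t load_unaligned(const uint8_t *mem, uint32_t addr) {
    uint32_t base   = addr & ~(uint32_t)3; /* start of the first chunk       */
    uint32_t offset = addr & 3;            /* unwanted bytes in that chunk   */
    if (offset == 0) {
        return fetch_aligned(mem, base);   /* aligned: one fetch is enough   */
    }
    /* Shift the unwanted leading bytes out of the first chunk. */
    uint32_t first = fetch_aligned(mem, base) << (8 * offset);
    /* Keep only the leading bytes of the second chunk that are needed. */
    uint32_t second = fetch_aligned(mem, base + 4) >> (8 * (4 - offset));
    return first | second;                 /* merge for the register         */
}
```

With the bytes 00 11 22 33 44 55 66 77 in memory, a load from address 1 fetches both chunks and merges them into 0x11223344, whereas a load from address 0 needs only the first fetch.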

Some processors just aren't willing to do all of that work for you.

The original 68000 was a processor with two-byte granularity and lacked the circuitry to cope with unaligned addresses. When presented with such an address, the processor would throw an exception. The original Mac OS didn't take very kindly to this exception, and would usually demand the user restart the machine. Ouch.

Later processors in the 680x0 series, such as the 68020, lifted this restriction and performed the necessary work for you. This explains why some old software that works on the 68020 crashes on the 68000. It also explains why, way back when, some old Mac coders initialized pointers with odd addresses. On the original Mac, if the pointer was accessed without being reassigned to a valid address, the Mac would immediately drop into the debugger. Often they could then examine the calling chain stack and figure out where the mistake was.

All processors have a finite number of transistors to get work done. Adding unaligned address access support cuts into this "transistor budget." These transistors could otherwise be used to make other portions of the processor work faster, or add new functionality altogether.

An example of a processor that sacrifices unaligned address access support in the name of speed is MIPS. MIPS is a great example of a processor that does away with almost all frivolity in the name of getting real work done faster.

The PowerPC takes a hybrid approach. Every PowerPC processor to date has hardware support for unaligned 32-bit integer access. While you still pay a performance penalty for unaligned access, it tends to be small.

On the other hand, modern PowerPC processors lack hardware support for unaligned 64-bit floating-point access. When asked to load an unaligned floating-point number from memory, modern PowerPC processors will throw an exception and have the operating system perform the alignment chores in software. Performing alignment in software is much slower than performing it in hardware.

2. Speed (the basic principle behind memory alignment)

One benefit of memory alignment is faster memory access. Many data structures occupy memory, and many systems require allocations to be aligned. The following uses code to explain why alignment speeds up memory access.

Code walkthrough

Writing some tests illustrates the performance penalties of unaligned memory access. The test is simple: you read, negate, and write back the numbers in a ten-megabyte buffer. These tests have two variables:

The size, in bytes, in which you process the buffer. First you'll process the buffer one byte at a time. Then you'll move on to two-, four- and eight-bytes at a time.

The alignment of the buffer. You'll stagger the alignment of the buffer by incrementing the pointer to the buffer and running each test again.
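Staggering the buffer only matters if you know which of the resulting pointers are aligned for a given access size. The sketch below is an assumption about how such a check could look, not the article's actual test harness; is_aligned and count_aligned_offsets are hypothetical helpers, and converting a pointer to uintptr_t to inspect its low bits is implementation-defined, although it behaves as expected on mainstream platforms.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if `p` sits on an `n`-byte boundary, 0 otherwise. */
int is_aligned(const void *p, size_t n) {
    return ((uintptr_t)p % n) == 0;
}

/* Hypothetical helper: staggers `base` by 0..steps-1 bytes and counts how
   many of the resulting pointers are aligned for `access_size`-byte access. */
size_t count_aligned_offsets(const uint8_t *base, size_t steps, size_t access_size) {
    size_t aligned = 0;
    for (size_t offset = 0; offset < steps; ++offset) {
        if (is_aligned(base + offset, access_size)) {
            ++aligned;
        }
    }
    return aligned;
}
```

For a buffer that starts on a sixteen-byte boundary, only every fourth staggered pointer is suitable for aligned four-byte access, so most of the staggered runs exercise the unaligned path.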

These tests were performed on an 800 MHz PowerBook G4. To help normalize performance fluctuations from interrupt processing, each test was run ten times, keeping the average of the runs. First up is the test that operates on a single byte at a time:

Listing 1. Munging data one byte at a time

void Munge8(void *data, uint32_t size) {
    uint8_t *data8 = (uint8_t*)data;
    uint8_t *data8End = data8 + size;

    while (data8 != data8End) {
        *data8 = -*data8; /* Negate in place, then advance. */
        ++data8;
    }
}

It took an average of 67,364 microseconds to execute this function. Now modify it to work on two bytes at a time instead of one byte at a time -- which will halve the number of memory accesses:

Listing 2. Munging data two bytes at a time

void Munge16(void *data, uint32_t size) {
    uint16_t *data16 = (uint16_t*)data;
    uint16_t *data16End = data16 + (size >> 1); /* Divide size by 2. */
    uint8_t *data8 = (uint8_t*)data16End;
    uint8_t *data8End = data8 + (size & 0x00000001); /* Strip upper 31 bits. */

    while (data16 != data16End) {
        *data16 = -*data16;
        ++data16;
    }
    while (data8 != data8End) {
        *data8 = -*data8; /* Handle the leftover byte, if any. */
        ++data8;
    }
}

This function took 48,765 microseconds to process the same ten-megabyte buffer -- 38% faster than Munge8. However, that buffer was aligned. If the buffer is unaligned, the time required increases to 66,385 microseconds -- about a 27% speed penalty. The following chart illustrates the performance pattern of aligned memory accesses versus unaligned accesses:

Figure 7. Single-byte access versus double-byte access

The first thing you notice is that accessing memory one byte at a time is uniformly slow. The second item of interest is that when accessing memory two bytes at a time, whenever the address is not evenly divisible by two, that 27% speed penalty rears its ugly head.

Now up the ante, and process the buffer four bytes at a time:

Listing 3. Munging data four bytes at a time

void Munge32(void *data, uint32_t size) {
    uint32_t *data32 = (uint32_t*)data;
    uint32_t *data32End = data32 + (size >> 2); /* Divide size by 4. */
    uint8_t *data8 = (uint8_t*)data32End;
    uint8_t *data8End = data8 + (size & 0x00000003); /* Strip upper 30 bits. */

    while (data32 != data32End) {
        *data32 = -*data32;
        ++data32;
    }
    while (data8 != data8End) {
        *data8 = -*data8; /* Handle the leftover bytes, if any. */
        ++data8;
    }
}

This function processes an aligned buffer in 43,043 microseconds and an unaligned buffer in 55,775 microseconds, respectively. Thus, on this test machine, accessing unaligned memory four bytes at a time is slower than accessing aligned memory two bytes at a time:

Figure 8. Single- versus double- versus quad-byte access

Now for the horror story: processing the buffer eight bytes at a time.

Listing 4. Munging data eight bytes at a time

void Munge64(void *data, uint32_t size) {
    double *data64 = (double*)data;
    double *data64End = data64 + (size >> 3); /* Divide size by 8. */
    uint8_t *data8 = (uint8_t*)data64End;
    uint8_t *data8End = data8 + (size & 0x00000007); /* Strip upper 29 bits. */

    while (data64 != data64End) {
        *data64 = -*data64;
        ++data64;
    }
    while (data8 != data8End) {
        *data8 = -*data8; /* Handle the leftover bytes, if any. */
        ++data8;
    }
}

Munge64 processes an aligned buffer in 39,085 microseconds -- about 10% faster than processing the buffer four bytes at a time. However, processing an unaligned buffer takes an amazing 1,841,155 microseconds -- two orders of magnitude slower than aligned access, an outstanding 4,610% performance penalty!

What happened? Because modern PowerPC processors lack hardware support for unaligned floating-point access, the processor throws an exception for each unaligned access. The operating system catches this exception and performs the alignment in software. Here's a chart illustrating the penalty, and when it occurs:

Figure 9. Multiple-byte access comparison

The penalties for one-, two- and four-byte unaligned access are dwarfed by the horrendous unaligned eight-byte penalty. Maybe this chart, removing the top (and thus the tremendous gulf between the two numbers), will be clearer:

Figure 10. Multiple-byte access comparison #2

There's another subtle insight hidden in this data. Compare eight-byte access speeds on four-byte boundaries:

Figure 11. Multiple-byte access comparison #3

Notice accessing memory eight bytes at a time on four- and twelve-byte boundaries is slower than reading the same memory four or even two bytes at a time. While PowerPCs have hardware support for four-byte aligned eight-byte doubles, you still pay a performance penalty if you use that support. Granted, it's nowhere near the 4,610% penalty, but it's certainly noticeable. Moral of the story: accessing memory in large chunks can be slower than accessing memory in small chunks, if that access is not aligned.

Atomicity

All modern processors offer atomic instructions. These special instructions are crucial for synchronizing two or more concurrent tasks. As the name implies, atomic instructions must be indivisible -- that's why they're so handy for synchronization: they can't be preempted.

It turns out that in order for atomic instructions to perform correctly, the addresses you pass them must be at least four-byte aligned. This is because of a subtle interaction between atomic instructions and virtual memory.

If an address is unaligned, it requires at least two memory accesses. But what happens if the desired data spans two pages of virtual memory? This could lead to a situation where the first page is resident while the last page is not. Upon access, in the middle of the instruction, a page fault would be generated, executing the virtual memory management swap-in code, destroying the atomicity of the instruction. To keep things simple and correct, both the 68K and PowerPC require that atomically manipulated addresses always be at least four-byte aligned.

Unfortunately, the PowerPC does not throw an exception when atomically storing to an unaligned address. Instead, the store simply always fails. This is bad because most atomic functions are written to retry upon a failed store, under the assumption they were preempted. These two circumstances combine to where your program will go into an infinite loop if you attempt to atomically store to an unaligned address. Oops.
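In portable C, the usual way to stay clear of this trap is to let the compiler place atomic objects itself. The sketch below uses C11 <stdatomic.h>, which postdates the original article, so treat it as an assumption about a modern toolchain. The names counter, bump and counter_is_four_byte_aligned are hypothetical, and the four-byte check reflects what mainstream targets do for a 32-bit atomic, not a promise made by the C standard.

```c
#include <stdatomic.h>
#include <stdint.h>

/* A C11 atomic object; the compiler chooses an address that suits the
   target's atomic instructions. */
static _Atomic uint32_t counter;

/* Atomically adds 1 to the counter and returns the new value. */
uint32_t bump(void) {
    return atomic_fetch_add(&counter, 1u) + 1u;
}

/* Returns 1 if the counter sits on a four-byte boundary. */
int counter_is_four_byte_aligned(void) {
    return ((uintptr_t)&counter % 4u) == 0;
}
```

The risk described above comes back as soon as an atomic operation is applied through a pointer that was cast from a packed structure or from an arbitrary byte offset, which is exactly the case to avoid.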

Altivec

Altivec is all about speed. Unaligned memory access slows down the processor and costs precious transistors. Thus, the Altivec engineers took a page from the MIPS playbook and simply don't support unaligned memory access. Because Altivec works with sixteen-byte chunks at a time, all addresses passed to Altivec must be sixteen-byte aligned. What's scary is what happens if your address is not aligned.

Altivec won't throw an exception to warn you about the unaligned address. Instead, Altivec simply ignores the lower four bits of the address and charges ahead, operating on the wrong address. This means your program may silently corrupt memory or return incorrect results if you don't explicitly make sure all your data is aligned.

There is an advantage to Altivec's bit-stripping ways. Because you don't need to explicitly truncate (align-down) an address, this behavior can save you an instruction or two when handing addresses to the processor.
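Ignoring the lower four bits is the same operation as aligning an address down to a sixteen-byte boundary. The following sketch shows that arithmetic on plain integers; it is not Altivec code, and align_down_16 and is_16_byte_aligned are hypothetical helper names.

```c
#include <stdint.h>

/* Clears the lower four bits, giving the sixteen-byte boundary at or
   below `addr`. */
uintptr_t align_down_16(uintptr_t addr) {
    return addr & ~(uintptr_t)0xF;
}

/* Returns 1 if the lower four bits of `addr` are all zero. */
int is_16_byte_aligned(uintptr_t addr) {
    return (addr & 0xF) == 0;
}
```

An address such as 0x1003 is therefore treated as 0x1000, which is why an unaligned pointer ends up touching the wrong sixteen bytes without any warning.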

This is not to say Altivec can't process unaligned memory. You can find detailed instructions how to do so on the Altivec Programming Environments Manual (see Resources). It requires more work, but because memory is so slow compared to the processor, the overhead for such shenanigans is surprisingly low.

Structure alignment

Examine the following structure:

Listing 5. An innocent structure

typedef struct {
    char a;
    long b;
    char c;
} Struct;

What is the size of this structure in bytes? Many programmers will answer "6 bytes." It makes sense: one byte for a, four bytes for b and another byte for c. 1 + 4 + 1 equals 6. Here's how it would lay out in memory:

Field Type    Field Name    Field Offset    Field Size    Field End
char          a             0               1             1
long          b             1               4             5
char          c             5               1             6
Total Size in Bytes: 6

However, if you were to ask your compiler to sizeof(Struct), chances are the answer you'd get back would be greater than six, perhaps eight or even twenty-four. There's two reasons for this: backwards compatibility and efficiency.

First, backwards compatibility. Remember the 68000 was a processor with two-byte memory access granularity, and would throw an exception upon encountering an odd address. If you were to read from or write to field b, you'd attempt to access an odd address. If a debugger weren't installed, the old Mac OS would throw up a System Error dialog box with one button: Restart. Yikes!

So, instead of laying out your fields just the way you wrote them, the compiler padded the structure so that b and c would reside at even addresses:

Field Type    Field Name    Field Offset    Field Size    Field End
char          a             0               1             1
padding                     1               1             2
long          b             2               4             6
char          c             6               1             7
padding                     7               1             8
Total Size in Bytes: 8

Padding is the act of adding otherwise unused space to a structure to make fields line up in a desired way. Now, when the 68020 came out with built-in hardware support for unaligned memory access, this padding was unnecessary. However, it didn't hurt anything, and it even helped a little in performance.

The second reason is efficiency. Nowadays, on PowerPC machines, two-byte alignment is nice, but four-byte or eight-byte is better. You probably don't care anymore that the original 68000 choked on unaligned structures, but you probably care about potential 4,610% performance penalties, which can happen if a double field doesn't sit aligned in a structure of your devising.
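Rather than guessing, you can ask the compiler for the actual layout. The sketch below prints the same columns as the tables above using offsetof and sizeof. print_struct_layout is a hypothetical helper, and the numbers it prints depend on the compiler and platform (long is 4 bytes on some and 8 on others), so no expected output is shown.

```c
#include <stddef.h>
#include <stdio.h>

typedef struct {
    char a;
    long b;
    char c;
} Struct;

/* Prints each field's offset and size as laid out by the compiler in use. */
void print_struct_layout(void) {
    printf("a: offset %zu, size %zu\n", offsetof(Struct, a), sizeof(char));
    printf("b: offset %zu, size %zu\n", offsetof(Struct, b), sizeof(long));
    printf("c: offset %zu, size %zu\n", offsetof(Struct, c), sizeof(char));
    printf("sizeof(Struct) = %zu\n", sizeof(Struct));
}
```

Any gap between the end of one field and the offset of the next, or between the end of c and sizeof(Struct), is padding the compiler inserted.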
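Chinese-language code and memory-layout explanation

The key to understanding memory alignment is to draw a picture. The examples from the Chinese blogs below do exactly that.

If the English explanation is hard to follow, you can work from the Chinese one instead, for example http://www.cppblog.com/snailcong/archive/2009/03/16/76705.html.

Let's start with a program: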

// Environment: VC6 + Windows SP2
// Program 1
#include <iostream>

using namespace std;

struct st1
{
    char a;
    int b;
    short c;
};

struct st2
{
    short c;
    char a;
    int b;
};

int main()
{
    cout << "sizeof(st1) is " << sizeof(st1) << endl;
    cout << "sizeof(st2) is " << sizeof(st2) << endl;
    return 0;
}

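The program prints:

sizeof(st1) is 12
sizeof(st2) is 8

Here is the puzzle: the two structs contain the same members, so why does sizeof report different sizes?

The main purpose of this article is to explain exactly that.

The answer is memory alignment: it is what makes the two results differ.

For most programmers memory alignment is essentially transparent. It is the compiler's job: the compiler places every data item in the program at a suitable position, and as a result structs with the same members declared in a different order can have different sizes.

So why does the compiler align memory? Common sense suggests that sizeof(st1) and sizeof(st2) in program 1 should both be 7: 4 (int) + 2 (short) + 1 (char) = 7. With alignment, the structs actually take up more space.

Before explaining what alignment is for, let's look at the alignment rules:

1. For the members of a struct, the first member sits at offset 0, and the offset of every later member must be a multiple of min(the value given to #pragma pack(), the member's own size).

2. After the members have been aligned individually, the struct (or union) itself is aligned as well, using the smaller of the #pragma pack value and the size of the largest member of the struct (or union).

#pragma pack(n) sets n-byte alignment. VC6 defaults to 8-byte alignment.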
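The two rules are mechanical enough to model in a few lines. The sketch below is a hypothetical calculator, not how any particular compiler is implemented: struct_size assumes every member is a scalar whose natural alignment equals its size, and it takes the member sizes in declaration order together with the #pragma pack value.

```c
#include <stddef.h>

/* Smaller of two values. */
static size_t min_size(size_t a, size_t b) {
    return a < b ? a : b;
}

/* Rounds `n` up to the next multiple of `align` (align must be non-zero). */
static size_t round_up(size_t n, size_t align) {
    return (n + align - 1) / align * align;
}

/* Hypothetical model of the two rules. `sizes` holds the size of each
   member in declaration order; `pack` is the #pragma pack value. */
size_t struct_size(const size_t *sizes, size_t count, size_t pack) {
    size_t offset = 0;  /* first free byte so far */
    size_t largest = 1; /* size of the largest member seen */
    for (size_t i = 0; i < count; ++i) {
        /* Rule 1: each member starts at a multiple of min(pack, its size). */
        offset = round_up(offset, min_size(pack, sizes[i]));
        offset += sizes[i];
        if (sizes[i] > largest) {
            largest = sizes[i];
        }
    }
    /* Rule 2: the total is rounded up to min(pack, largest member). */
    return round_up(offset, min_size(pack, largest));
}
```

Feeding it the member sizes of st1 (1, 4, 2) and st2 (2, 1, 4) with a pack value of 8 gives 12 and 8, matching the program's output.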

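Taking program 1 as the example:

st1: char takes one byte and starts at offset 0. int takes 4 bytes, and min(the #pragma pack() value, the member's own size) = 4 (VC6 defaults to 8-byte packing), so the int is aligned to 4 bytes. Its starting offset must be a multiple of 4, so it starts at offset 4, and the compiler adds 3 padding bytes after the char that hold no data. short takes 2 bytes and is aligned to 2 bytes; its starting offset is 8, which is already a multiple of 2, so no padding is needed. That completes member alignment under rule 1, and memory now looks like this:

oxxx|oooo|oo

0123456789 (address)

(x marks a padding byte)

That is 10 bytes in total. The struct itself still has to be aligned, using the smaller of the #pragma pack value and the size of the largest member. The largest member of st1 is the int, at 4 bytes, and the default #pragma pack value is 8, so the struct is aligned to 4 bytes. Its total size must be a multiple of 4, so 2 more padding bytes are added to bring the total to 12. Memory now looks like this:

oxxx|oooo|ooxx

0123456789ab (address)

That completes the alignment. st1 occupies 12 bytes rather than 7.

st2 is aligned in the same way; readers can work it out for themselves.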

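Here is another example, from http://www.cppblog.com/cc/archive/2006/08/01/10765.html

Memory alignment

In our programs, data structures, variables and so on all occupy memory. Many systems require memory allocations to be aligned, and the benefit is faster memory access.

Let's again start with a simple program: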
Program 1

#include <iostream>
using namespace std;

struct X1
{
    int i;    // 4 bytes
    char c1;  // 1 byte
    char c2;  // 1 byte
};

struct X2
{
    char c1;  // 1 byte
    int i;    // 4 bytes
    char c2;  // 1 byte
};

struct X3
{
    char c1;  // 1 byte
    char c2;  // 1 byte
    int i;    // 4 bytes
};

int main()
{
    cout << "long " << sizeof(long) << "\n";
    cout << "float " << sizeof(float) << "\n";
    cout << "int " << sizeof(int) << "\n";
    cout << "char " << sizeof(char) << "\n";

    X1 x1;
    X2 x2;
    X3 x3;
    cout << "size of x1 " << sizeof(x1) << "\n";
    cout << "size of x2 " << sizeof(x2) << "\n";
    cout << "size of x3 " << sizeof(x3) << "\n";
    return 0;
}

The program is simple. It defines three structs, X1, X2 and X3, which differ only in the order of their members. It also prints the size in bytes of several basic types and of the three structs.

The output is:

long 4
float 4
int 4
char 1
size of x1 8
size of x2 12
size of x3 8

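The first four lines of output are unremarkable, but the last three show that the three structs occupy different amounts of space. The cause is the order in which their members are declared. How can that be?

This is where memory alignment comes in.

Memory is one contiguous block. We can picture it as in the diagram below, divided into alignment units of 4 bytes:

Diagram 1

Let's look at how the three structs are laid out in memory.

First X1, as shown in the diagram below.

The first member of X1 is an int, which takes 4 bytes and fills the first four cells. The second member is a char, which takes only one byte, so it goes in the first cell of the second 4-byte unit. The third member is also a char, taking one byte, and it goes in the second cell of the second unit. Together the two chars fit within one unit, so that is how the three members sit in memory. Because memory is aligned in units, the result is 8 rather than 6: the last two cells count as used as well.

Next X2, as shown in the diagram.

The first member of X2 is a char, which takes one byte and goes in the first cell of the first unit. The second member is an int, which takes 4 bytes. The first unit has one cell used and three left, which cannot hold the int, so because of alignment it has to go in the second unit. The third member is a char and is handled like the first. So, because of alignment, the struct takes 12 cells rather than 8.

Finally X3, as shown in the diagram below:

X3 is much like X1, except that the two one-byte members come first. After the two cases above it should be easy to follow.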
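The cost of each ordering can be expressed as the number of padding bytes the compiler adds. The following C sketch restates the three structs and adds a hypothetical helper, padding_bytes. The figures quoted afterwards assume a 4-byte int aligned to 4 bytes, as in the output above; other platforms may differ.

```c
#include <stddef.h>

struct X1 { int i; char c1; char c2; };
struct X2 { char c1; int i; char c2; };
struct X3 { char c1; char c2; int i; };

/* Hypothetical helper: bytes of padding in a struct that holds exactly
   one int and two chars, given the struct's sizeof. */
size_t padding_bytes(size_t struct_size) {
    return struct_size - (sizeof(int) + 2 * sizeof(char));
}
```

Under that assumption X1 and X3 each carry 2 bytes of padding while X2 carries 6, so keeping the small members next to each other is what saves the space.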


Many people know that memory alignment is the cause, but few will tell you the basic principle behind it. The author above has explained exactly that.

3. What can go wrong if you don't understand memory alignment

Your software may hit performance-killing unaligned memory access exceptions, which invoke very expensive alignment exception handlers.

Your application may attempt to atomically store to an unaligned address, causing your application to lock up.

Your application may attempt to pass an unaligned address to Altivec, resulting in Altivec reading from and/or writing to the wrong part of memory, silently corrupting data or yielding incorrect results.

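4. Memory alignment rules

A. Why memory is aligned

Most references put it like this:

1. Platform (portability) reasons: not every hardware platform can access arbitrary data at arbitrary addresses. Some platforms can fetch certain types of data only at certain addresses and raise a hardware exception otherwise.

2. Performance reasons: data structures (especially the stack) should be aligned on natural boundaries wherever possible. Accessing unaligned memory takes the processor two memory accesses, whereas an aligned access takes only one.

B. Alignment rules

The compiler on each platform has its own default "alignment factor" (also called the alignment modulus). Programmers can change it with the directive #pragma pack(n), n = 1, 2, 4, 8, 16, where n is the alignment factor you want.

Rules:

1. Member alignment: for the data members of a struct (or union), the first member is placed at offset 0, and each later member is aligned to the smaller of the #pragma pack value and the member's own size.

2. Overall alignment: after the members have been aligned individually, the struct (or union) itself is aligned as well, using the smaller of the #pragma pack value and the size of the largest member of the struct (or union).

3. From 1 and 2 it follows that when the #pragma pack value n equals or exceeds the size of every data member, the value of n has no effect.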

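C. Experiments

The examples below walk through these rules in detail.

Compilers: GCC 3.4.2, VC 6.0

Platform: Windows XP

A typical struct alignment

struct definition:

#pragma pack(n) /* n = 1, 2, 4, 8, 16 */
struct test_t {
    int a;
    char b;
    short c;
    char d;
};
#pragma pack()

First confirm the size of each type on the test platform. Both compilers print:

sizeof(char) = 1
sizeof(short) = 2
sizeof(int) = 4

The procedure: change the alignment factor with #pragma pack(n), then look at the value of sizeof(struct test_t).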

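1. 1-byte alignment (#pragma pack(1))

Output: sizeof(struct test_t) = 8 [both compilers agree]

Analysis:

1) Member alignment

#pragma pack(1)
struct test_t {
    int a;   /* size 4 > 1, align to 1; offset = 0, 0 % 1 = 0; occupies [0,3] */
    char b;  /* size 1 = 1, align to 1; offset = 4, 4 % 1 = 0; occupies [4] */
    short c; /* size 2 > 1, align to 1; offset = 5, 5 % 1 = 0; occupies [5,6] */
    char d;  /* size 1 = 1, align to 1; offset = 7, 7 % 1 = 0; occupies [7] */
};
#pragma pack()

Total size of members = 8

2) Overall alignment

Overall alignment factor = min(max(int, short, char), 1) = 1

Overall size = (total size of members) rounded up to a multiple of (overall alignment factor) = 8 /* 8 % 1 = 0 */ [Note 1]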

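2. 2-byte alignment (#pragma pack(2))

Output: sizeof(struct test_t) = 10 [both compilers agree]

Analysis:

1) Member alignment

#pragma pack(2)
struct test_t {
    int a;   /* size 4 > 2, align to 2; offset = 0, 0 % 2 = 0; occupies [0,3] */
    char b;  /* size 1 < 2, align to 1; offset = 4, 4 % 1 = 0; occupies [4] */
    short c; /* size 2 = 2, align to 2; offset = 6, 6 % 2 = 0; occupies [6,7] */
    char d;  /* size 1 < 2, align to 1; offset = 8, 8 % 1 = 0; occupies [8] */
};
#pragma pack()

Total size of members = 9

2) Overall alignment

Overall alignment factor = min(max(int, short, char), 2) = 2

Overall size = (total size of members) rounded up to a multiple of (overall alignment factor) = 10 /* 10 % 2 = 0 */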

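3. 4-byte alignment (#pragma pack(4))

Output: sizeof(struct test_t) = 12 [both compilers agree]

Analysis:

1) Member alignment

#pragma pack(4)
struct test_t {
    int a;   /* size 4 = 4, align to 4; offset = 0, 0 % 4 = 0; occupies [0,3] */
    char b;  /* size 1 < 4, align to 1; offset = 4, 4 % 1 = 0; occupies [4] */
    short c; /* size 2 < 4, align to 2; offset = 6, 6 % 2 = 0; occupies [6,7] */
    char d;  /* size 1 < 4, align to 1; offset = 8, 8 % 1 = 0; occupies [8] */
};
#pragma pack()

Total size of members = 9

2) Overall alignment

Overall alignment factor = min(max(int, short, char), 4) = 4

Overall size = (total size of members) rounded up to a multiple of (overall alignment factor) = 12 /* 12 % 4 = 0 */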

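4. 8-byte alignment (#pragma pack(8))

Output: sizeof(struct test_t) = 12 [both compilers agree]

Analysis:

1) Member alignment

#pragma pack(8)
struct test_t {
    int a;   /* size 4 < 8, align to 4; offset = 0, 0 % 4 = 0; occupies [0,3] */
    char b;  /* size 1 < 8, align to 1; offset = 4, 4 % 1 = 0; occupies [4] */
    short c; /* size 2 < 8, align to 2; offset = 6, 6 % 2 = 0; occupies [6,7] */
    char d;  /* size 1 < 8, align to 1; offset = 8, 8 % 1 = 0; occupies [8] */
};
#pragma pack()

Total size of members = 9

2) Overall alignment

Overall alignment factor = min(max(int, short, char), 8) = 4

Overall size = (total size of members) rounded up to a multiple of (overall alignment factor) = 12 /* 12 % 4 = 0 */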

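5. 16-byte alignment (#pragma pack(16))

Output: sizeof(struct test_t) = 12 [both compilers agree]

Analysis:

1) Member alignment

#pragma pack(16)
struct test_t {
    int a;   /* size 4 < 16, align to 4; offset = 0, 0 % 4 = 0; occupies [0,3] */
    char b;  /* size 1 < 16, align to 1; offset = 4, 4 % 1 = 0; occupies [4] */
    short c; /* size 2 < 16, align to 2; offset = 6, 6 % 2 = 0; occupies [6,7] */
    char d;  /* size 1 < 16, align to 1; offset = 8, 8 % 1 = 0; occupies [8] */
};
#pragma pack()

Total size of members = 9

2) Overall alignment

Overall alignment factor = min(max(int, short, char), 16) = 4

Overall size = (total size of members) rounded up to a multiple of (overall alignment factor) = 12 /* 12 % 4 = 0 */

The 8-byte and 16-byte experiments confirm rule 3: "when the #pragma pack value n equals or exceeds the size of every data member, the value of n has no effect."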
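The first four experiments can be placed side by side in one translation unit. The sketch below is an assumption-laden variant rather than the original test code: it renames the struct per pack value (test_pack1 and so on are hypothetical names) so that all variants can coexist, it uses #pragma pack(push, n) and #pragma pack(pop) instead of the bare form shown above, and it assumes a compiler that accepts these pragmas (GCC, Clang and MSVC do) with a 4-byte int and a 2-byte short.

```c
#include <stddef.h>

#pragma pack(push, 1)
struct test_pack1 { int a; char b; short c; char d; };
#pragma pack(pop)

#pragma pack(push, 2)
struct test_pack2 { int a; char b; short c; char d; };
#pragma pack(pop)

#pragma pack(push, 4)
struct test_pack4 { int a; char b; short c; char d; };
#pragma pack(pop)

#pragma pack(push, 8)
struct test_pack8 { int a; char b; short c; char d; };
#pragma pack(pop)

/* Sizes as computed by the compiler in use, for pack values 1, 2, 4, 8. */
const size_t pack_sizes[4] = {
    sizeof(struct test_pack1),
    sizeof(struct test_pack2),
    sizeof(struct test_pack4),
    sizeof(struct test_pack8)
};
```

Under those assumptions the four entries come out as 8, 10, 12 and 12, the same values the experiments report.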

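Memory allocation and memory alignment are complex topics. They are closely tied to the specific implementation, and the rules differ across operating systems, compilers and hardware platforms. Most systems and languages today manage and allocate memory automatically and hide the low-level operations, which makes applications much simpler to write, and programmers no longer have to think about the details of memory allocation. Even so, in system-level and driver-level development, and in hard real-time or high-security programs, memory layout remains the foundation of a stable, safe, and efficient program.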

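[Note 1]
What is "rounding up"?
For example, in the "overall alignment" step of the 8-byte case above: overall size = 9, rounded up to a multiple of 4 = 12.
The procedure: start at 9 and add one at a time, checking whether the value is divisible by 4. 9, 10 and 11 are not; 12 is, so rounding stops there.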
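The stepping procedure in the note translates directly into code, and it has a well-known closed form that gives the same result without a loop. Both functions below are hypothetical illustrations and assume a non-zero alignment.

```c
#include <stddef.h>

/* The note's procedure: add one until the value divides evenly. */
size_t round_up_by_stepping(size_t n, size_t align) {
    while (n % align != 0) {
        ++n;
    }
    return n;
}

/* Equivalent closed form for any non-zero align. */
size_t round_up_closed_form(size_t n, size_t align) {
    return (n + align - 1) / align * align;
}
```

Both return 12 for a size of 9 and an alignment of 4, and both leave a value that is already a multiple of the alignment unchanged.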

5. Authors

Jonathan Rentzsch, http://www.ibm.com/developerworks/library/pa-dalign/
http://www.cppblog.com/cc/archive/2006/08/01/10765.html (an excellent explanation in Chinese)
http://www.cppblog.com/snailcong/archive/2009/03/16/76705.html (a digest of the English article; see that blog)
http://blogold.chinaunix.net/u3/118340/showart_2615855.html
