您的位置:首页 > 大数据 > 人工智能

理解LGWR,Log File Sync Waits以及Commit的性能问题[转]

2013-11-14 11:32 411 查看
理解LGWR,LogFileSyncWaits以及Commit的性能问题

一.概要:

1.Commit和logfilesync的工作机制

2.为什么logfilewait太久

3.如何去度量问题出在那里呢?

二.logfilesync等待的原因

1.默认情况下我们commit一个事务是要等待logfilesync,这其中包括:

(1)Usercommit(用户提交的统计信息可以通过v$sesstat来查看)

(2)DDL-这一部分主要是由于递归的事务提交所产生

(3)递归的数据字典DML操作

2.Rollbacks导致logfilesync等待

(1)Userrollbacks-用户或者由应用发出的rollback操作所致

(2)Transactionrollbacks:1,由于一些失败的操作导致oracle内部的rollback2.空间分配,或者ASSM相关的问题,以及用户取消的长查询,被kill掉的session等等。

下图为Commit和logfilesync相关的流程图:



Logfilesyncperformance>diskIOspeed


****大多数logfilesync的等待时间其实都是花费在logfileparallelwrite,类似与DBWR会等待dbfileparallelwrite

****其它的logfilesync等待花费在调度延迟,IPC通信延迟等等



1.前台进程对LGWR发出调用,然后到sleep状态下面看看Logfilesync等待的整个流程:

此时logfilesync等待开始记数

次调用在Unix平台是通过信号量来实现

2.LGWR被唤醒,得到CPU时间片来工作

LGWR发出IO请求

LGWR转去sleep,并且等待logfileparallelwrite

3.当在存储级别完成IO调用后OS唤醒LGWR进程

LGWR继续去获得CPU时间片

此时标记logfileparallelwrite等待完成,Post相关信息给前台进程

4.前台进程被LGWR唤醒,前台进程得到CPU时间片并且标记logfilesync等待完成

通过snapper脚本来度量LGWR的速度:



[html]viewplaincopy

<spanstyle="font-family:'ComicSansMS';">---------------------------------------------------------------------------------

SID,USERNAME,TYPE,STATISTIC,DELTA,HDELTA/SEC,%TIME,GRAPH

---------------------------------------------------------------------------------

1096,(LGWR),STAT,messagessent,12,12,

1096,(LGWR),STAT,messagesreceived,10,10,

1096,(LGWR),STAT,backgroundtimeouts,1,1,

1096,(LGWR),STAT,physicalwritetotalIOrequests,40,40,

1096,(LGWR),STAT,physicalwritetotalmultiblockrequest,38,38,

1096,(LGWR),STAT,physicalwritetotalbytes,2884608,2.88M,

1096,(LGWR),STAT,callstokcmgcs,20,20,

1096,(LGWR),STAT,redowastage,4548,4.55k,

1096,(LGWR),STAT,redowrites,10,10,

1096,(LGWR),STAT,redoblockswritten,2817,2.82k,

1096,(LGWR),STAT,redowritetime,25,25,

1096,(LGWR),WAIT,LGWRwaitonLNS,1040575,1.04s,104.1%,|@@@@@@@@@@|

1096,(LGWR),WAIT,logfileparallelwrite,273837,273.84ms,27.4%,|@@@|

1096,(LGWR),WAIT,eventsinwaitclassOther,1035172,1.04s,103.5%,|@@@@@@@@@@|</span>


LGWR和AsynchIO




[html]viewplaincopy

oracle@linux01:~$strace-cp`pgrep-flgwr`

Process12457attached-interrupttoquit

^CProcess12457detached

%timesecondsusecs/callcallserrorssyscall

------------------------------------------------------------

100.000.010000263383semtimedop

0.000.0000000213times

0.000.00000008getrusage

0.000.0000000701gettimeofday

0.000.000000041io_getevents

0.000.000000041io_submit

0.000.00000002semop

0.000.000000037semctl

------------------------------------------------------------

100.000.01000010813total


***io_getevents是在AIO阶段logfileparallelwrite等待事件度量
Redo,commit相关的latchtuning
1.redoallocationlatches-故名思议,在私有现成写redo到logbuffer时保护分配空间的latch
2.redocopylatches-当从私有内存区域copyredo到logbuffer时需要的latch直到相关redo流被copy到logbuffer,,那么LGWR进程
直到已经copy完成可以写buffers到磁盘,此时LGWR将等待LGWRwaitforredocopy事件,相关的可以被调整的参数:_log_simultaneous_copies
等待事件:
1.logfilesync
2.logfileparallelwrite
3.logfilesinglewrite
可以获取相关的统计信息(v$sesstat,v$sysstat)
(1.redosize2.redowritingtime3.usercommits4userrollbacks5.transactionrollbacks)
下面看一个非commit问题的等待事件:logbufferspace-此事件主要是由于redologbuffer太小,LGWR刷出redo导致争用,或者由于IO子系统太慢.根据很多人的经验,相对logbuffer设置大一点能够缓解logfilesync,这只是相对而言.如果你滴业务类型,每次commit都是比较大的写入,而且系统的整个IO已经达到存储子系统的瓶颈,那么增大logbuffer将是无济于事的。根据MOS的很多文档参考,在10g中还是不建议设置次参数。
logfilesinglewrite:
单块写redoIO大多数情况下仅仅用于logfileheaderblock的读和写,其中logswitch是主要的情况,当归档发生时需要updatelogheader,所以可能是LGWR和ARCH等待此事件。
如下是logswitch发生时的trace文件:


[html]viewplaincopy

WAIT#0:nam='logfilesequentialread'ela=12607log#=0

block#=1

WAIT#0:nam='logfilesequentialread'ela=21225log#=1

block#=1

WAIT#0:nam='controlfilesequentialread'ela=358file#=0

WAIT#0:nam='logfilesinglewrite'ela=470log#=0block#=1

WAIT#0:nam='logfilesinglewrite'ela=227log#=1block#=1


从10.2.0.3+开始如果logwrite等待太久,oracle将dump相关的信息:


[html]viewplaincopy

LGWRtracefile:

***2012-06-1011:36:06.759

Warning:logwritetime690ms,size19KB

***2012-06-1011:37:23.778

Warning:logwritetime52710ms,size0KB

***2012-06-1011:37:27.302

Warning:logwritetime3520ms,size144KB

看下面我们某库的AWR信息:


[html]viewplaincopy

LoadProfilePerSecondPerTransactionPerExecPerCall

~~~~~~~~~~~~--------------------------------------------------

Redosize:15,875,849.0121,482.8

Logicalreads:42,403.5324.5

Blockchanges:34,759.1266.0

Physicalreads:46.00.4

Physicalwrites:3,417.926.2

Usercalls:569.64.4

Parses:292.32.2

Hardparses:0.10.0

W/AMBprocessed:0.50.0

Logons:10.70.1

Executes:552.84.2

Rollbacks:42.90.3

Transactions:130.7

[html]viewplaincopy

Top5TimedForegroundEvents

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Avg

wait%DB

EventWaitsTime(s)(ms)timeWaitClass

---------------------------------------------------------------------------

DBCPU37,30176.5

logfilesync1,665,9007,732515.9Commit

dbfilesequentialread711,2216,614913.6UserI/O

bufferbusywaits366,5894401.9Concurrenc

gccurrentmultiblockrequest192,7912301.5Cluster

这是某库的2号节点,还不算太忙,只是由于业务类型每次写入的量都很大,logfilesync等待占用DBtime的16%

看看后台进程等待:


[html]viewplaincopy

<spanstyle="font-family:ComicSansMS;">Avg

%TimeTotalWaitwaitWaits%bg

EventWaits-outsTime(s)(ms)/txntime

--------------------------------------------------------------------------

dbfileparallelwrite11,968,325024,48125.771.2

logfileparallelwrite1,503,19203,86330.711.2</span>

***如上信息logfilesync占用了DBtime的16%,avgwait(5),那么LGWR等待占用的比例为多少呢?占用了平均每次相应的40%.那么这主要是因为业务原因,已经达到我们存储的IO瓶颈.

***此库平均每s的吞吐量在200M左右

下面看看我使用脚本lfsdiag.sql收集的部分信息:


[html]viewplaincopy

INST_IDEVENTWAIT_TIME_MILLIWAIT_COUNT

---------------------------------------------------------------------------

1waitforscnack14243024

1waitforscnack2728196

1waitforscnack41133400

1waitforscnack81157120

1waitforscnack1688333

1waitforscnack323883

1waitforscnack64429

1waitforscnack12880

1waitforscnack25634

1waitforscnack51248

2waitforscnack155024800

2waitforscnack26658764

2waitforscnack46802492

2waitforscnack84400949

2waitforscnack16564950

2waitforscnack3221712

2waitforscnack643190

2waitforscnack128912

2waitforscnack256390

2waitforscnack512508

1logfilesync149708644

1logfilesync24285471

1logfilesync43929029

1logfilesync82273533

1logfilesync16709349

1logfilesync32257827

1logfilesync6410464

1logfilesync1282371

1logfilesync2561582

1logfilesync5121979

1logfilesync10241200

2logfilesync1647580137

2logfilesync256421028

2logfilesync442559988

2logfilesync826002340

2logfilesync1612821558

2logfilesync324429073

2logfilesync64229397

2logfilesync12842685

2logfilesync25622693

2logfilesync51223922

2logfilesync1024215090

1logfileswitchcompletion1141

1logfileswitchcompletion227

1logfileswitchcompletion435

1logfileswitchcompletion872

1logfileswitchcompletion16237

1logfileswitchcompletion32453

1logfileswitchcompletion64387

1logfileswitchcompletion12831

2logfileswitchcompletion1956

2logfileswitchcompletion2508

2logfileswitchcompletion41005

2logfileswitchcompletion81858

2logfileswitchcompletion164506

2logfileswitchcompletion325569

2logfileswitchcompletion646957

2logfileswitchcompletion128319

2logfileswitchcompletion25624

2logfileswitchcompletion512108

2logfileswitchcompletion10241

1logfileparallelwrite156713575

1logfileparallelwrite22952904

1logfileparallelwrite41832942

1logfileparallelwrite8785097

1logfileparallelwrite16386755

1logfileparallelwrite32229099

1logfileparallelwrite648552

1logfileparallelwrite1281461

1logfileparallelwrite256914

1logfileparallelwrite512231

1logfileparallelwrite102421

1logfileparallelwrite20483

2logfileparallelwrite1708078642

2logfileparallelwrite231616460

2logfileparallelwrite416087368

2logfileparallelwrite85656461

2logfileparallelwrite163121042

2logfileparallelwrite321995505

2logfileparallelwrite6444298

2logfileparallelwrite1287506

2logfileparallelwrite2562582

2logfileparallelwrite512536

2logfileparallelwrite1024464

2logfileparallelwrite204826

2logfileparallelwrite40960

2logfileparallelwrite81920

2logfileparallelwrite163840

2logfileparallelwrite327680

2logfileparallelwrite655360

2logfileparallelwrite1310720

2logfileparallelwrite2621440

2logfileparallelwrite5242881

1gcslogflushsync14366103

1gcslogflushsync272108

1gcslogflushsync452374

1gcslogflushsync823374

INST_IDEVENTWAIT_TIME_MILLIWAIT_COUNT

---------------------------------------------------------------------------

1gcslogflushsync1613518

1gcslogflushsync3212450

1gcslogflushsync6411307

1gcslogflushsync1284

2gcslogflushsync19495464

2gcslogflushsync2263718

2gcslogflushsync4222876

2gcslogflushsync8148562

2gcslogflushsync1668586

2gcslogflushsync3233704

2gcslogflushsync645231

2gcslogflushsync1281

1gccurrentblock2-way130064919

1gccurrentblock2-way2353563

1gccurrentblock2-way4239425

1gccurrentblock2-way829994

1gccurrentblock2-way163203

1gccurrentblock2-way321661

1gccurrentblock2-way641501

1gccurrentblock2-way128273

1gccurrentblock2-way256153

1gccurrentblock2-way51222

1gccurrentblock2-way1024119

2gccurrentblock2-way136168617

2gccurrentblock2-way2303236

2gccurrentblock2-way4148934

2gccurrentblock2-way813304

2gccurrentblock2-way162140

2gccurrentblock2-way321635

2gccurrentblock2-way641114

2gccurrentblock2-way128210

2gccurrentblock2-way25628

2gccurrentblock2-way51212

2gccurrentblock2-way102412

2gccurrentblock2-way20483

2gccurrentblock2-way40962

1gccrgrant2-way176502000

1gccrgrant2-way2476023

1gccrgrant2-way4564802

1gccrgrant2-way861560

1gccrgrant2-way165657

1gccrgrant2-way323011

1gccrgrant2-way64440

1gccrgrant2-way128217

1gccrgrant2-way2566

2gccrgrant2-way1155966394

2gccrgrant2-way2740788

2gccrgrant2-way4748834

2gccrgrant2-way859464

2gccrgrant2-way169889

2gccrgrant2-way327236

2gccrgrant2-way64937

2gccrgrant2-way128463

2gccrgrant2-way25614

2gccrgrant2-way5125

2gccrgrant2-way102410

2gccrgrant2-way20482

2gccrgrant2-way40964

2gccrgrant2-way81921

1gcbufferbusy134252868

1gcbufferbusy218723990

1gcbufferbusy49528539

1gcbufferbusy84351426

1gcbufferbusy163691918

1gcbufferbusy32755331

1gcbufferbusy6468712

1gcbufferbusy12810869

1gcbufferbusy2562553

1gcbufferbusy512337

1gcbufferbusy10242933

2gcbufferbusy17881434

2gcbufferbusy22083189

2gcbufferbusy41372486

2gcbufferbusy81957290

2gcbufferbusy161417604

2gcbufferbusy32448992

2gcbufferbusy64544446

2gcbufferbusy128202888

2gcbufferbusy25658584

2gcbufferbusy51216470

2gcbufferbusy102491266

2gcbufferbusy204814

1LGWRwaitforredocopy1278115

1LGWRwaitforredocopy23698

1LGWRwaitforredocopy48498

1LGWRwaitforredocopy8220

1LGWRwaitforredocopy166

1LGWRwaitforredocopy321

2LGWRwaitforredocopy17935371

2LGWRwaitforredocopy229915

2LGWRwaitforredocopy458179

2LGWRwaitforredocopy82472

2LGWRwaitforredocopy16204

2LGWRwaitforredocopy3247

此信息主要来自于V$EVENT_HISTOGRAM,对于判断到底是什么引起的问题,有助于参考。

***Oracle有个bug:logfilesync等待1s,cursor:pinswaitonx也有个bug,莫名等待10ms.

如何tuningsqllogfile:

对于logbuffersize的大小,前面已经提到过,默认的就已经足够大了,但是从经验来看(主要是前辈们,我也尝试过,效果不太明显),所以在allocate相关latch等待,比较多的时候可以考虑增大logbuffersize.9.2以后oracle使用多个logbuffer,也在很大程度上缓解了相关latch等待(每个latch保护自己的buffers).10g出现新功能:privateredostrand,每一个allocatelatch保护自己的redostrand,也出现了IMU的概念,所以logbuffer相关latch的争用已经很少出现.

下面看看相关的参数调整:

10gR1:commit_logging

10gR2:commit_write


Option

Effect

Wait(default)

Ensuresthatthecommitreturnsonlyafterthecorrespondingredoinformationispersistentintheonlineredolog.WhentheclientreceivesasuccessfulreturnfromthisCOMMITstatement,thetransactionhasbeencommittedtodurablemedia.

Afailurethatoccursafterasuccessfulwritetothelogmightpreventthesuccessmessagefromreturningtotheclient,inwhichcasetheclientcannottellwhetherornotthetransactioncommitted.

Nowait

Thecommitreturnstotheclientwhetherornotthewritetotheredologhascompleted.Thisbehaviorcanincreasetransactionthroughput.

Batch

Theredoinformationisbufferedtotheredolog,alongwithotherconcurrentlyexecutingtransactions.Whensufficientredoinformationiscollected,adiskwritetotheredologisinitiated.Thisbehavioriscalledgroupcommit,asredoinformationformultipletransactionsiswrittentotheloginasingleI/Ooperation.

Immediate(default)

LGWRwritesthetransaction'sredoinformationtothelog.BecausethisoperationoptionforcesadiskI/O,itcanreducetransactionthroughput.


以上来自于OnlineDocument,如果你不关心ACID的D(持久性),也就是不关心instancecrash后丢失数据的风险,完全可以采用nowait,但我目前没有见过系统使用该参数,都为默认值.

可能在很多情况下,我们从单纯的DB的层面去tuning,并不能得到很好的效果,从应用层面能够得到很好的效果,但是推动应用修改代码又是一件难而又难的事情。

1.PLSQL,这包括Procedure导致的logfilesync,看下面例子:


[html]viewplaincopy

SW_SIDSTATESW_EVENTSEQ#SEC_IN_WAITSQL_IDSQL_TEXT

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2962WAITINGlogfilesync604400773b3nqmcvwf4BEGINP_MS_UPDATE_SENDTASK(:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,:12,:13,:14,:15,:16,:17);END;

2962WAITINGlogfilesync604400773b3nqmcvwf4BEGINP_MS_UPDATE_SENDTASK(:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,:12,:13,:14,:15,:16,:17);END;

对于这种类型的存过,里面有各种update,insert,delete,每次的处理量比较大,所以我们只能去修改,分散相应的业务逻辑.是每次提交尽可能以合理的批量来做

CPU方面:

也有种可能是在CPU的配置上来优化,LGWR消耗大量的CPU,做法是如果LGWR等待的延迟相当严重,那么可以把LGWR调整到高优先级

IO方面:

如果你的存储IO存在瓶颈,那么logfileparallelwrite事件会比较明显,所以这个调整还是从存储级别,比如采用rawdevice,ASM,更加快速的存储设备等等

下面是如上Begin....End执行的系统的logfilesync曲线,当此过程大量执行的时候,平均等待的时间有所增加,如下图:



内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: