理解LGWR,Log File Sync Waits以及Commit的性能问题[转]
2013-11-14 11:32
411 查看
理解LGWR,LogFileSyncWaits以及Commit的性能问题
一.概要:
1.Commit和logfilesync的工作机制
2.为什么logfilewait太久
3.如何去度量问题出在那里呢?
二.logfilesync等待的原因
1.默认情况下我们commit一个事务是要等待logfilesync,这其中包括:
(1)Usercommit(用户提交的统计信息可以通过v$sesstat来查看)
(2)DDL-这一部分主要是由于递归的事务提交所产生
(3)递归的数据字典DML操作
2.Rollbacks导致logfilesync等待
(1)Userrollbacks-用户或者由应用发出的rollback操作所致
(2)Transactionrollbacks:1,由于一些失败的操作导致oracle内部的rollback2.空间分配,或者ASSM相关的问题,以及用户取消的长查询,被kill掉的session等等。
下图为Commit和logfilesync相关的流程图:
Logfilesyncperformance>diskIOspeed
****大多数logfilesync的等待时间其实都是花费在logfileparallelwrite,类似与DBWR会等待dbfileparallelwrite
****其它的logfilesync等待花费在调度延迟,IPC通信延迟等等
1.前台进程对LGWR发出调用,然后到sleep状态下面看看Logfilesync等待的整个流程:
此时logfilesync等待开始记数
次调用在Unix平台是通过信号量来实现
2.LGWR被唤醒,得到CPU时间片来工作
LGWR发出IO请求
LGWR转去sleep,并且等待logfileparallelwrite
3.当在存储级别完成IO调用后OS唤醒LGWR进程
LGWR继续去获得CPU时间片
此时标记logfileparallelwrite等待完成,Post相关信息给前台进程
4.前台进程被LGWR唤醒,前台进程得到CPU时间片并且标记logfilesync等待完成
通过snapper脚本来度量LGWR的速度:
[html]
<spanstyle="font-family:'ComicSansMS';">---------------------------------------------------------------------------------
SID,USERNAME,TYPE,STATISTIC,DELTA,HDELTA/SEC,%TIME,GRAPH
---------------------------------------------------------------------------------
1096,(LGWR),STAT,messagessent,12,12,
1096,(LGWR),STAT,messagesreceived,10,10,
1096,(LGWR),STAT,backgroundtimeouts,1,1,
1096,(LGWR),STAT,physicalwritetotalIOrequests,40,40,
1096,(LGWR),STAT,physicalwritetotalmultiblockrequest,38,38,
1096,(LGWR),STAT,physicalwritetotalbytes,2884608,2.88M,
1096,(LGWR),STAT,callstokcmgcs,20,20,
1096,(LGWR),STAT,redowastage,4548,4.55k,
1096,(LGWR),STAT,redowrites,10,10,
1096,(LGWR),STAT,redoblockswritten,2817,2.82k,
1096,(LGWR),STAT,redowritetime,25,25,
1096,(LGWR),WAIT,LGWRwaitonLNS,1040575,1.04s,104.1%,|@@@@@@@@@@|
1096,(LGWR),WAIT,logfileparallelwrite,273837,273.84ms,27.4%,|@@@|
1096,(LGWR),WAIT,eventsinwaitclassOther,1035172,1.04s,103.5%,|@@@@@@@@@@|</span>
LGWR和AsynchIO
[html]
oracle@linux01:~$strace-cp`pgrep-flgwr`
Process12457attached-interrupttoquit
^CProcess12457detached
%timesecondsusecs/callcallserrorssyscall
------------------------------------------------------------
100.000.010000263383semtimedop
0.000.0000000213times
0.000.00000008getrusage
0.000.0000000701gettimeofday
0.000.000000041io_getevents
0.000.000000041io_submit
0.000.00000002semop
0.000.000000037semctl
------------------------------------------------------------
100.000.01000010813total
***io_getevents是在AIO阶段logfileparallelwrite等待事件度量
Redo,commit相关的latchtuning
1.redoallocationlatches-故名思议,在私有现成写redo到logbuffer时保护分配空间的latch
2.redocopylatches-当从私有内存区域copyredo到logbuffer时需要的latch直到相关redo流被copy到logbuffer,,那么LGWR进程
直到已经copy完成可以写buffers到磁盘,此时LGWR将等待LGWRwaitforredocopy事件,相关的可以被调整的参数:_log_simultaneous_copies
等待事件:
1.logfilesync
2.logfileparallelwrite
3.logfilesinglewrite
可以获取相关的统计信息(v$sesstat,v$sysstat)
(1.redosize2.redowritingtime3.usercommits4userrollbacks5.transactionrollbacks)
下面看一个非commit问题的等待事件:logbufferspace-此事件主要是由于redologbuffer太小,LGWR刷出redo导致争用,或者由于IO子系统太慢.根据很多人的经验,相对logbuffer设置大一点能够缓解logfilesync,这只是相对而言.如果你滴业务类型,每次commit都是比较大的写入,而且系统的整个IO已经达到存储子系统的瓶颈,那么增大logbuffer将是无济于事的。根据MOS的很多文档参考,在10g中还是不建议设置次参数。
logfilesinglewrite:
单块写redoIO大多数情况下仅仅用于logfileheaderblock的读和写,其中logswitch是主要的情况,当归档发生时需要updatelogheader,所以可能是LGWR和ARCH等待此事件。
如下是logswitch发生时的trace文件:
[html]
WAIT#0:nam='logfilesequentialread'ela=12607log#=0
block#=1
WAIT#0:nam='logfilesequentialread'ela=21225log#=1
block#=1
WAIT#0:nam='controlfilesequentialread'ela=358file#=0
WAIT#0:nam='logfilesinglewrite'ela=470log#=0block#=1
WAIT#0:nam='logfilesinglewrite'ela=227log#=1block#=1
从10.2.0.3+开始如果logwrite等待太久,oracle将dump相关的信息:
[html]
LGWRtracefile:
***2012-06-1011:36:06.759
Warning:logwritetime690ms,size19KB
***2012-06-1011:37:23.778
Warning:logwritetime52710ms,size0KB
***2012-06-1011:37:27.302
Warning:logwritetime3520ms,size144KB
看下面我们某库的AWR信息:
[html]
LoadProfilePerSecondPerTransactionPerExecPerCall
~~~~~~~~~~~~--------------------------------------------------
Redosize:15,875,849.0121,482.8
Logicalreads:42,403.5324.5
Blockchanges:34,759.1266.0
Physicalreads:46.00.4
Physicalwrites:3,417.926.2
Usercalls:569.64.4
Parses:292.32.2
Hardparses:0.10.0
W/AMBprocessed:0.50.0
Logons:10.70.1
Executes:552.84.2
Rollbacks:42.90.3
Transactions:130.7
[html]
Top5TimedForegroundEvents
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Avg
wait%DB
EventWaitsTime(s)(ms)timeWaitClass
---------------------------------------------------------------------------
DBCPU37,30176.5
logfilesync1,665,9007,732515.9Commit
dbfilesequentialread711,2216,614913.6UserI/O
bufferbusywaits366,5894401.9Concurrenc
gccurrentmultiblockrequest192,7912301.5Cluster
这是某库的2号节点,还不算太忙,只是由于业务类型每次写入的量都很大,logfilesync等待占用DBtime的16%
看看后台进程等待:
[html]
<spanstyle="font-family:ComicSansMS;">Avg
%TimeTotalWaitwaitWaits%bg
EventWaits-outsTime(s)(ms)/txntime
--------------------------------------------------------------------------
dbfileparallelwrite11,968,325024,48125.771.2
logfileparallelwrite1,503,19203,86330.711.2</span>
***如上信息logfilesync占用了DBtime的16%,avgwait(5),那么LGWR等待占用的比例为多少呢?占用了平均每次相应的40%.那么这主要是因为业务原因,已经达到我们存储的IO瓶颈.
***此库平均每s的吞吐量在200M左右
下面看看我使用脚本lfsdiag.sql收集的部分信息:
[html]
INST_IDEVENTWAIT_TIME_MILLIWAIT_COUNT
---------------------------------------------------------------------------
1waitforscnack14243024
1waitforscnack2728196
1waitforscnack41133400
1waitforscnack81157120
1waitforscnack1688333
1waitforscnack323883
1waitforscnack64429
1waitforscnack12880
1waitforscnack25634
1waitforscnack51248
2waitforscnack155024800
2waitforscnack26658764
2waitforscnack46802492
2waitforscnack84400949
2waitforscnack16564950
2waitforscnack3221712
2waitforscnack643190
2waitforscnack128912
2waitforscnack256390
2waitforscnack512508
1logfilesync149708644
1logfilesync24285471
1logfilesync43929029
1logfilesync82273533
1logfilesync16709349
1logfilesync32257827
1logfilesync6410464
1logfilesync1282371
1logfilesync2561582
1logfilesync5121979
1logfilesync10241200
2logfilesync1647580137
2logfilesync256421028
2logfilesync442559988
2logfilesync826002340
2logfilesync1612821558
2logfilesync324429073
2logfilesync64229397
2logfilesync12842685
2logfilesync25622693
2logfilesync51223922
2logfilesync1024215090
1logfileswitchcompletion1141
1logfileswitchcompletion227
1logfileswitchcompletion435
1logfileswitchcompletion872
1logfileswitchcompletion16237
1logfileswitchcompletion32453
1logfileswitchcompletion64387
1logfileswitchcompletion12831
2logfileswitchcompletion1956
2logfileswitchcompletion2508
2logfileswitchcompletion41005
2logfileswitchcompletion81858
2logfileswitchcompletion164506
2logfileswitchcompletion325569
2logfileswitchcompletion646957
2logfileswitchcompletion128319
2logfileswitchcompletion25624
2logfileswitchcompletion512108
2logfileswitchcompletion10241
1logfileparallelwrite156713575
1logfileparallelwrite22952904
1logfileparallelwrite41832942
1logfileparallelwrite8785097
1logfileparallelwrite16386755
1logfileparallelwrite32229099
1logfileparallelwrite648552
1logfileparallelwrite1281461
1logfileparallelwrite256914
1logfileparallelwrite512231
1logfileparallelwrite102421
1logfileparallelwrite20483
2logfileparallelwrite1708078642
2logfileparallelwrite231616460
2logfileparallelwrite416087368
2logfileparallelwrite85656461
2logfileparallelwrite163121042
2logfileparallelwrite321995505
2logfileparallelwrite6444298
2logfileparallelwrite1287506
2logfileparallelwrite2562582
2logfileparallelwrite512536
2logfileparallelwrite1024464
2logfileparallelwrite204826
2logfileparallelwrite40960
2logfileparallelwrite81920
2logfileparallelwrite163840
2logfileparallelwrite327680
2logfileparallelwrite655360
2logfileparallelwrite1310720
2logfileparallelwrite2621440
2logfileparallelwrite5242881
1gcslogflushsync14366103
1gcslogflushsync272108
1gcslogflushsync452374
1gcslogflushsync823374
INST_IDEVENTWAIT_TIME_MILLIWAIT_COUNT
---------------------------------------------------------------------------
1gcslogflushsync1613518
1gcslogflushsync3212450
1gcslogflushsync6411307
1gcslogflushsync1284
2gcslogflushsync19495464
2gcslogflushsync2263718
2gcslogflushsync4222876
2gcslogflushsync8148562
2gcslogflushsync1668586
2gcslogflushsync3233704
2gcslogflushsync645231
2gcslogflushsync1281
1gccurrentblock2-way130064919
1gccurrentblock2-way2353563
1gccurrentblock2-way4239425
1gccurrentblock2-way829994
1gccurrentblock2-way163203
1gccurrentblock2-way321661
1gccurrentblock2-way641501
1gccurrentblock2-way128273
1gccurrentblock2-way256153
1gccurrentblock2-way51222
1gccurrentblock2-way1024119
2gccurrentblock2-way136168617
2gccurrentblock2-way2303236
2gccurrentblock2-way4148934
2gccurrentblock2-way813304
2gccurrentblock2-way162140
2gccurrentblock2-way321635
2gccurrentblock2-way641114
2gccurrentblock2-way128210
2gccurrentblock2-way25628
2gccurrentblock2-way51212
2gccurrentblock2-way102412
2gccurrentblock2-way20483
2gccurrentblock2-way40962
1gccrgrant2-way176502000
1gccrgrant2-way2476023
1gccrgrant2-way4564802
1gccrgrant2-way861560
1gccrgrant2-way165657
1gccrgrant2-way323011
1gccrgrant2-way64440
1gccrgrant2-way128217
1gccrgrant2-way2566
2gccrgrant2-way1155966394
2gccrgrant2-way2740788
2gccrgrant2-way4748834
2gccrgrant2-way859464
2gccrgrant2-way169889
2gccrgrant2-way327236
2gccrgrant2-way64937
2gccrgrant2-way128463
2gccrgrant2-way25614
2gccrgrant2-way5125
2gccrgrant2-way102410
2gccrgrant2-way20482
2gccrgrant2-way40964
2gccrgrant2-way81921
1gcbufferbusy134252868
1gcbufferbusy218723990
1gcbufferbusy49528539
1gcbufferbusy84351426
1gcbufferbusy163691918
1gcbufferbusy32755331
1gcbufferbusy6468712
1gcbufferbusy12810869
1gcbufferbusy2562553
1gcbufferbusy512337
1gcbufferbusy10242933
2gcbufferbusy17881434
2gcbufferbusy22083189
2gcbufferbusy41372486
2gcbufferbusy81957290
2gcbufferbusy161417604
2gcbufferbusy32448992
2gcbufferbusy64544446
2gcbufferbusy128202888
2gcbufferbusy25658584
2gcbufferbusy51216470
2gcbufferbusy102491266
2gcbufferbusy204814
1LGWRwaitforredocopy1278115
1LGWRwaitforredocopy23698
1LGWRwaitforredocopy48498
1LGWRwaitforredocopy8220
1LGWRwaitforredocopy166
1LGWRwaitforredocopy321
2LGWRwaitforredocopy17935371
2LGWRwaitforredocopy229915
2LGWRwaitforredocopy458179
2LGWRwaitforredocopy82472
2LGWRwaitforredocopy16204
2LGWRwaitforredocopy3247
此信息主要来自于V$EVENT_HISTOGRAM,对于判断到底是什么引起的问题,有助于参考。
***Oracle有个bug:logfilesync等待1s,cursor:pinswaitonx也有个bug,莫名等待10ms.
如何tuningsqllogfile:
对于logbuffersize的大小,前面已经提到过,默认的就已经足够大了,但是从经验来看(主要是前辈们,我也尝试过,效果不太明显),所以在allocate相关latch等待,比较多的时候可以考虑增大logbuffersize.9.2以后oracle使用多个logbuffer,也在很大程度上缓解了相关latch等待(每个latch保护自己的buffers).10g出现新功能:privateredostrand,每一个allocatelatch保护自己的redostrand,也出现了IMU的概念,所以logbuffer相关latch的争用已经很少出现.
下面看看相关的参数调整:
10gR1:commit_logging
10gR2:commit_write
Option | Effect |
Wait(default) | Ensuresthatthecommitreturnsonlyafterthecorrespondingredoinformationispersistentintheonlineredolog.WhentheclientreceivesasuccessfulreturnfromthisCOMMITstatement,thetransactionhasbeencommittedtodurablemedia. Afailurethatoccursafterasuccessfulwritetothelogmightpreventthesuccessmessagefromreturningtotheclient,inwhichcasetheclientcannottellwhetherornotthetransactioncommitted. |
Nowait | Thecommitreturnstotheclientwhetherornotthewritetotheredologhascompleted.Thisbehaviorcanincreasetransactionthroughput. |
Batch | Theredoinformationisbufferedtotheredolog,alongwithotherconcurrentlyexecutingtransactions.Whensufficientredoinformationiscollected,adiskwritetotheredologisinitiated.Thisbehavioriscalledgroupcommit,asredoinformationformultipletransactionsiswrittentotheloginasingleI/Ooperation. |
Immediate(default) | LGWRwritesthetransaction'sredoinformationtothelog.BecausethisoperationoptionforcesadiskI/O,itcanreducetransactionthroughput. |
以上来自于OnlineDocument,如果你不关心ACID的D(持久性),也就是不关心instancecrash后丢失数据的风险,完全可以采用nowait,但我目前没有见过系统使用该参数,都为默认值.
可能在很多情况下,我们从单纯的DB的层面去tuning,并不能得到很好的效果,从应用层面能够得到很好的效果,但是推动应用修改代码又是一件难而又难的事情。
1.PLSQL,这包括Procedure导致的logfilesync,看下面例子:
[html]
SW_SIDSTATESW_EVENTSEQ#SEC_IN_WAITSQL_IDSQL_TEXT
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2962WAITINGlogfilesync604400773b3nqmcvwf4BEGINP_MS_UPDATE_SENDTASK(:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,:12,:13,:14,:15,:16,:17);END;
2962WAITINGlogfilesync604400773b3nqmcvwf4BEGINP_MS_UPDATE_SENDTASK(:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,:12,:13,:14,:15,:16,:17);END;
对于这种类型的存过,里面有各种update,insert,delete,每次的处理量比较大,所以我们只能去修改,分散相应的业务逻辑.是每次提交尽可能以合理的批量来做
CPU方面:
也有种可能是在CPU的配置上来优化,LGWR消耗大量的CPU,做法是如果LGWR等待的延迟相当严重,那么可以把LGWR调整到高优先级
IO方面:
如果你的存储IO存在瓶颈,那么logfileparallelwrite事件会比较明显,所以这个调整还是从存储级别,比如采用rawdevice,ASM,更加快速的存储设备等等
下面是如上Begin....End执行的系统的logfilesync曲线,当此过程大量执行的时候,平均等待的时间有所增加,如下图:
相关文章推荐
- 理解LGWR,Log File Sync Waits以及Commit的性能问题
- 如何在awr里面查看 log file sync是否是由用户commit太多造成
- elasticsearch核心知识--30.分页搜索以及deep paging性能问题深度理解和es中聚合aggregation的分组可能结果不准确的原因
- RAC 性能分析 - 'log file sync' 等待事件
- 2017-04-12 DBA日记,频繁commit导致的log file sync的诊断
- Troubleshooting: 'Log file sync' Waits
- log file sync致数据库性能摇摆
- Log file sync caused by more commit
- session存放数据过大导致频繁GC影响服务器性能以及高并发问题解决
- Spring MVC 解决跨域问题&以及对@RequestBody的理解
- log file sync和log file parallel write的关系
- Statspack之十四-"log file sync" 等待事件
- MySQL5.7 binlog_group_commit_sync_delay对并发复制性能的影响
- sync_binlog innodb_flush_log_at_trx_commit 浅析
- log file sync(日志文件同步) 与 Log file parallel write 等待事件
- eclipse 文件系统不同步问题resource is out of sync with the file system的解决办法
- php中include,require的文件包含问题,以及$_SERVER['PHP_SELF']和__FILE__的区别
- 用Fusion Log诊断"Could not load file or assembly"问题
- 理解性能的奥秘——应用程序中慢,SSMS中快(4)——收集解决参数嗅探问题的信息
- java.io.File的基本使用以及关于Windows中大小写问题