
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'

2016-11-10 11:26
In the early hours of the morning I got a call from a colleague reporting that the application was throwing errors when accessing the Oracle database. The symptoms confirmed on site at the time:
1. The application could not access the database. A test with SQL Developer also failed to connect, reporting ORA-12570 TNS:packet reader failure.
2. Checking the listener with lsnrctl status got no response at all, which is extremely rare.
3. The database status was OPEN. Checking system resources with nmon (see the screenshot taken at the time) showed that CPU utilization was not high but CPU Wait% was very high, which means I/O was abnormal and there were probably I/O waits and contention. (CPU Wait% shows, for the sampling interval, the proportion of time all CPUs spent idle while waiting for I/O to complete. Wait% is one form of CPU idle state: when a CPU is idle while some process is in the D state (uninterruptible sleep), the system counts that time towards Wait%. Wait% is not an absolute time but a ratio, so for the same amount of I/O wait time, the more CPUs a server has, the lower Wait% is; it reflects the proportion of I/O work relative to computation. For I/O-intensive applications Wait% is generally high.)

4. Opening my mailbox, I found a large number of emails from the alert-log monitoring job. Checking the alert log showed a large number of ORA- errors; a partial excerpt follows:

3||ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
10||ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
17||ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
24||ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
31||ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
38||ORA-00239: timeout waiting for control file enqueue: held by 'inst 1, osid 5166' for more than 900 seconds
41||ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
48||ORA-00239: timeout waiting for control file enqueue: held by 'inst 1, osid 5166' for more than 900 seconds
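As an aside, here is a minimal sketch of the first-response checks described in points 1-4 above, assuming a local SYSDBA (bequeath) connection on the database server is still possible; the SID "epps" is taken from the control file path shown later, and the alert log location assumes a 10g-style bdump directory:

# Listener health (in this incident lsnrctl status simply hung)
lsnrctl status

# A local bequeath connection bypasses the listener, so the instance status can still be checked
export ORACLE_SID=epps
sqlplus -S / as sysdba <<'EOF'
SELECT status FROM v$instance;
EOF

# OS-side view of the "CPU Wait%" symptom: the wa column in vmstat, %iowait and device %util in iostat
vmstat 5 3
iostat -x 5 3

# Pull the most recent ORA- errors (with line numbers) out of the alert log
grep -n "ORA-" $ORACLE_BASE/admin/$ORACLE_SID/bdump/alert_${ORACLE_SID}.log | tail -20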

Regarding the error "ORA-00494: enqueue [CF] held for too long (more than 900 seconds)...", let's first look at its description:

[oracle@DB-Server ~]$ oerr ora 494
00494, 00000, "enqueue %s held for too long (more than %s seconds) by 'inst %s, osid %s'"
// *Cause:  The specified process did not release the enqueue within
//          the maximum allowed time.
// *Action: Reissue any commands that failed and contact Oracle Support
//          Services with the incident information.

The appearance of ORA-00494 means the instance has crashed; refer to the official note Database Crashes With ORA-00494 (Doc ID 753290.1):
This error can also be accompanied by ORA-600 [2103] which is basically the same problem - a process was unable to obtain the CF enqueue within the specified timeout (default 900 seconds).
This behavior can be correlated with server high load and high concurrency on resources, IO waits and contention, which keep the Oracle background processes from receiving the necessary resources.
Cause #1: The lgwr has killed the ckpt process, causing the instance to crash.
From the alert.log we can see:
The database has waited too long for a CF enqueue, so the next error is reported:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 38356'
Then the LGWR killed the blocker, which was in this case the CKPT process which then causes the instance to crash.
Checking the alert.log further we can see that the frequency of redo log files switch is very high (almost every 1 min).
Cause #2: Checking the I/O State in the AWR report we find that:
Average Read per ms (Av Rd(ms)) for the database files which are located on this mount point "/oracle/oa1l/data/" is facing I/O issue as per the data collection which was performed
Cause #3: The problem has been investigated in Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4'
and unpublished Bug 7914003 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED'
The ORA-00494 error occurs during periods of super-high stress, activity to the point where the server becomes unresponsive due to overloaded disk I/O, CPU or RAM.
From the analysis above, all three causes are possible, but more information and evidence are needed to confirm what actually caused the ORA-00494 error and, in turn, the instance crash.
1: The alert log contains the error "ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'". CF refers to the control file schema global enqueue. If a process cannot obtain the CF enqueue within the specified time (900 seconds by default), the process currently holding the CF enqueue is killed. The timeout is controlled by the hidden parameter _controlfile_enqueue_timeout:

SQL> COL NAME FOR A45;
SQL> COL VALUE FOR A32;
SQL> COL DESCR FOR A80;
SQL> SELECT x.ksppinm NAME
  2       , y.ksppstvl VALUE
  3       , x.ksppdesc DESCR
  4  FROM SYS.x$ksppi x, SYS.x$ksppcv y
  5  WHERE x.indx = y.indx
  6  AND x.ksppinm LIKE '%&par%';
Enter value for par: controlfile_enqueue
old   6: AND x.ksppinm LIKE '%&par%'
new   6: AND x.ksppinm LIKE '%controlfile_enqueue%'

NAME                                  VALUE    DESCR
------------------------------------- -------- ----------------------------------------------------------
_controlfile_enqueue_timeout          900      control file enqueue timeout in seconds
_controlfile_enqueue_holding_time     120      control file enqueue max holding time in seconds
_controlfile_enqueue_dump             FALSE    dump the system states after controlfile enqueue timeout
_kill_controlfile_enqueue_blocker     TRUE     enable killing controlfile enqueue blocker on timeout

SQL>
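As a side note, the last parameter above controls whether the blocker is killed at all. A workaround often discussed for ORA-00494 is to disable that behaviour; the following is only a hedged sketch (apply it solely under Oracle Support guidance, since the real fix is the underlying I/O problem):

sqlplus -S / as sysdba <<'EOF'
-- Stop background processes from killing the control file enqueue blocker on timeout;
-- the hidden parameter must be quoted, and the change takes effect after an instance restart.
ALTER SYSTEM SET "_kill_controlfile_enqueue_blocker" = FALSE SCOPE = SPFILE;
EOF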


Checking the redo log switch frequency showed that switches were infrequent between 00:00 and 02:00 on 2016-11-09 and between 22:00 and 24:00 on 2016-11-08, which rules out heavy DML activity. Based on the analysis so far we still cannot completely rule out Cause #1, so we went on to look at other information. The query used to get the hourly switch counts:

SELECT
    TO_CHAR(FIRST_TIME, 'YYYY-MM-DD') DAY,
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '00', 1, 0)), '99') "00",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '01', 1, 0)), '99') "01",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '02', 1, 0)), '99') "02",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '03', 1, 0)), '99') "03",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '04', 1, 0)), '99') "04",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '05', 1, 0)), '99') "05",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '06', 1, 0)), '99') "06",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '07', 1, 0)), '99') "07",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '08', 1, 0)), '99') "08",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '09', 1, 0)), '99') "09",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '10', 1, 0)), '99') "10",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '11', 1, 0)), '99') "11",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '12', 1, 0)), '99') "12",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '13', 1, 0)), '99') "13",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '14', 1, 0)), '99') "14",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '15', 1, 0)), '99') "15",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '16', 1, 0)), '99') "16",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '17', 1, 0)), '99') "17",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '18', 1, 0)), '99') "18",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '19', 1, 0)), '99') "19",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '20', 1, 0)), '99') "20",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '21', 1, 0)), '99') "21",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '22', 1, 0)), '99') "22",
    TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME, 'HH24'), '23', 1, 0)), '99') "23"
FROM
    V$LOG_HISTORY
GROUP BY
    TO_CHAR(FIRST_TIME, 'YYYY-MM-DD')
ORDER BY 1 DESC;




2: Regarding "The problem has been investigated in Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4' and unpublished Bug 7914003 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED'": the alert log does contain ORA-00239, but there are no errors such as ORA-603 or ORA-00470. According to the official note Disk I/O Contention/Slow Can Lead to ORA-239 and Instance Crash (Doc ID 1068799.1):
I/O contention or slowness leads to control file enqueue timeout.
One particular situation that can be seen is LGWR timeout while waiting for control file enqueue, and the blocker is CKPT:
From the AWR:
1) high "log file parallel write" and "control file sequential read" waits
2) Very slow Tablespace I/O, Av Rd(ms) of 1000-4000 ms (when lower than 20 ms is acceptable)
3) very high %iowait: 98.57%.
4) confirmed IO peak during that time
Please note: Remote archive destination is also a possible cause. Networking issues can also cause this type of issue when a remote archive destination is in use for a standby database.
This server had been running normally for many years, so we lean towards an I/O problem, which also matches the very high CPU Wait% at the time; in other words, there were probably severe I/O waits and contention. A quick cross-check from inside the database is to look at per-file average read times, as in the sketch below.
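Note that v$filestat holds cumulative statistics since instance startup, so this only approximates the per-interval Av Rd (ms) figure that AWR reports:

sqlplus -S / as sysdba <<'EOF'
SET LINESIZE 200 PAGESIZE 100
COL name FOR A60
-- READTIM is recorded in centiseconds, so readtim * 10 / phyrds approximates the average read time in ms
SELECT f.name,
       s.phyrds,
       ROUND(s.readtim * 10 / NULLIF(s.phyrds, 0), 1) AS av_rd_ms
  FROM v$filestat s, v$datafile f
 WHERE s.file# = f.file#
 ORDER BY av_rd_ms DESC;
EOF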
3: Now let's look at some of the reports generated for this period by the monitoring tool OSWatcher. As shown below, the CPUs were largely idle.
[Chart: Operating System CPU Utilization]

CPU time spent waiting for I/O (Wait IO) also started to grow after 10:45 PM (22:45), while CPU utilization stayed low the whole time, at most a little over 20%.



[Chart: Operating System CPU Other]

Then let's look at Operating System I/O. As the charts below show, the I/O devices became extremely busy starting around 23:00, so it is very likely that abnormal I/O caused the ORA-00494 errors in the database.
[Chart: Operating System I/O]

[Chart: Operating System I/O Throughput]

Next we checked the operating system logs, as shown below:



As the screenshot shows, messages of the form "INFO: task kjournald:xxx blocked for more than 120 seconds." appear in large numbers starting from 23:22. They were caused by the PlateSpin replication job (later confirmed to have started at 22:40). So at this point we lean even more towards the second cause (I/O) as what crashed the database instance. The system administrator later confirmed that the PlateSpin replication job had also failed. Putting all of this together, we strongly suspect that the PlateSpin job caused the I/O abnormality, and when I/O stopped responding, briefly or for a longer period, the database instance crashed.
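A hedged sketch for locating these hung-task messages in the system log (the path assumes RHEL/OEL syslog defaults):

# Kernel hung-task messages around the incident window
grep -n "blocked for more than 120 seconds" /var/log/messages | tail -20
dmesg | grep -i "blocked for more than" | tail -20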









Follow-up handling and resolution
At this point shutdown immediate could not close the database and got no response at all, so the only option was shutdown abort. When starting the instance again, however, startup failed with the following errors:

ORA-01102: cannot mount database in EXCLUSIVE mode
ORA-00205: error in identifying control file, check alert log for more info
ORA-00202: control file: '/u01/app/oracle/oradata/epps/control01.ctl'
ORA-27086: unable to lock file - already in use

I will not go into this error in depth here; see the article "ORA-01102: cannot mount database in EXCLUSIVE mode". After killing most of the leftover processes, three processes could not be killed even with kill -9, as shown in the screenshot below. (A hedged sketch of how to find which processes were still holding the files follows.)
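The lock file name and location below follow common defaults and are assumptions; the control file path is taken from the ORA-00202 message above:

# Who is still holding the control file named in ORA-00202 (the cause of ORA-27086)?
fuser -v /u01/app/oracle/oradata/epps/control01.ctl

# The instance startup lock file behind ORA-01102 (typically $ORACLE_HOME/dbs/lk<SID> in upper case)
fuser -v $ORACLE_HOME/dbs/lkEPPS

# Leftover background processes of the crashed instance
ps -ef | grep -v grep | grep "ora_.*_epps"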



kill -9 sends the SIGKILL signal to terminate a process, but it has no effect in the following two cases (a quick way to check the process state is sketched after this list):
a. The process is in the "Zombie" state (shown as defunct by the ps command). Such a process has already released all of its resources but has not yet been reaped by its parent. A zombie process only disappears at the next reboot, but its existence does not affect system performance.
b. The process is in "kernel mode" and is waiting for a resource that cannot be obtained. A process in kernel mode ignores all signal handling, so a process permanently stuck in kernel mode can only be cleared by rebooting the system. A Linux process runs in one of two modes, user mode and kernel mode, and only a process running in user mode can be terminated with the kill command.
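A hedged sketch for checking which of the two states a stubborn process is in:

# STAT beginning with Z = zombie (defunct); D = uninterruptible sleep, i.e. stuck in the kernel, usually on I/O
ps -eo pid,stat,wchan:30,cmd | awk 'NR == 1 || $2 ~ /^[DZ]/'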
Since these processes were stuck in kernel mode, were very unlikely to wake up on their own, and would not accept signals, the only option left was to reboot the system. After the reboot the problem was resolved. We then agreed with the system administrator to suspend the PlateSpin job for the time being, take a new full backup on Sunday, and keep observing the I/O impact.