
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) cause instance crash

2017-05-17 10:55
Symptom: alert log excerpts

Restarting dead background process MMON

Tue May 16 17:24:51 2017

Process 0x0x6b0f14da8 appears to be hung while dumping

Current time = 465593890, process death time = 465533027 interval = 60000

Attempting to kill process 0x0x6b0f14da8 with OS pid = 61258

OSD kill succeeded for process 0x6b0f14da8

  VERSION INFORMATION:

        TNS for Linux: Version 11.2.0.1.0 - Production

        Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.1.0 - Production

        TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.1.0 - Production

  Time: 16-MAY-2017 17:24:51

  Tracing not turned on.

  Tns error struct:

    ns main err code: 12535

Tue May 16 17:24:52 2017

Errors in file /oracle/ora11g/diag/rdbms/lhdw/LHDW/trace/LHDW_m000_54933.trc:

ORA-12751: cpu time or run time policy violation

TNS-12535: TNS:operation timed out

    ns secondary err code: 12560

    nt main err code: 505

TNS-00505: Operation timed out

    nt secondary err code: 110

    nt OS err code: 0

  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.17.0.52)(PORT=65434))

Restarting dead background process MMON

Tue May 16 17:24:51 2017

Errors in file /oracle/ora11g/diag/rdbms/lhdw/LHDW/trace/LHDW_arc0_16428.trc  (incident=258198):

ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 4595'

Tue May 16 17:24:51 2017

Errors in file /oracle/ora11g/diag/rdbms/lhdw/LHDW/trace/LHDW_arc2_50083.trc  (incident=258214):

ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 4595'

Incident details in: /oracle/ora11g/diag/rdbms/lhdw/LHDW/incident/incdir_258198/LHDW_arc0_16428_i258198.trc

Incident details in: /oracle/ora11g/diag/rdbms/lhdw/LHDW/incident/incdir_258214/LHDW_arc2_50083_i258214.trc

Tue May 16 17:24:51 2017

Thread 1 advanced to log sequence 86188 (LGWR switch)

  Current log# 10 seq# 86188 mem# 0: /oracle/ora11g/oradata/LHDW/redo10.log

  Current log# 10 seq# 86188 mem# 1: /oracle/ora11g/flash_recovery_area/LHDW/redo10a.log

Tue May 16 17:04:56 2017

LNS: Standby redo logfile selected for thread 1 sequence 86188 for destination LOG_ARCHIVE_DEST_2

Tue May 16 17:12:23 2017

***********************************************************************

Tue May 16 17:21:50 2017

AUD: Audit Commit Delay exceeded, written a copy to OS Audit Trail

Fatal NI connect error 12170.


Tue May 16 17:24:52 2017

Errors in file /oracle/ora11g/diag/rdbms/lhdw/LHDW/trace/LHDW_arc1_50394.trc  (incident=258205):

ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 4595'

Errors in file /oracle/ora11g/diag/rdbms/lhdw/LHDW/trace/LHDW_arc3_16457.trc  (incident=258222):

ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 4595'

Incident details in: /oracle/ora11g/diag/rdbms/lhdw/LHDW/incident/incdir_258222/LHDW_arc3_16457_i258222.trc

Incident details in: /oracle/ora11g/diag/rdbms/lhdw/LHDW/incident/incdir_258205/LHDW_arc1_50394_i258205.trc

Killing enqueue blocker (pid=4595) on resource CF-00000000-00000000 by (pid=16428) by killing session 586.1

Killing enqueue blocker (pid=4595) on resource CF-00000000-00000000 by (pid=50083) by killing session 586.1

Killing enqueue blocker (pid=4595) on resource CF-00000000-00000000 by (pid=16428) by terminating the process

Killing enqueue blocker (pid=4595) on resource CF-00000000-00000000 by (pid=50083) by terminating the process

ARC0 (ospid: 16428): terminating the instance due to error 2103

Tue May 16 17:25:04 2017

Instance terminated by ARC0, pid = 16428

[oracle@hadoop-21 trace]$ ls -trl *4595*

-rw-r----- 1 oracle oinstall   894 May 16 17:00 LHDW_ckpt_4595.trm

-rw-r----- 1 oracle oinstall 62781 May 16 17:00 LHDW_ckpt_4595.trc
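The trace-file names above show that OS pid 4595 (the blocker reported by ORA-00494) belongs to the CKPT background process. While an instance is still up, the same mapping can be confirmed with a query along these lines (a sketch using the standard v$process/v$session views; the spid value is taken from this incident):

```sql
-- Map an OS pid to its Oracle background process name and session.
SELECT p.spid, p.pname, s.sid, s.serial#, s.program
  FROM v$process p
  LEFT JOIN v$session s ON s.paddr = p.addr
 WHERE p.spid = '4595';
```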

##CAUSE

I/O contention or slowness leads to control file enqueue timeout.

One particular situation that can be seen is an LGWR timeout while waiting for the control file enqueue, with CKPT as the blocker:

From the AWR:

1) High "log file parallel write" and "control file sequential read" waits.

2) Very slow tablespace I/O: Av Rd(ms) of 1000-4000 ms (below 20 ms is acceptable).

3) Very high %iowait: 98.57%.

4) A confirmed I/O peak during that window, which was the cause of this incident.

Please note: a remote archive destination is another possible cause; networking issues can produce the same symptoms when a remote archive destination is in use for a standby database.
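When a remote archive destination is suspected, its state can be checked as a first step; a sketch (dest_id 2 corresponds to LOG_ARCHIVE_DEST_2 seen earlier in this alert log):

```sql
-- Check the standby archive destination for transport errors.
SELECT dest_id, status, error
  FROM v$archive_dest
 WHERE dest_id = 2;
```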

##SOLUTION

Since the problem is caused by I/O contention or slowness, improving I/O performance will prevent the issue from recurring.

It is expected behavior that the control file enqueue requestor terminates itself with ORA-239 if the state machine fails to kill the blocker. In most cases, crashing the instance prevents a cluster-wide hang; this is the default setting.

There is a workaround to resolve the issue:

Disable the kill-blocker code by setting _kill_controlfile_enqueue_blocker=false:

 alter system set "_kill_controlfile_enqueue_blocker"=false;

_kill_controlfile_enqueue_blocker is NOT available in 10.2.0.1; it is available from 10.2.0.2 onward.

[oracle@hadoop-21 trace]$ dmesg|grep oracle

dracut: inactive '/dev/lhdw_oracle/lvm_sdb1' [9.91 TiB] inherit

EXT4-fs (dm-3): Unaligned AIO/DIO on inode 49283886 by oracle; performance will be poor.

INFO: task oracle:4583 blocked for more than 120 seconds.

oracle        D 000000000000001d     0  4583      1 0x00000080

INFO: task oracle:4585 blocked for more than 120 seconds.

oracle        D 0000000000000001     0  4585      1 0x00000080   <---- the CKPT process hung for 120 seconds: unable to perform I/O against the control file, it held the CF enqueue past the timeout

INFO: task oracle:4587 blocked for more than 120 seconds.

oracle        D 000000000000000a     0  4587      1 0x00000080

INFO: task oracle:4589 blocked for more than 120 seconds.

oracle        D 0000000000000010     0  4589      1 0x00000080

INFO: task oracle:4591 blocked for more than 120 seconds.

oracle        D 0000000000000000     0  4591      1 0x00000080

INFO: task oracle:56454 blocked for more than 120 seconds.

oracle        D 0000000000000000     0 56454      1 0x00000080

INFO: task oracle:2323 blocked for more than 120 seconds.

oracle        D 0000000000000000     0  2323      1 0x00000080

INFO: task oracle:4835 blocked for more than 120 seconds.

oracle        D 0000000000000008     0  4835      1 0x00000080

EXT4-fs (dm-3): Unaligned AIO/DIO on inode 49283837 by oracle; performance will be poor.

EXT4-fs (dm-3): Unaligned AIO/DIO on inode 49283406 by oracle; performance will be poor.

EXT4-fs (dm-3): Unaligned AIO/DIO on inode 49283408 by oracle; performance will be poor.

EXT4-fs (dm-3): Unaligned AIO/DIO on inode 49283859 by oracle; performance will be poor.

[oracle@hadoop-21 trace]$ find / -inum 49283837 2>/dev/null

##_controlfile_enqueue_timeout

This parameter sets the timeout limit for the CF enqueue; the default is 900 seconds.

The timeout can be reduced with this parameter so that a process does not have to wait 15 minutes to get the enqueue.
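The current value of this hidden parameter can be inspected as SYS; a sketch (the example value of 300 seconds is an assumption, and hidden parameters should only be changed under Oracle Support guidance):

```sql
-- Show the current value of the hidden parameter (SYS only).
SELECT i.ksppinm AS name, v.ksppstvl AS value, i.ksppdesc AS description
  FROM x$ksppi i, x$ksppcv v
 WHERE i.indx = v.indx
   AND i.ksppinm = '_controlfile_enqueue_timeout';

-- Example: lower the CF enqueue timeout to 300 seconds (takes effect after restart).
ALTER SYSTEM SET "_controlfile_enqueue_timeout" = 300 SCOPE = spfile;
```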

##_kill_controlfile_enqueue_blocker

_kill_controlfile_enqueue_blocker  FALSE   

  enable killing controlfile enqueue blocker on timeout 

This parameter enables killing controlfile enqueue blocker on timeout.

TRUE.   Default value. Enables this mechanism and kills blocker process in CF enqueue.

FALSE.  Disables this mechanism and no blocker process in CF enqueue will be killed.

This kill blocker interface / ORA-494 was introduced in 10.2.0.4. This new mechanism will kill *any* kind of blocking process, non-background or background.

The difference is that if the enqueue holder is a non-background process, the instance can keep functioning without it even after it is killed.

If the holder is a background process, for example LGWR, killing it crashes the instance.

If you want to avoid the kill of the blocker (background or non-background process), you can set _kill_controlfile_enqueue_blocker=false.

This means that no blocker of any type will be killed anymore, although the real resolution should focus on why the process holds the enqueue for so long.

Also, you may prefer to only avoid killing background processes, since they are vital to the instance, and you may want to allow the killing of non-background blockers.

In order to prevent a background blocker from being killed, the following init.ora parameter can be set to 1 (default is 2).

_kill_enqueue_blocker=1 

#for 11g:

_kill_enqueue_blocker 2

    if greater than 0 enables killing enqueue blocker

_kill_enqueue_blocker = { 0 | 1 | 2 | 3 }

0. Disables this mechanism and no foreground or background blocker process in enqueue will be killed.

1. Enables this mechanism and only kills foreground blocker process in enqueue while background process is not affected.

2. Enables this mechanism and only kills background blocker process in enqueue.

3. Default value. Enables this mechanism and kills blocker processes in enqueue.

alter system set "_kill_enqueue_blocker"=1 scope=spfile;

Do not change this parameter in a RAC environment. The reason the ORA-494 kill-blocker interface is kept is that, in most cases, customers would rather have the instance crash than suffer a cluster-wide hang.

With this parameter, if the enqueue holder is a background process, then it will not be killed, therefore the instance will not crash. If the enqueue holder is not a background process, the new 10.2.0.4 mechanism will still try to kill it.

By default, any Oracle background process can be killed.

Summary:

OS tuning: adjust the settings in sysctl.conf

 sysctl -w vm.dirty_ratio=10

 sysctl -w vm.dirty_background_ratio=5
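Note that sysctl -w changes only the running kernel; to persist the values across reboots they can also be placed in /etc/sysctl.conf and loaded with sysctl -p (same values as above):

```
# /etc/sysctl.conf
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
```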

DB tuning:

 alter system set "_kill_controlfile_enqueue_blocker"=false;

alter system set "_kill_enqueue_blocker"=1 scope=spfile;