Oracle 11.2.0.4.0 RAC: DRM Causes a Single-Node Instance Crash
2017-08-15 16:38
DRM (Dynamic Resource Mastering) has too many bugs, so the recommendation is simply to disable it.
Alert log:
Errors in file /oracle/app/oracle/diag/rdbms/gg/gg1/trace/gg1_lmon_60688126.trc:
ORA-29702: error occurred in Cluster Group Service operation
No connectivity to other instances in the cluster during startup. Hence, LMON is terminating the instance. Please check the LMON trace file for details. Also, please check the network logs of this instance along with clusterwide network health for problems
and then re-start this instance.
LMON (ospid: 63112654): terminating the instance
Dumping diagnostic data in directory=[cdmp_20170814161033], requested by (instance=1, osid=63112654 (LMON)), summary=[abnormal instance termination].
Instance terminated by LMON, pid = 63112654
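LMON: The LMON processes of the instances communicate with each other periodically to check the health of every node in the cluster. When a node fails, LMON handles cluster reconfiguration, GRD (Global Resource Directory) recovery, and related work. The service it provides is called CGS (Cluster Group Services). LMON can cooperate with the underlying clusterware or work on its own. When LMON detects an instance-level split-brain, it notifies the clusterware and expects the clusterware to resolve it. RAC does not assume that the clusterware will succeed, however, so LMON does not wait indefinitely for the clusterware's result. If the wait times out, LMON automatically triggers IMR (Instance Membership Recovery). IMR can be viewed as the split-brain and I/O fencing mechanism that Oracle provides at the database layer.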
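LMON relies mainly on two heartbeat mechanisms for health checking:
1. The network heartbeat between nodes.
2. The disk heartbeat in the control file. The CKPT process on each node updates one block of the control file every 3 seconds. This activity can be observed through x$kcccp: SQL> select inst_id,cphbt from x$kcccp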
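The disk heartbeat check can be illustrated offline. This is a minimal sketch, not Oracle's implementation. It assumes the x$kcccp query was run twice, a few seconds apart, that the rows were copied into Python keyed by a row identifier (the query above selects inst_id), and that cphbt grows with each CKPT heartbeat write. The function name stalled_instances and the sample values are hypothetical.

```python
def stalled_instances(first, second):
    """Return the keys whose cphbt did not advance between two samples.

    first, second: dicts mapping a row identifier (e.g. inst_id) to cphbt,
    sampled some seconds apart. Hypothetical input, copied by hand from
    the x$kcccp query output.
    """
    stalled = []
    for inst_id, before in sorted(first.items()):
        after = second.get(inst_id)
        # A missing row or a counter that did not grow is treated as
        # "no heartbeat seen" in this sketch.
        if after is None or after <= before:
            stalled.append(inst_id)
    return stalled


# Made-up sample values, for illustration only.
sample_1 = {1: 954271001, 2: 954271000}
sample_2 = {1: 954271004, 2: 954271000}
print(stalled_instances(sample_1, sample_2))
```

With these made-up samples, only the second entry is reported, because its counter is unchanged between the two samples.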
gg1_lmon_60688126.trc:
2017-08-14 16:07:40.381460 : kjfspseudorcfg: requested with reason 5(DRM Quiesce step stall)
* kjfcln: DRM aborted due to CGS rcfg.
*** 2017-08-14 16:07:44.621
=====================================================
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915a50 rcfgtm 5 sec
*** 2017-08-14 16:07:49.605
=====================================================
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915a55 rcfgtm 10 sec
*** 2017-08-14 16:07:54.581
=====================================================
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915a5a rcfgtm 15 sec
............................................................................
*** 2017-08-14 16:08:59.675
=====================================================
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915a9b rcfgtm 80 sec
*** 2017-08-14 16:09:04.694
=====================================================
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915aa0 rcfgtm 85 sec
kjxgmpoll: the CGS reconfiguration has spent 85 seconds.
kjxgmpoll: terminate the CGS reconfig.
Error: Cluster Group Service reconfiguration takes too long
LMON caught an error 29702 in the main loop
error 29702 detected in background process
ORA-29702: error occurred in Cluster Group Service operation
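The CGS reconfiguration was itself triggered by the failed DRM operation: the first lines of the trace show it was requested with reason 5 (DRM Quiesce step stall).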
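The kjxgmpoll lines in the trace above can be cross-checked with a little arithmetic. The sketch below parses the start and cur fields as hexadecimal integers and compares their difference with the reported rcfgtm. Reading these fields as Unix epoch seconds is an assumption based only on this trace, not on Oracle documentation, and the helper name parse_polls is hypothetical.

```python
import re

# Excerpt copied from the LMON trace above.
TRACE = """\
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915a50 rcfgtm 5 sec
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915a55 rcfgtm 10 sec
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915a5a rcfgtm 15 sec
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915a9b rcfgtm 80 sec
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915aa0 rcfgtm 85 sec
"""

POLL_RE = re.compile(r"start 0x([0-9a-f]+) cur 0x([0-9a-f]+) rcfgtm (\d+) sec")


def parse_polls(text):
    """Return (start, cur, rcfgtm) integer tuples, one per kjxgmpoll state line."""
    return [
        (int(m.group(1), 16), int(m.group(2), 16), int(m.group(3)))
        for m in POLL_RE.finditer(text)
    ]


polls = parse_polls(TRACE)
for start, cur, rcfgtm in polls:
    # If start and cur are second counters, cur - start should equal rcfgtm.
    print(f"cur - start = {cur - start:3d} s, rcfgtm = {rcfgtm:3d} s")
```

For every line in the excerpt the difference matches rcfgtm, which is consistent with the reconfiguration clock starting at 16:07:39 and LMON giving up once 85 seconds had elapsed, as the trace itself reports.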