
A RAC Troubleshooting Case

2014-05-27 17:52
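Last Saturday at midnight, just as I was about to go to bed, the phone rang. A call at that hour is never good news. I did not recognize the number, and only after answering did I learn that it was an HP engineer we had contracted, who was on site handling a failure for a customer. The customer runs a two-node RAC on two HP midrange servers. Because of something done on the customer's side, the second node could no longer enter multi-user mode. Most likely someone had been careless on the system and deleted some operating system files, so the machine could only boot into maintenance mode, and the second node had to be reinstalled. The HP engineer cloned the system of the other node onto the second node and then changed the IP address, hostname and so on. After Service Guard was configured, HA came up, but starting CRS on the second node reported the following error: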

Attempting to start CRS stack

Failure at scls_scr_create with code 1

Internal Error Information:

Category: 1234

Operation: scls_scr_create

Location: mkdir

Other: Unable to make user dir

Dep: 2

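After struggling for a long time with no progress, I thought about rebooting so that the system would bring CRS up by itself. After talking it over with the HP engineer, though, I learned that on this host CRS is started manually after boot, so a reboot would be pointless. On Unix and Linux the CRS start and stop scripts live in the init.d directory. I am not very familiar with HP-UX and had to ask to find out that on HP-UX this directory is /sbin/init.d rather than /etc/init.d. CRS is started from that directory with the ./init.crs script, used as follows: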

# ./init.crs xxx <-- pass any argument to make it print its usage

Usage: ./init.crs {stop|start|enable|disable}

# ./init.crs start

This time the error message was actually useful:

/sbin/init.d/init.cssd[537]: /var/opt/oracle/scls_scr/rqtmsdb2/root/cssrun: Cannot create the specified file.

Startup will be queued to init within 30 seconds.

The error shows that CRS cannot create the file cssrun. Checking:

# cd /var/opt/oracle/scls_scr/rqtmsdb2/root/

sh: /var/opt/oracle/scls_scr/rqtmsdb2/root/: not found.

Huh, the directory does not exist!

# cd /var/opt/oracle/scls_scr/

One look at ls -l makes it clear:

# ls -l

total 0

drwxr-xr-x 4 root sys 96 Dec 31 2010 rqtmsdb1

Because this system was cloned from the first node, the directory that should be named rqtmsdb2 is still named rqtmsdb1. No wonder!

Fix it:

# mv rqtmsdb1 rqtmsdb2

# ls -l

total 0

drwxr-xr-x 4 root sys 96 Dec 31 2010 rqtmsdb2

# cd rq*

# ls -l

total 16

drwxr-xr-x 2 orarac sys 96 Dec 31 2010 orarac

drwxr-xr-x 2 root sys 8192 Nov 17 09:55 root

# cd root

# ls -l

total 48

-rw-rw-rw- 1 root root 8 Nov 17 15:33 crsdboot

-rw-r--r-- 1 root sys 7 Dec 31 2010 crsstart

-rw-rw-rw- 1 root sys 6 Nov 17 15:33 cssrun

-rw-r--r-- 1 root sys 0 Nov 17 15:33 noclsmon

-rw-rw-rw- 1 root root 0 Nov 17 15:33 nooprocd
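The mismatch found above, a per-node directory that still carries the source node's name after cloning, can be detected with a small check. The following is only a sketch: the function name check_scls_dir is hypothetical, and it assumes that the directory under scls_scr is named exactly after the node, as the listings above suggest.

```shell
#!/bin/sh
# Hypothetical helper (not part of Oracle Clusterware): report whether the
# per-node directory under the scls_scr base matches the expected node name.
# $1 = base directory, e.g. /var/opt/oracle/scls_scr
# $2 = expected node name, e.g. rqtmsdb2
check_scls_dir() {
    base=$1
    node=$2
    if [ -d "$base/$node" ]; then
        echo "ok: $base/$node exists"
        return 0
    fi
    # Show what is there instead, which on a cloned node is typically
    # the directory of the node the image was taken from.
    echo "mismatch: $base/$node is missing; found instead:"
    ls "$base" 2>/dev/null | sed 's/^/  /'
    return 1
}
```

On the cloned node it could be called as check_scls_dir /var/opt/oracle/scls_scr rqtmsdb2. Passing the node name explicitly avoids guessing whether hostname prints a short or a fully qualified name.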

Start CRS again:

# cd /sbin/init.d

#

# ./init.crs start

Startup will be queued to init within 30 seconds.

# ps -ef|grep d.bin

root 18734 22410 1 02:22:49 pts/ta 0:00 grep d.bin

# ps -ef|grep d.bin

root 2059 1 0 22:03:36 ? 0:00 /ora_soft/oracle/product/crs/bin/crsd.bin reboot

orarac 18782 2057 0 02:23:09 ? 0:00 /ora_soft/oracle/product/crs/bin/evmd.bin

orarac 19013 19012 0 02:23:14 ? 0:00 /ora_soft/oracle/product/crs/bin/ocssd.bin

# /ora_soft/oracle/product/crs/bin/crsctl check crs

CSS appears healthy

CRS appears healthy

EVM appears healthy

# /ora_soft/oracle/product/crs/bin/crlctl stop crs

sh: /ora_soft/oracle/product/crs/bin/crlctl: not found.

# /ora_soft/oracle/product/crs/bin/crsctl stop crs

Stopping resources.

Successfully stopped CRS resources

Stopping CSSD.

Shutting down CSS daemon.

Shutdown request successfully issued.

# ps -ef|grep d.bin

root 21987 22410 0 02:24:53 pts/ta 0:00 grep d.bin

# /ora_soft/oracle/product/crs/bin/crsctl start crs

Attempting to start CRS stack

The CRS stack will be started shortly

# ps -ef|grep d.bin

root 23992 22410 0 02:32:59 pts/ta 0:00 grep d.bin

# ps -ef|grep d.bin

root 23995 22410 0 02:33:05 pts/ta 0:00 grep d.bin

# ps -ef|grep d.bin

root 21829 1 0 02:24:44 ? 0:00 /ora_soft/oracle/product/crs/bin/crsd.bin reboot

orarac 24152 21817 0 02:33:18 ? 0:00 /ora_soft/oracle/product/crs/bin/evmd.bin

orarac 24299 24298 0 02:33:21 ? 0:00 /ora_soft/oracle/product/crs/bin/ocssd.bin

root 24577 22410 0 02:33:31 pts/ta 0:00 grep d.bin

# /ora_soft/oracle/product/crs/bin/crsctl status

Unknown parameter: status

# /ora_soft/oracle/product/crs/bin/crsctl check crs

CSS appears healthy

CRS appears healthy

EVM appears healthy

#

This time it started normally!
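Because init.crs and crsctl start crs only queue the startup, the daemons show up some seconds later, which is why ps had to be repeated by hand in the session above. A polling loop can do the waiting instead. This is a minimal sketch; wait_for and css_running are hypothetical helper names, and the attempt count and delay are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly until it succeeds or the
# attempts run out.
# $1 = maximum attempts, $2 = seconds to sleep between attempts,
# remaining arguments = command to run
wait_for() {
    tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Succeeds when a process whose command line contains ocssd.bin is running.
# The [o] bracket keeps grep from matching its own command line.
css_running() {
    ps -ef | grep '[o]cssd.bin'
}
```

For example, wait_for 30 5 css_running && /ora_soft/oracle/product/crs/bin/crsctl check crs waits up to roughly two and a half minutes for ocssd.bin and then runs the health check.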

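Next I went back to check the first node. The HP engineer had told me nothing on this node had been touched, and I believed him. Cloning a system should indeed require no changes on the source node, but reality is cruel!

I typed the commands: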

# cd /sbin/init.d

#

# ./init.crs start

Startup will be queued to init within 30 seconds.

The d.bin processes never appeared and nothing happened, so I went back to check the operating system log:

Nov 18 03:26:00 rqtmsdb1 syslog: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.2104.

Nov 18 03:26:00 rqtmsdb1 syslog: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.2116.

Nov 18 03:26:00 rqtmsdb1 syslog: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.2154.

Nov 18 03:34:16 rqtmsdb1 syslog: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.2154.

So there are some error messages after all. Here is one of those files:

# cat /tmp/crsctl.2104

Failed 3 to bind listening endpoint:(ADDRESS=(PROTOCOL=tcp)(HOST=rqtmsdb1-priv))

#

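CRS cannot bind its listening endpoint to the private IP. I checked /etc/hosts and found that it had no private IP entry for this node, only the one for the second node. I then checked /etc/hosts on the second node, compared the two files, and added the first node's private IP: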

192.168.0.1 rqtmsdb1-priv

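Not checking /etc/hosts at the very beginning was a real mistake. Whatever you are told, always verify it yourself! Once again I was tripped up by /etc/hosts in a RAC environment. Some time ago, while I was configuring RAC at another customer site, an engineer had removed the system default localhost entry, and it took me a whole day to discover that the missing localhost entry was the cause.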
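Since /etc/hosts caused trouble twice, a quick check that every expected name is present can be run before CRS is started. The sketch below is an assumption-laden illustration: missing_hosts is a hypothetical function name, and the list of names to check has to come from your own cluster definition.

```shell
#!/bin/sh
# Hypothetical helper: print every name from the argument list that has no
# entry in the given hosts file. Returns non-zero if any name is missing.
# $1 = hosts file, remaining arguments = host names to look for
missing_hosts() {
    file=$1
    shift
    rc=0
    for name in "$@"; do
        # Drop comments, then look for the name as a whole field
        # after the address column.
        if ! sed 's/#.*//' "$file" | awk -v n="$name" '
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit !found }'; then
            echo "$name"
            rc=1
        fi
    done
    return $rc
}
```

A call such as missing_hosts /etc/hosts localhost rqtmsdb1-priv rqtmsdb2-priv would have printed rqtmsdb1-priv on the first node. The name rqtmsdb2-priv is assumed here by analogy with rqtmsdb1-priv and does not appear in the logs.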

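I started CRS again, and this time it came up normally. I thought everything was fine and I could go to bed, but it turned out that the VIP still had a problem.

Starting the cluster with crs_start -all reported a failure: the VIP would not come up, so everything that depends on it failed as well. This error is easy to handle because I had solved it before. First enable debugging for the VIP resource:

# /ora_soft/oracle/product/crs/bin/crsctl debug log res "ora.rqtmsdb1.vip:5"

Then start the VIP resource on its own: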

# /ora_soft/oracle/product/crs/bin/srvctl start nodeapps -n rqtmsdb1

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] Checking interface existance

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] Calling getifbyip

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] getifbyip: started for 172.16.7.22

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] Completed getifbyip

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] switched to standby : start/check operation

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] Completed with initial interface test

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] Broadcast = 172.16.7.255

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] Interface tests

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] checkIf: start for if=lan0

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] checkIf: get default gw

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] defaultgw: started

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] defaultgw: completed with

rqtmsdb1:ora.rqtmsdb1.vip:checkIf: Default gateway is not defined (host=rqtmsdb1)

rqtmsdb1:ora.rqtmsdb1.vip:Interface lan0 checked failed (host=rqtmsdb1)

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] checkIf: end for if=lan0

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] DEBUG: FAIL_WHEN_ALL_LINK_DOWN = 1 and IF_USING =

rqtmsdb1:ora.rqtmsdb1.vip:Invalid parameters, or failed to bring up VIP (host=rqtmsdb1)

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] Checking interface existance

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] Calling getifbyip

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] getifbyip: started for 172.16.7.22

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] Completed getifbyip

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] switched to standby : start/check operation

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Completed with initial interface test

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Broadcast = 172.16.7.255

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Performing CRS_STAT testing

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Completed CRS_STAT testing

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Interface tests

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] checkIf: start for if=lan0

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] checkIf: get default gw

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] defaultgw: started

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] defaultgw: completed with

rqtmsdb1:ora.rqtmsdb1.vip:checkIf: Default gateway is not defined (host=rqtmsdb1)

rqtmsdb1:ora.rqtmsdb1.vip:Interface lan0 checked failed (host=rqtmsdb1)

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] checkIf: end for if=lan0

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] DEBUG: FAIL_WHEN_ALL_LINK_DOWN = 1 and IF_USING =

rqtmsdb1:ora.rqtmsdb1.vip:Invalid parameters, or failed to bring up VIP (host=rqtmsdb1)

CRS-1006: No more members to consider

CRS-0215: Could not start resource 'ora.rqtmsdb1.vip'.

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] Checking interface existance

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] Calling getifbyip

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] getifbyip: started for 172.16.7.22

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] Completed getifbyip

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] switched to standby : start/check operation

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] Completed with initial interface test

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] Broadcast = 172.16.7.255

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] Interface tests

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] checkIf: start for if=lan0

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] checkIf: get default gw

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] defaultgw: started

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] defaultgw: completed with

rqtmsdb1:ora.rqtmsdb1.vip:checkIf: Default gateway is not defined (host=rqtmsdb1)

rqtmsdb1:ora.rqtmsdb1.vip:Interface lan0 checked failed (host=rqtmsdb1)

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] checkIf: end for if=lan0

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] DEBUG: FAIL_WHEN_ALL_LINK_DOWN = 1 and IF_USING =

rqtmsdb1:ora.rqtmsdb1.vip:Invalid parameters, or failed to bring up VIP (host=rqtmsdb1)

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] Checking interface existance

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] Calling getifbyip

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] getifbyip: started for 172.16.7.22

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] Completed getifbyip

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] switched to standby : start/check operation

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Completed with initial interface test

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Broadcast = 172.16.7.255

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Performing CRS_STAT testing

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Completed CRS_STAT testing

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Interface tests

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] checkIf: start for if=lan0

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] checkIf: get default gw

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] defaultgw: started

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] defaultgw: completed with

rqtmsdb1:ora.rqtmsdb1.vip:checkIf: Default gateway is not defined (host=rqtmsdb1)

rqtmsdb1:ora.rqtmsdb1.vip:Interface lan0 checked failed (host=rqtmsdb1)

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] checkIf: end for if=lan0

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] DEBUG: FAIL_WHEN_ALL_LINK_DOWN = 1 and IF_USING =

rqtmsdb1:ora.rqtmsdb1.vip:Invalid parameters, or failed to bring up VIP (host=rqtmsdb1)

CRS-0215: Could not start resource 'ora.rqtmsdb1.LISTENER_RQTMSDB1.lsnr'.

#

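No default gateway was configured. Checking the IP address configuration next, I found that the IP address was configured on lan2. Only when I asked did I learn that lan0 had often had problems, so this time they had moved to lan2. Why not say so earlier!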

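When the VIP starts it pings the default gateway, and if the gateway is unreachable the VIP will not come up. After the HP engineer configured the default gateway, the VIP was moved from lan0 to lan2.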
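Whether a default gateway is defined at all can be verified before the VIP is started. The sketch below parses the routing table as printed by netstat -rn. The function name default_gw is hypothetical, and the code assumes that the default route appears with "default" (or "0.0.0.0" on some platforms) in the first column and the gateway in the second. Check the column layout on your own system before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -rn` output on stdin and print the
# gateway of the default route. Exits non-zero if no default route is found.
# Assumes column 1 is the destination and column 2 is the gateway.
default_gw() {
    awk '$1 == "default" || $1 == "0.0.0.0" { print $2; found = 1; exit }
         END { exit !found }'
}
```

A possible use is: if gw=$(netstat -rn | default_gw); then echo "default gateway: $gw"; else echo "no default gateway defined"; fi. Pinging the reported gateway afterwards is a sensible second step, but the ping options differ between HP-UX and Linux, so that call is left out here.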

First delete the existing interface configuration:

su - oracle

oifcfg delif -global

Then configure it again:

$oifcfg setif -global lan2/172.16.7.0:public

$oifcfg setif -global lan3/192.168.0.0:cluster_interconnect

#/ora_soft/oracle/product/crs/bin/srvctl modify nodeapps -n rqtmsdb2 -A 172.16.7.23/255.255.255.0/lan2

#/ora_soft/oracle/product/crs/bin/srvctl modify nodeapps -n rqtmsdb1 -A 172.16.7.22/255.255.255.0/lan2

After these changes I ran crs_start -all again and RAC started successfully. Done for the night, off to bed!
http://blog.chinaunix.net/uid-26896647-id-3417998.html