2014-11-26 13:35
Node reboot or eviction: How to check if your private interconnect CRS can transmit network heartbeats (Doc ID 1445075.1)
Oracle Server - Enterprise Edition - Version 10.1.0.2 to 11.2.0.3 [Release 10.1 to 11.2]
Information in this document applies to any platform.
When a node reboots, the log of the CSS daemon (ocssd.log) frequently shows that the network heartbeat from one or more remote nodes was not received. For example, the message "CRS-1610:Network communication with node xxxxxx (3) missing for 90% of timeout interval. Removal of this node from cluster in 2.656 seconds" appears in ocssd.log. The node is subsequently rebooted, either to avoid a split brain or because it was evicted by another node.
The script in this note checks network connectivity using ssh. This check complements ping and traceroute because ssh uses TCP, while ping uses ICMP and traceroute on Linux/Unix uses UDP (traceroute on Windows uses ICMP).
Network communication involves both the physical connection and the OS network layers such as IP, UDP, and TCP.
CRS in 10g and 11.1 communicates over TCP, so an ssh test, which exercises the connection as well as the TCP and IP layers, is a better test than ping or traceroute.
CRS in 11.2 communicates over UDP, so an ssh test of TCP is not the optimal test, but it still complements the traceroute test.
The script tests the private interconnect only once every 5 seconds, so it puts an insignificant load on the server.
1) Create a file in a location of your choice and paste the following lines into it:
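#!/bin/ksh
# Probe every node over the private interconnect every 5 seconds until the stop date.
export TODAY=`date "+%Y%m%d"`
while [ "$TODAY" -lt <the time you want the script to stop running> ] # format needs to be YearMonthDate
do
# Refresh the date on every pass so that the log file rolls over daily.
export TODAY=`date "+%Y%m%d"`
export LOGFILE="<log file directory>/interconnect_test_${TODAY}.log"
ssh <private Ip address for node 1> "hostname; date" >> "$LOGFILE" 2>&1
ssh <private Ip address for node 2> "hostname; date" >> "$LOGFILE" 2>&1
# Two blank lines separate one round of probes from the next.
echo "" >> "$LOGFILE"
echo "" >> "$LOGFILE"
sleep 5
done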
2) Replace <private Ip address for node 1> and <private Ip address for node 2> with the real private interconnect IP addresses or private interconnect host names. The script runs the commands "hostname" and "date" on each node and appends the output to a log file.
3) If there are more than two nodes in the cluster, add one line of the following form for each additional node:
ssh <private Ip address for node 3> "hostname; date" >> $LOGFILE 2>&1
Make sure that the script issues ssh to every node, including the local node.
4) Replace <log file directory> with the real directory where the script should write its output.
The log will likely grow by less than one MB per day, so you do not need a large amount of space.
You can also delete old log files regularly.
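5) Replace <the time you want the script to stop running> with the date on which you want the script to stop running. The format has to be YearMonthDate (YYYYMMDD), for example 20121231 for December 31, 2012. Because the loop condition uses -lt, the script exits once the current date reaches this value.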
6) Save the file and issue "chmod +x <the script file name>" to make the script executable.
7) Make sure that ssh works over the private interconnect without prompting for a password.
It is best to first test the ssh connection over the private interconnect from every node to every other node, including the node itself (local node).
8) Issue "nohup <the script file name> &" to run the script in background.
Run this script from every node in the cluster.
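The passwordless ssh requirement in step 7 can be checked non-interactively before the script is started. The following is a minimal sketch and not part of the original note: the function names check_ssh and check_all are hypothetical, and it assumes OpenSSH, whose BatchMode and ConnectTimeout options make ssh fail instead of prompting for a password or waiting indefinitely.

```shell
# check_ssh HOST: succeed only if ssh to HOST works without any prompt.
# Assumes OpenSSH; BatchMode=yes disables password prompts and
# ConnectTimeout=5 bounds the wait for an unreachable host.
check_ssh() {
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1; then
        echo "OK $1"
    else
        echo "FAIL $1"
        return 1
    fi
}

# check_all HOST...: check every host and return non-zero if any check failed.
check_all() {
    rc=0
    for host in "$@"; do
        check_ssh "$host" || rc=1
    done
    return $rc
}
```

On each node, call check_all with the private interconnect names of all nodes, including the local one. Any FAIL line should be resolved before starting the test script, because a password prompt would block the script when it runs in the background.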
===============================================
How to interpret the output in the log file:
When there is a problem with the private interconnect, or when a node is down, the dates in the log file will not appear once every 5 seconds but at longer intervals.
If successive dates written while the script was running differ by more than 10 seconds, the network or server is seriously delayed in transmitting network heartbeats. If the difference exceeds 30 seconds, the node will reboot, so you are unlikely to see a difference greater than 30 seconds.
Find the approximate time at which the node rebooted and check the last output that the script wrote before the reboot. If the time difference is more than 15 seconds, then the network problem is the cause of the missing network heartbeats. Investigate why ssh (a regular OS command) hung.
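Scanning a day's log for such gaps by eye is tedious, so the comparison can be sketched with awk. This is an illustrative helper, not part of the original note: the name find_gaps is hypothetical, and it assumes that "date" prints its default C-locale format (for example "Wed Nov 26 13:35:00 CST 2014", with the time in the fourth field) and that "hostname" prints a single word. Gaps are computed per node from that node's own clock, so clock differences between nodes do not affect the result.

```shell
# find_gaps LOGFILE [THRESHOLD]: report, per node, successive date lines that
# are more than THRESHOLD seconds apart (default 10, as in the text above).
find_gaps() {
    awk -v limit="${2:-10}" '
        # A date line: six fields with HH:MM:SS in field 4 (assumed format).
        NF == 6 && $4 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
            split($4, t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]
            if (host in last) {
                gap = now - last[host]
                if (gap < 0) gap += 86400   # the interval crossed midnight
                if (gap > limit)
                    printf "%s: %d second gap before %s\n", host, gap, $4
            }
            last[host] = now
            next
        }
        # Any other single-word line is taken to be the hostname output.
        NF == 1 { host = $1 }
    ' "$1"
}
```

Running it against one log file with thresholds of 10 and then 15 mirrors the two limits described above. ssh error messages in the log are ignored, because they match neither pattern.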
==============================================
The following script is an example from a three-node cluster:
#!/bin/ksh
export TODAY=`date "+%Y%m%d"`
while [ "$TODAY" -lt 20121231 ] # format needs to be YearMonthDate
do
export TODAY=`date "+%Y%m%d"`
export LOGFILE="/tmp/interconnect_test_${TODAY}.log"
ssh drrac1-priv "hostname; date" >> "$LOGFILE" 2>&1
ssh drrac2-priv "hostname; date" >> "$LOGFILE" 2>&1
ssh drrac3-priv "hostname; date" >> "$LOGFILE" 2>&1
echo "" >> "$LOGFILE"
echo "" >> "$LOGFILE"
sleep 5
done
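The dates in the log come from the remote node, so when ssh hangs or fails outright the log does not show when the attempt started. One possible variant, sketched here as an assumption and not part of the original note, wraps each probe in local timestamps. The function name probe_once and the bracketed log format are hypothetical.

```shell
# probe_once HOST LOGFILE: run one ssh probe and record the local start and
# end times around it, together with the ssh exit status.
probe_once() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] start $1" >> "$2"
    rc=0
    ssh "$1" "hostname; date" >> "$2" 2>&1 || rc=$?
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] end $1 (ssh exit status $rc)" >> "$2"
}
```

In the loop, each ssh line would then become a call such as probe_once drrac1-priv "$LOGFILE". A long interval between a start line and its matching end line points to the probe that stalled.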