您的位置:首页 > 理论基础

为什么KVM计算机点无故重启?

2015-12-04 21:00 316 查看
一.故障1:机器hangs

本地一台cloudstack计算节点无故连不上了,cloudstack也坏了,后查看有一台系统虚拟机在这台计算节点上,导致cs挂了。去找到这台机器后,发现这台机器卡住了,重启后卡在starting udev,等好久也不行,即使进入单用户也是一样,重启很多次也是卡在这。最后查到一篇文章,原话:

Remove
quiet
from the kernel command line and you should get enough output to see the cause of the hang.

然后进入系统菜单,在编辑kernel行,quiet静默模式。相当于设置"loglevel=4"(WARNING)。将quiet删除后,等待进入系统。大概一个小时后,进去系统了,能ssh了,问题解决了一个,得查一下机器为什么重启。

二.故障2:kvm计算机点为什么无故重启

通过查看cloudstack-agent的日志。

[root@kvm204 ~]# tail -1000f /var/log/cloudstack/agent/cloudstack-agent.out




看到了这里有报警,并且执行了reboot命令。继续往上面翻,又看到一条信息,如下:



看到很多这样的报错,但最后一次报错是下面这个,可能是它导致了kvm检测脚本执行了reboot命令,然后就出现了本文第一张图片日志里reboot the host。



现在知道了是谁重启了电脑,但kvm为什么会重启这台计算节点呢?带着疑问,我去查看了一下那个脚本,也就是kvmheartbeat.sh脚本。内容如下:

#!/bin/bash
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0 #
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

help() {
printf "Usage: $0
-i nfs server ip
-p nfs server path
-m mount point
-h host
-r write/read hb log
-c cleanup
-t interval between read hb log\n"
exit 1
}
#set -x
NfsSvrIP=
NfsSvrPath=
MountPoint=
HostIP=
interval=
rflag=0
cflag=0

while getopts 'i:p:m:h:t:rc' OPTION
do
case $OPTION in
i)
NfsSvrIP="$OPTARG"
;;
p)
NfsSvrPath="$OPTARG"
;;
m)
MountPoint="$OPTARG"
;;
h)
HostIP="$OPTARG"
;;
r)
rflag=1
;;
t)
interval="$OPTARG"
;;
c)
cflag=1
;;
*)
help
;;
esac
done

if [ -z "$NfsSvrIP" ]
then
exit 1
fi

#delete VMs on this mountpoint
deleteVMs() {
local mountPoint=$1
vmPids=$(ps aux| grep qemu | grep "$mountPoint" | awk '{print $2}' 2> /dev/null)
if [ $? -gt 0 ]
then
return
fi

if [ -z "$vmPids" ]
then
return
fi

for pid in $vmPids
do
kill -9 $pid &> /dev/null
done
}

#checking is there the same nfs server mounted under $MountPoint?
mounts=$(cat /proc/mounts |grep nfs|grep $MountPoint)
if [ $? -gt 0 ]
then
# remount it
mount $NfsSvrIP:$NfsSvrPath $MountPoint -o sync,soft,proto=tcp,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,noac,timeo=133,retrans=10 &> /dev/null
if [ $? -gt 0 ]
then
printf "Failed to remount $NfsSvrIP:$NfsSvrPath under $MountPoint"
exit 1
fi
if [ "$rflag" == "0" ]
then
deleteVMs $MountPoint
fi
fi

hbFolder=$MountPoint/KVMHA/
hbFile=$hbFolder/hb-$HostIP

write_hbLog() {
#write the heart beat log
stat $hbFile &> /dev/null
if [ $? -gt 0 ]
then
# create a new one
mkdir -p $hbFolder &> /dev/null
touch $hbFile &> /dev/null
if [ $? -gt 0 ]
then
printf "Failed to create $hbFile"
return 2
fi
fi

timestamp=$(date +%s)
echo $timestamp > $hbFile
return $?
}

check_hbLog() {
now=$(date +%s)
hb=$(cat $hbFile)
diff=`expr $now - $hb`
if [ $diff -gt $interval ]
then
return 1
fi
return 0
}

if [ "$rflag" == "1" ]
then
check_hbLog
if [ $? == 0 ]
then
echo "=====> ALIVE <====="
else
echo "=====> DEAD <======"
fi
exit 0
elif [ "$cflag" == "1" ]
then
reboot
exit $?
else
write_hbLog
exit $?
fi


上面这个脚本是我们现在的计算节点上的版本。从上面可以看出,有判断reboot命令,也就是知道了为什么会重启。但这很坑爹呀,你检测检测,你还重启。。。。

下面这个脚本是我在github上看到的。而且版本已经更新。距离现在8月。应该是最新的,从下面看出,这个坑爹的reboot命令已经被修改了。


#!/bin/bash
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0 #
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

help() {
printf "Usage: $0
-i nfs server ip
-p nfs server path
-m mount point
-h host
-r write/read hb log
-c cleanup
-t interval between read hb log\n"
exit 1
}
#set -x
NfsSvrIP=
NfsSvrPath=
MountPoint=
HostIP=
interval=
rflag=0
cflag=0

while getopts 'i:p:m:h:t:rc' OPTION
do
case $OPTION in
i)
NfsSvrIP="$OPTARG"
;;
p)
NfsSvrPath="$OPTARG"
;;
m)
MountPoint="$OPTARG"
;;
h)
HostIP="$OPTARG"
;;
r)
rflag=1
;;
t)
interval="$OPTARG"
;;
c)
cflag=1
;;
*)
help
;;
esac
done

if [ -z "$NfsSvrIP" ]
then
exit 1
fi

#delete VMs on this mountpoint
deleteVMs() {
local mountPoint=$1
vmPids=$(ps aux| grep qemu | grep "$mountPoint" | awk '{print $2}' 2> /dev/null)
if [ $? -gt 0 ]
then
return
fi

if [ -z "$vmPids" ]
then
return
fi

for pid in $vmPids
do
kill -9 $pid &> /dev/null
done
}

#checking is there the same nfs server mounted under $MountPoint?
mounts=$(cat /proc/mounts |grep nfs|grep $MountPoint)
if [ $? -gt 0 ]
then
# remount it
mount $NfsSvrIP:$NfsSvrPath $MountPoint -o sync,soft,proto=tcp,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,noac,timeo=133,retrans=10 &> /dev/null
if [ $? -gt 0 ]
then
printf "Failed to remount $NfsSvrIP:$NfsSvrPath under $MountPoint"
exit 1
fi
if [ "$rflag" == "0" ]
then
deleteVMs $MountPoint
fi
fi

hbFolder=$MountPoint/KVMHA/
hbFile=$hbFolder/hb-$HostIP

write_hbLog() {
#write the heart beat log
stat $hbFile &> /dev/null
if [ $? -gt 0 ]
then
# create a new one
mkdir -p $hbFolder &> /dev/null
touch $hbFile &> /dev/null
if [ $? -gt 0 ]
then
printf "Failed to create $hbFile"
return 2
fi
fi

timestamp=$(date +%s)
echo $timestamp > $hbFile
return $?
}

check_hbLog() {
now=$(date +%s)
hb=$(cat $hbFile)
diff=`expr $now - $hb`
if [ $diff -gt $interval ]
then
return 1
fi
return 0
}

if [ "$rflag" == "1" ]
then
check_hbLog
if [ $? == 0 ]
then
echo "=====> ALIVE <====="
else
echo "=====> DEAD <======"
fi
exit 0
elif [ "$cflag" == "1" ]
then
/usr/bin/logger -t heartbeat "kvmheartbeat.sh rebooted system because it was unable to write the heartbeat to the storage."
sync &
sleep 5
echo b > /proc/sysrq-trigger
exit $?
else
write_hbLog
exit $?
fi
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: