Linux下,多线程程序死循环问题调试
2014-12-03 00:19
423 查看
当你的软件在某个时刻停止服务,CPU占用达到100%+,这种问题一个可能的原因是产生了死循环,假设程序某处存在潜在的死循环,并在某种条件下会引发,本文以一个示例来定位出现死循环的位置。
当程序某处存在死循环,通常定位问题及缩小范围的方法是,在可疑的代码处加log,或者注释掉可疑代码,这对于容易重现问题的程序来说还好,但对于“偶尔”才会产生问题程序却很难调试,因为我们很难重现程序故障。本文所述的调试过程正是在这种情况下,假设问题已经出现,我们要求环境保护现场,即出问题的程序还在运行中。
1.我们首先要知道是哪个线程出了问题:
首先查一下出问题进程的pid,例如
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
ovtsvn@ovtsvn:~/MASS4/src/icdn/src$ ps -ef | grep icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
ovtsvn 11065 1 50 11:57 ? 00:00:07 ./icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
ovtsvn 11076 10971 0 11:57 pts/2 00:00:00 grep icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
ovtsvn@ovtsvn:~/MASS4/src/icdn/src$
然后top命令查看线程信息:
top -H -p 11065
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11073 ovtsvn 25 0 325m 3980 2236 R 100 0.4 1:40.84 icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11065 ovtsvn 18 0 325m 3980 2236 S 0 0.4 0:00.01 icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11066 ovtsvn 18 0 325m 3980 2236 S 0 0.4 0:00.00 icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11067 ovtsvn 15 0 325m 3980 2236 S 0 0.4 0:00.00 icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11068 ovtsvn 15 0 325m 3980 2236 S 0 0.4 0:00.00 icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11069 ovtsvn 18 0 325m 3980 2236 S 0 0.4 0:00.00 icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11070 ovtsvn 18 0 325m 3980 2236 S 0 0.4 0:00.00 icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11071 ovtsvn 22 0 325m 3980 2236 S 0 0.4 0:00.00 icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11072 ovtsvn 15 0 325m 3980 2236 R 0 0.4 0:00.00 icdn
从上面可以看出,出问题线程PID为11073
2.接下来,我们用gdb来attach目标进程
执行: gdb icdn 11065
在gdb中,列出线程状态:
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
(gdb) info threads
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
9 Thread 47056948181264 (LWP 11066) 0x00002acc4a3dec91 in nanosleep () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
8 Thread 47056956573968 (LWP 11067) 0x00002acc4a406fc2 in select () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
7 Thread 47056964966672 (LWP 11068) 0x00002acc4a3dec91 in nanosleep () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
6 Thread 47056973359376 (LWP 11069) 0x00002acc4a3dec91 in nanosleep () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
5 Thread 47056981752080 (LWP 11070) 0x00002acc4a3dec91 in nanosleep () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
4 Thread 47056990144784 (LWP 11071) 0x00002acc4a40e63c in recvfrom () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
3 Thread 47057194060048 (LWP 11072) 0x00002acc4a406fc2 in select () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
2 Thread 47057226893584 (LWP 11073) CSendFile::SendFile (this=0x2acc5d4aff40, pathname=@0x2acc5d4afee0)
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
at ../src/csendfile.cpp:101
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
1 Thread 47056939784832 (LWP 11065) 0x00002acc4a3dec91 in nanosleep () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
(gdb)
gdb已经列出了各线程正在执行的函数,我们需要更多信息,记住11073对应的行首标号,这是gdb为线程分配的id,这里为2,然后执行切换:
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
(gdb) thread 2
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
[Switching to thread 2 (Thread 47057226893584 (LWP 11073))]#0 CSendFile::SendFile (this=0x2acc5d4aff40, pathname=@0x2acc5d4afee0)
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
at ../src/csendfile.cpp:101
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
101 while(1)
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
(gdb)
bt一下:
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
(gdb) bt
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
#0 CSendFile::SendFile (this=0x2acc5d4aff40, pathname=@0x2acc5d4afee0) at ../src/csendfile.cpp:101
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
#1 0x000000000040592e in CIcdn::TaskThread (pParam=0x7fff617eafe0) at ../src/cicdn.cpp:128
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
#2 0x00002acc4a90b73a in start_thread () from /lib/libpthread.so.0
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
#3 0x00002acc4a40d6dd in clone () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
#4 0x0000000000000000 in ?? ()
来看一下101行的代码:
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
(gdb) l
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
96 }
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
97
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
98 int CSendFile::SendFile(const string& pathname)
![](http://www.cppblog.com/Images/OutliningIndicators/ExpandedBlockStart.gif)
99 {
![](http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif)
100 int n;
![](http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif)
101 while(1)
![](http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
102 {
![](http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif)
103 n++;
![](http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
104 }
![](http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif)
105 //read file and send
现在我们定位到了出问题的代码位置,这里的循环只用来演示的。
最后别忘了detach
/**************************************************************************************************/
使用top观察是否存在CPU使用率过高现象
对CPU使用率过高的进程的所有线程进行排序
使用gdb attach nmsagent所在的进程,在gdb中使用 info threads显示所有线程
得到如下结果,可以发现2909线程的编号是12
使用thread 切换线程,使用bt显示线程栈
得到如下线程栈
得到进程ID 21465
得到线程时间, 其中最占CPU的是 EpollRecvTask 21581
使用 strace -p 21581 得到线程栈
当程序某处存在死循环,通常定位问题及缩小范围的方法是,在可疑的代码处加log,或者注释掉可疑代码,这对于容易重现问题的程序来说还好,但对于“偶尔”才会产生问题程序却很难调试,因为我们很难重现程序故障。本文所述的调试过程正是在这种情况下,假设问题已经出现,我们要求环境保护现场,即出问题的程序还在运行中。
1.我们首先要知道是哪个线程出了问题:
首先查一下出问题进程的pid,例如
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
ovtsvn@ovtsvn:~/MASS4/src/icdn/src$ ps -ef | grep icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
ovtsvn 11065 1 50 11:57 ? 00:00:07 ./icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
ovtsvn 11076 10971 0 11:57 pts/2 00:00:00 grep icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
ovtsvn@ovtsvn:~/MASS4/src/icdn/src$
然后top命令查看线程信息:
top -H -p 11065
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11073 ovtsvn 25 0 325m 3980 2236 R 100 0.4 1:40.84 icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11065 ovtsvn 18 0 325m 3980 2236 S 0 0.4 0:00.01 icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11066 ovtsvn 18 0 325m 3980 2236 S 0 0.4 0:00.00 icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11067 ovtsvn 15 0 325m 3980 2236 S 0 0.4 0:00.00 icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11068 ovtsvn 15 0 325m 3980 2236 S 0 0.4 0:00.00 icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11069 ovtsvn 18 0 325m 3980 2236 S 0 0.4 0:00.00 icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11070 ovtsvn 18 0 325m 3980 2236 S 0 0.4 0:00.00 icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11071 ovtsvn 22 0 325m 3980 2236 S 0 0.4 0:00.00 icdn
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
11072 ovtsvn 15 0 325m 3980 2236 R 0 0.4 0:00.00 icdn
从上面可以看出,出问题线程PID为11073
2.接下来,我们用gdb来attach目标进程
执行: gdb icdn 11065
在gdb中,列出线程状态:
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
(gdb) info threads
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
9 Thread 47056948181264 (LWP 11066) 0x00002acc4a3dec91 in nanosleep () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
8 Thread 47056956573968 (LWP 11067) 0x00002acc4a406fc2 in select () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
7 Thread 47056964966672 (LWP 11068) 0x00002acc4a3dec91 in nanosleep () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
6 Thread 47056973359376 (LWP 11069) 0x00002acc4a3dec91 in nanosleep () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
5 Thread 47056981752080 (LWP 11070) 0x00002acc4a3dec91 in nanosleep () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
4 Thread 47056990144784 (LWP 11071) 0x00002acc4a40e63c in recvfrom () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
3 Thread 47057194060048 (LWP 11072) 0x00002acc4a406fc2 in select () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
2 Thread 47057226893584 (LWP 11073) CSendFile::SendFile (this=0x2acc5d4aff40, pathname=@0x2acc5d4afee0)
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
at ../src/csendfile.cpp:101
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
1 Thread 47056939784832 (LWP 11065) 0x00002acc4a3dec91 in nanosleep () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
(gdb)
gdb已经列出了各线程正在执行的函数,我们需要更多信息,记住11073对应的行首标号,这是gdb为线程分配的id,这里为2,然后执行切换:
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
(gdb) thread 2
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
[Switching to thread 2 (Thread 47057226893584 (LWP 11073))]#0 CSendFile::SendFile (this=0x2acc5d4aff40, pathname=@0x2acc5d4afee0)
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
at ../src/csendfile.cpp:101
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
101 while(1)
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
(gdb)
bt一下:
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
(gdb) bt
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
#0 CSendFile::SendFile (this=0x2acc5d4aff40, pathname=@0x2acc5d4afee0) at ../src/csendfile.cpp:101
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
#1 0x000000000040592e in CIcdn::TaskThread (pParam=0x7fff617eafe0) at ../src/cicdn.cpp:128
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
#2 0x00002acc4a90b73a in start_thread () from /lib/libpthread.so.0
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
#3 0x00002acc4a40d6dd in clone () from /lib/libc.so.6
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
#4 0x0000000000000000 in ?? ()
来看一下101行的代码:
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
(gdb) l
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
96 }
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
97
![](http://www.cppblog.com/Images/OutliningIndicators/None.gif)
98 int CSendFile::SendFile(const string& pathname)
![](http://www.cppblog.com/Images/OutliningIndicators/ExpandedBlockStart.gif)
99 {
![](http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif)
100 int n;
![](http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif)
101 while(1)
![](http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
102 {
![](http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif)
103 n++;
![](http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
104 }
![](http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif)
105 //read file and send
现在我们定位到了出问题的代码位置,这里的循环只用来演示的。
最后别忘了detach
/**************************************************************************************************/
确定是CPU过高
使用top观察是否存在CPU使用率过高现象
找出线程
对CPU使用率过高的进程的所有线程进行排序ps H -e -o pid,tid,pcpu,cmd --sort=pcpu |grep xxx
得到如下结果,其中线程2909使用了7.8%的CPU.
2907 2913 0.0 ./xxx
2907 2909 7.8 ./xxx
也可以通过查看/proc中的信息来确定高CPU线程. 打印了4列,线程ID,线程名,用户时间和内核时间(排名未分先后)
awk '{print $1,$2,$14,$15}' /proc/2907/task/*/stat
找出调用栈
使用gdb attach nmsagent所在的进程,在gdb中使用 info threads显示所有线程gdb gdb>attach 2907 gdb>info threads
得到如下结果,可以发现2909线程的编号是12
13 Thread 0xad5f2b70 (LWP 2908) 0x004ef0d7 in mq_timedreceive () from /lib/tls/i686/cmov/librt.so.1 12 Thread 0xad58eb70 (LWP 2909) 0x006e0422 in __kernel_vsyscall () 11 Thread 0xad52ab70 (LWP 2910) 0x006e0422 in __kernel_vsyscall () 10 Thread 0xad4f8b70 (LWP 2911) 0x006e0422 in __kernel_vsyscall () 9 Thread 0xad4c6b70 (LWP 2912) 0x006e0422 in __kernel_vsyscall () 8 Thread 0xad3feb70 (LWP 2913) 0x004ef0d7 in mq_timedreceive () from /lib/tls/i686/cmov/librt.so.1 7 Thread 0xace08b70 (LWP 2914) 0x004ef0d7 in mq_timedreceive () from /lib/tls/i686/cmov/librt.so.1 6 Thread 0xac607b70 (LWP 2915) 0x006e0422 in __kernel_vsyscall () 5 Thread 0xac5e6b70 (LWP 2916) 0x006e0422 in __kernel_vsyscall () 4 Thread 0xac361b70 (LWP 2917) 0x006e0422 in __kernel_vsyscall () 3 Thread 0xac2fdb70 (LWP 2918) 0x006e0422 in __kernel_vsyscall () 2 Thread 0xac1fcb70 (LWP 2919) 0x004ef0d7 in mq_timedreceive () from /lib/tls/i686/cmov/librt.so.1 * 1 Thread 0xb78496d0 (LWP 2907) 0x006e0422 in __kernel_vsyscall ()
使用thread 切换线程,使用bt显示线程栈
gdb>thread 12 gdb>bt
得到如下线程栈
#0 0x006e0422 in __kernel_vsyscall () #1 0x001cca26 in nanosleep () from /lib/tls/i686/cmov/libc.so.6 #2 0x001fc2dc in usleep () from /lib/tls/i686/cmov/libc.so.6 #3 0x0806b510 in OspTaskDelay () #4 0x0805c710 in CDispatchTask::NodeMsgSendToSock() () #5 0x0805cc74 in DispatchTaskEntry () #6 0x0806a8e9 in OspTaskTemplateFunc(void*) () #7 0x00d4780e in start_thread () from /lib/tls/i686/cmov/libpthread.so.0 #8 0x002027ee in clone () from /lib/tls/i686/cmov/libc.so.6
ps + strace
得到进程ID 21465ps -e |grep cmu 4996 ? 00:00:25 cmu_fjga_sp3 21465 pts/5 00:08:10 cmu
得到线程时间, 其中最占CPU的是 EpollRecvTask 21581
ps -eL |grep 21465
21465 21579 pts/5 00:00:00 CamApp
21465 21580 pts/5 00:00:00 TimerMan Task
21465 21581 pts/5 00:09:02 EpollRecvTask
21465 21582 pts/5 00:00:00
使用 strace -p 21581 得到线程栈
相关文章推荐
- Linux下,多线程程序死循环问题调试
- Linux下,多线程程序死循环问题调试
- Linux下,多线程程序死循环问题调试
- Linux下,多线程程序死循环问题调试
- Linux 多线程程序调用malloc,backtrace引发死锁问题的调试
- Linux下面 多线程死锁问题的调试
- linux运行自启动死循环java程序 及 必须注意的问题
- linux下多线程服务器处理程序问题解决
- Linux下adb驱动问题Linux下使用手机USB调试模式连接ADB进行Android程序的调试
- Linux下使用gdb调试多进程与多线程程序
- 分类: linux程序设计 2012-03-06 22:01 10646人阅读 评论(5) 收藏 举报 linux语言c工具gcc程序开发 调试是每个程序员都会面临的问题. 如何提高程序员的调
- Linux 多线程调试(内存占用、死循环、CPU占用率高……)
- Linux 多线程调试(内存占用、死循环、CPU占用率高……)
- Linux:如何使用gdb调试多进程多线程程序
- Linux gdb调试及后台程序问题
- 一个死循环程序无法重定向的问题?linux
- linux下多线程程序的链接选项问题
- Linux下多线程程序调试方法
- Linux下多线程程序调试方法
- gdb 调试多线程程序 问题集锦