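For complex concurrent programs, deadlock is a real headache. This post describes an after-the-fact remedy: when a running program misbehaves, attach gdb to inspect its state, and use that to find and pin down hidden errors such as deadlocks.
First, let's deliberately write some code that deadlocks: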
===========================Code separator==========================================
#include <iostream>
#include <unistd.h>
#include <boost/thread.hpp>

using namespace std;

boost::mutex mtx;
boost::mutex mtx2;

void run()
{
    boost::mutex::scoped_lock lock(mtx);       // takes mtx first
    sleep(1);                                  // gives run2() time to take mtx2
    {
        boost::mutex::scoped_lock lock(mtx2);  // waits for mtx2, held by run2()
        cout << "get two locks" << endl;
    }
}

void run2()
{
    boost::mutex::scoped_lock lock(mtx2);      // takes mtx2 first: opposite order
    sleep(1);                                  // gives run() time to take mtx
    {
        boost::mutex::scoped_lock lock(mtx);   // waits for mtx, held by run()
        cout << "get two locks" << endl;
    }
}

int main(int argc, char* argv[])
{
    boost::thread_group grp;
    grp.create_thread(run);
    grp.create_thread(run2);
    grp.join_all();                            // never returns once both threads block
    return 0;
}
===========================Demo separator==========================================
Compile: g++ -g test.cpp -pthread -lrt /usr/lib/libboost_thread-gcc41-mt.a
Run: ./a.out
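As expected, the program hangs. Now run the following in another terminal. (I have omitted a lot of irrelevant gdb output; user input is whatever follows the # and (gdb) prompts.)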
# ps a | grep a.out | grep -v "grep"
32087 pts/0 R+ 2:06 ./a.out
# gdb a.out 32087
...
(gdb) info threads
3 Thread 0x4115f940 (LWP 32219) 0x000000363f80d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
2 Thread 0x41fc5940 (LWP 32220) 0x000000363f80d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
* 1 Thread 0x2ba80b594230 (LWP 32218) 0x000000363f80aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
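As you can see, the process currently has three threads. There is not much to say about the main thread: it is blocked in join_all. Threads 2 and 3 are both sitting in __lll_lock_wait, that is, each is waiting for a mutex. Let's look at the stacks of those two threads: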
(gdb) thread 2
[Switching to thread 2 (Thread 0x41fc5940 (LWP 32220))]#0 0x000000363f80d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0 0x000000363f80d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x000000363f808e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x000000363f808cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00000000004037b3 in boost::mutex::lock (this=0x6107a0) at /usr/local/include/boost/thread/pthread/mutex.hpp:50
#4 0x000000000040381f in boost::unique_lock<boost::mutex>::lock (this=0x41fc50a0) at /usr/local/include/boost/thread/locks.hpp:349
#5 0x000000000040385a in boost::unique_lock<boost::mutex>::unique_lock (this=0x41fc50a0, m_=...) at /usr/local/include/boost/thread/locks.hpp:227
#6 0x000000000040225b in run2 () at test.cpp:25
#7 0x0000000000402e75 in boost::detail::thread_data<void (*)()>::run (this=0x65d2360) at /usr/local/include/boost/thread/detail/thread.hpp:56
#8 0x0000000000405a60 in thread_proxy ()
#9 0x0000000000000000 in ?? ()
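Note frame "#6 0x000000000040225b in run2 () at test.cpp:25": this is where the thread is blocked. It is the first frame in our own code, and the frames above it show run2() constructing a lock and waiting inside pthread_mutex_lock. The other thread can be inspected the same way (thread 3, then bt). If I were being paid for this article I would gladly demonstrate it once more, but sadly I am not, so that is it.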
===========================Showing-off separator==========================================
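Bug hunting is a stopgap of last resort; bugs are better prevented than caught. Even if the technique above can find a deadlock, when the call relationships between modules are unclear and the concurrency is a mess, you may well catch one bug after another, kill one and introduce two, and never finish fixing. So how do you avoid deadlock in a concurrent program? The textbook answer is to always acquire locks in the same order. But sometimes the problem is far from that simple, because consistent lock ordering mainly applies to the setting where a group of active objects requests a group of resources. I will write up the more complicated cases properly some other day.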
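For the simple case, "acquire locks in the same order" can be sketched as follows. This is only a minimal illustration, not a rewrite of the program above: it uses the C++11 standard library (std::mutex, std::thread) instead of Boost so that it builds without extra libraries, and run_ordered, run2_ordered, run_both and the both_held counter are hypothetical names made up for this example.

```cpp
#include <mutex>
#include <thread>

std::mutex mtx;
std::mutex mtx2;
int both_held = 0;  // hypothetical counter, only modified while both locks are held

// Same order as the original run(): mtx first, then mtx2.
void run_ordered()
{
    std::lock_guard<std::mutex> first(mtx);
    std::lock_guard<std::mutex> second(mtx2);
    ++both_held;
}

// Unlike the original run2(), this also takes mtx first and mtx2 second,
// so a thread holding mtx2 can never be waiting for mtx.
void run2_ordered()
{
    std::lock_guard<std::mutex> first(mtx);
    std::lock_guard<std::mutex> second(mtx2);
    ++both_held;
}

// Hypothetical driver: runs both functions concurrently and
// returns how many times both locks were held together.
int run_both()
{
    both_held = 0;
    std::thread t1(run_ordered);
    std::thread t2(run2_ordered);
    t1.join();
    t2.join();
    return both_held;
}
```

Calling run_both() from main (built with g++ -std=c++11 -pthread) should return instead of hanging, because whichever thread gets mtx first is the only one that can go on to request mtx2. When a fixed order is awkward to enforce by hand, C++11's std::lock(mtx, mtx2) is another option: it acquires several mutexes together using a deadlock-avoidance algorithm.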