Troubleshooting a Flood of TIME_WAIT Connections and a Blocked Recv-Q

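This post records how two outages of our company's back-office systems were investigated and fixed. The first was caused by database calls timing out, which left a large number of connections in TIME_WAIT; it was fixed by pointing the call at the correct database IP. The second was caused by another database call that had not been updated; adjusting kernel parameters and updating the database configuration finally brought things back to normal.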

 

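Day 1 outage:

Symptoms:

The sales and after-sales group chats blew up: the old back office (version 1.0) had crashed, and some departments had not yet migrated their business to the new back office. I was in the middle of a game of Honor of Kings at the time, haha.

The back office returned 504 and would not open; when a page did occasionally load, it was very slow. About 10 minutes later the new back office went down too, which meant every business system in the company was paralyzed. This was serious.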

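Investigation and cause:

I logged in to the new back office first, since that is the one we care about more. Rule of thumb: if there is a slow log, read it; if there is not, enable one first.

A quick look at top showed a high load, sometimes around 20, swinging up and down rather than staying stable. Everything else looked more or less normal.

Next came netstat to check the connection counts: more than 8,000 connections in TIME_WAIT, far more than we normally see.

(Several thousand lines of output omitted here...)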
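A per-state tally makes a number like the 8,000 above easy to reproduce. The following is a minimal sketch, assuming `netstat -ant` style output in which the state is the sixth column; the function name `count_tcp_states` is made up for illustration and is not taken from the original session.

```shell
# Tally TCP connections by state, most frequent first.
# Reads `netstat -ant`-style lines on stdin (state assumed to be column 6).
count_tcp_states() {
    awk '$1 ~ /^tcp/ { n[$6]++ } END { for (s in n) print s, n[s] }' |
        sort -k2,2nr
}

# Usage (not run here):
#   netstat -ant | count_tcp_states
```

Header lines are skipped because their first field does not start with `tcp`, so the raw command output can be piped in unchanged.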


 
  strace -p <pid of http>

  strace -p <pid of php>

  # Use strace to trace the web server and PHP process IDs and see what they are currently doing
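Plain `strace -p` output scrolls quickly under load. strace's `-T` flag appends the time spent in each call as `<seconds>` at the end of the line, so a small filter can surface only the calls that block. This is a sketch assuming that `-T` format; the helper name `slow_syscalls` and the 1-second threshold are illustrative choices, not part of the original session.

```shell
# Print only strace lines whose syscall took at least N seconds.
# Expects `strace -T` output on stdin, where each line ends in <seconds>.
# $1: minimum duration in seconds.
slow_syscalls() {
    awk -v min="$1" '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            # Strip the surrounding angle brackets to get the duration.
            dur = substr($0, RSTART + 1, RLENGTH - 2)
            if (dur + 0 >= min + 0) print
        }'
}

# Usage (not run here; strace writes to stderr, hence 2>&1):
#   strace -T -tt -p "$pid" 2>&1 | slow_syscalls 1
```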

 

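The trace showed a pile-up of calls to a function that queries one particular database. Looking at the corresponding file, it turned out that this database had already been taken offline, and this spot was missed when we switched databases. Nothing showed up when we switched during the day, but at the evening peak a large number of calls queued up (I asked the developers the next morning: the timeout for this call is 5 seconds), which led to the flood of TIME_WAIT connections and jammed everything up.

Fix:

Changing the database address to the correct IP solved it.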
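Since the root cause was a leftover reference to a retired database, a recursive search of the code for the old address is a cheap check after any database switch. This is a sketch under stated assumptions: `find_stale_refs` is a hypothetical helper, and the address and path in the usage line are placeholders, not values from this incident.

```shell
# List every file and line under a directory that still mentions an address.
# $1: old address (matched as a fixed string), $2: directory to search.
# Exits non-zero when nothing is found, like grep itself.
find_stale_refs() {
    grep -rnF -- "$1" "$2"
}

# Usage (placeholder values):
#   find_stale_refs '192.0.2.10' /path/to/webroot
```

Matching with `-F` treats the dots in the address literally, so `192.0.2.10` does not accidentally match other strings.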

 

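Day 2: the old back office:

 Around 9 p.m. the same thing happened. I went straight to the connection counts and, sure enough, traffic was jammed again.

 

Blocked again: a large number of connections were not being released. The business was already down and complaints were pouring in, so the priority was to free the connections quickly and restore service.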
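The Recv-Q column of netstat output shows where data is piling up: for an established socket it is the number of bytes received but not yet read by the application, and for a listening socket it is the current accept backlog. The sketch below keeps only sockets with a non-zero Recv-Q, assuming the `netstat -ant` column order; `nonzero_recvq` is an illustrative name, not a command used in the original investigation.

```shell
# Show sockets whose Recv-Q is non-zero, largest queue first.
# Reads `netstat -ant`-style lines on stdin:
#   Proto Recv-Q Send-Q Local-Address Foreign-Address State
nonzero_recvq() {
    awk '$1 ~ /^tcp/ && $2 + 0 > 0 { print $2, $4, $5, $6 }' |
        sort -k1,1nr
}

# Usage (not run here):
#   netstat -ant | nonzero_recvq
```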


 
  [root@vm-10 ]# cat /etc/sysctl.conf
  kernel.sysrq = 1
  kernel.core_uses_pid = 1
  kernel.msgmnb = 65536
  kernel.msgmax = 65536
  kernel.shmmax = 68719476736
  kernel.shmall = 4294967296
  net.ipv4.ip_forward = 0
  net.ipv4.conf.default.accept_source_route = 0
  net.ipv4.tcp_syncookies = 1
  net.ipv4.conf.default.rp_filter = 2
  net.ipv4.conf.all.rp_filter = 2
  net.ipv4.conf.all.arp_announce = 2

  net.ipv4.tcp_tw_reuse = 1 # added
  net.ipv4.tcp_tw_recycle = 1 # added
  net.ipv4.tcp_fin_timeout = 30 # added
  [root@vm-10 ]#

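net.ipv4.tcp_syncookies = 1 enables SYN cookies. When the SYN backlog queue overflows, cookies are used to handle new connection requests, which protects against small SYN floods. It is off (0) by default on older kernels; newer kernels enable it by default.
net.ipv4.tcp_tw_reuse = 1 enables reuse, allowing TIME-WAIT sockets to be reused for new TCP connections. It is off (0) by default here. It only applies to outgoing connections and depends on TCP timestamps (net.ipv4.tcp_timestamps).
net.ipv4.tcp_tw_recycle = 1 enables fast recycling of TIME-WAIT sockets. It is off (0) by default. This option is known to drop connections from clients behind NAT and was removed in Linux 4.12, so it only exists on older kernels.
net.ipv4.tcp_fin_timeout = 30 sets how long, in seconds, an orphaned connection may remain in FIN_WAIT_2.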
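Before relying on these settings it helps to confirm what the running kernel actually exposes, since `net.ipv4.tcp_tw_recycle` is absent on kernels from 4.12 onward. The sketch below reads values straight from `/proc/sys`; `show_sysctl` and the `PROC_SYS` override (handy for trying it against a test directory) are illustrative assumptions, not part of the original fix.

```shell
# Print the current value of a sysctl key, or report that it is absent.
# $1: dotted key, e.g. net.ipv4.tcp_tw_reuse
# PROC_SYS may point at an alternative root; it defaults to /proc/sys.
show_sysctl() {
    # net.ipv4.tcp_tw_reuse -> /proc/sys/net/ipv4/tcp_tw_reuse
    sysctl_path="${PROC_SYS:-/proc/sys}/$(printf '%s' "$1" | tr . /)"
    if [ -r "$sysctl_path" ]; then
        printf '%s = %s\n' "$1" "$(cat "$sysctl_path")"
    else
        printf '%s: not available on this kernel\n' "$1"
    fi
}

# Usage (not run here):
#   for key in net.ipv4.tcp_tw_reuse net.ipv4.tcp_tw_recycle net.ipv4.tcp_fin_timeout; do
#       show_sysctl "$key"
#   done
# After editing /etc/sysctl.conf, `sysctl -p` loads the file into the running kernel.
```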

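Sure enough, the connection count dropped quickly after the change. I then used strace to look at the system calls and found that there was still one more database reference that had not been updated.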


 
  lstat("/aaa/www/html/hangye.aaa.com/bbb_OPEN_VIDEO/index.php", {st_mode=S_IFREG|0644, st_size=14994, ...}) = 0
  stat("/aaa/www/html/hangye.aaa.com/s", 0x7ffd5f5e22b0) = -1 ENOENT (No such file or directory)
  lstat("/aaa", {st_mode=S_IFDIR|0755, st_size=103, ...}) = 0
  lstat("/aaa/www", {st_mode=S_IFDIR|0755, st_size=38, ...}) = 0
  lstat("/aaa/www/html", {st_mode=S_IFDIR|0755, st_size=107, ...}) = 0
  lstat("/aaa/www/html/hangye.aaa.com", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
  lstat("/aaa/www/html/hangye.aaa.com/s", 0x7ffd5f5e22b0) = -1 ENOENT (No such file or directory)
  setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={60, 0}}, NULL) = 0
  rt_sigaction(SIGPROF, {0x7f5710190cc0, [PROF], SA_RESTORER|SA_RESTART, 0x7f5719175660}, {0x7f5710190cc0, [PROF], SA_RESTORER|SA_RESTART, 0x7f5719175660}, 8) = 0
  rt_sigprocmask(SIG_UNBLOCK, [PROF], NULL, 8) = 0
  getcwd("/", 4095) = 2
  chdir("/aaa/www/html/hangye.aaa.com/bbb_OPEN_VIDEO") = 0
  setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={30, 0}}, NULL) = 0
  gettimeofday({1502634914, 509085}, NULL) = 0
  open("/aaa/www/html/hangye.aaa.com/bbb_OPEN_VIDEO/index.php", O_RDONLY) = 31

  
