记一次Thrift Server错误排查

最新推荐文章于 2025-10-03 04:52:44 发布

原创最新推荐文章于 2025-10-03 04:52:44 发布 · 7.2k 阅读

1 ·

CC 4.0 BY-SA版权

文章标签：

#hbase #邮件 #线程 #tcp

解决问题专栏收录该内容

2 篇文章

订阅专栏

1. 起因

1.1. 昨夜下班回家后11点左右突然收到hbase监控的连接失败邮件。

1.2. 上线查看监控日志，发现均是timeout异常

1.3. 查看消费者日志，发现所有连接不可用，不断重试中，遂关闭所有消费者

2. 进一步排查

2.1. 早上九点到公司开始继续排查问题

2.2. 查看hbase thrift 日志，发现大量重试，发现thrift处理线程已满，大量等待任务

$ tail -n 1000 hbase-hbase-thrift2-idc02-argo-ds-743.log
2017-02-21 09:50:40,578 ERROR [pool-1-thread-49085] thrift.ProcessFunction: Internal error processing get
java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.util.BoundedCompletionService$QueueingFuture@6d8d5d1d rejected from java.util.concurrent.ThreadPoolExecutor@42d2ee91[Running, pool size = 256, active threads = 256, queued tasks = 25600, completed tasks = 37016]
......
Caused by: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.util.BoundedCompletionService$QueueingFuture@6d8d5d1d rejected from java.util.concurrent.ThreadPoolExecutor@42d2ee91[Running, pool size = 256, active threads = 256, queued tasks = 25600, completed tasks = 37016]
......
2017-02-21 10:01:49,958 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=26, retries=35, started=432112 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:02:10,155 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=27, retries=35, started=452309 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:02:30,189 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=28, retries=35, started=472343 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:02:50,386 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=29, retries=35, started=492540 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:03:10,495 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=30, retries=35, started=512649 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:03:30,689 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=31, retries=35, started=532843 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:03:50,717 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=32, retries=35, started=552871 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:04:10,829 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=33, retries=35, started=572983 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:04:30,857 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=34, retries=35, started=593011 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null

2.3. 查看服务器TCP连接发现大量CLOSE_WAIT连接

$ netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
TIME_WAIT 29
CLOSE_WAIT 16440
FIN_WAIT1 13
ESTABLISHED 62

3. 分析问题

客户端请求过多，导致服务器线程池无法处理，一部分请求直接处于等待完成状态。此时客户端由于设置timeout，于是超时后放弃连接并起新的连接。由于服务器还在处理其他任务，导致之前的连接无法正常close，于是这部分连接处于CLOSE_WAIT状态。
通过日志发现thrift自己本身会对请求失败的任务不断重试，并且重试次数高达35次，这也导致了处理速度过慢
服务器大量的CLOSE_WAIT说明系统CLOSE_WAIT设置时间过长