Troubleshooting High CPU Usage
Problem description
On an M4000 host running Solaris 10, the resin process was using excessive CPU, reaching about 70%:
-bash-3.00$ ps -ef -o pid,pcpu,args|grep java
 1511  0.1 /usr/java/bin/java -Dwebview.htdocs=/etc/opt/FJSVwvcnf/htdocs/FJSVwvbs -mx128m
 2135  0.0 /usr/java/bin/java -server -Xmx128m -XX:+BackgroundCompilation -XX:PermSize=32m
15945  0.0 sh -c /svi/jdk150/jdk1.5.0_06/bin/java -server -Xms512m -Xmx3072m -XX:MaxPe
15946 70.7 /svi/jdk150/jdk1.5.0_06/bin/java -server -Xms512m -Xmx3072m -XX:MaxPermSize=
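Incidentally, sorting on the pcpu column makes the busiest java process stand out without eyeballing; a minimal sketch (the bracketed `[j]ava` pattern is just a common trick that stops grep from matching its own command line):

ps -ef -o pid,pcpu,args | grep '[j]ava' | sort -nk2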
Troubleshooting steps
1. First, find the LWPs with high CPU usage:
-bash-3.00$ prstat -L -p 15946
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/LWPID
 15946 slview   3336M 3301M sleep   15    0   3:56:27 2.2% java/49
 15946 slview   3336M 3301M sleep    8    0   3:33:17 2.2% java/52
 15946 slview   3336M 3301M sleep   12    0   3:32:20 2.2% java/50
 15946 slview   3336M 3301M sleep   13    0   3:29:43 2.2% java/51
 15946 slview   3336M 3301M sleep   13    0   3:30:54 2.2% java/47
 15946 slview   3336M 3301M sleep   12    0   1:24:19 2.2% java/64
 15946 slview   3336M 3301M sleep   15    0   1:07:55 2.1% java/144
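To grab the LWPID of the hottest thread programmatically rather than reading it off the screen, something like the following should work; a sketch that assumes the output format shown above (PROCESS/LWPID in the last column, busiest LWP on the first data row):

prstat -L -p 15946 1 1 | awk 'NR == 2 { split($NF, a, "/"); print a[2] }'   # prints 49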
2. Map each LWP to its Java thread:
-bash-3.00$ pstack 15946|grep lwp
----------------- lwp# 47 / thread# 47 --------------------
 ff2c49fc _lwp_start (0, 0, 0, 0, 0, 0)
----------------- lwp# 48 / thread# 48 --------------------
 ff2c5cd0 lwp_cond_wait (1704928, 1704910, 0, 0)
 ff2c49fc _lwp_start (0, 0, 0, 0, 0, 0)
----------------- lwp# 49 / thread# 49 --------------------
 ff2c49fc _lwp_start (0, 0, 0, 0, 0, 0)
----------------- lwp# 50 / thread# 50 --------------------
 ff2c49fc _lwp_start (0, 0, 0, 0, 0, 0)
----------------- lwp# 51 / thread# 51 --------------------
 ff2c49fc _lwp_start (0, 0, 0, 0, 0, 0)
----------------- lwp# 52 / thread# 52 --------------------
 ff2c49fc _lwp_start (0, 0, 0, 0, 0, 0)
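The lwp# that pstack prints is the same number prstat shows after the slash in PROCESS/LWPID, so the native stack of a single suspect LWP can be pulled directly; a sketch using the pid/lwp argument form that Solaris pstack accepts:

pstack 15946/49        # native stack of LWP 49 only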
3. Use jstack <pid> to get the call stack of each thread:
$ jstack -m 15946    # dump the call stacks of all threads
Thread t@50: (state = IN_VM)
 - java.lang.AbstractStringBuilder.expandCapacity(int) @bci=28, line=99 (Compiled frame; information may be imprecise)
 - per.xwnmp.flux.report.RptFluxHisQuery.GetFluxData(java.lang.String[], java.util.HashMap, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String) @bci=480, line=509 (Interpreted frame)
 - per.xwnmp.flux.report.RptFluxHisQuery.GenFluxReport(java.lang.String, java.lang.String[], java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String) @bci=124, line=82 (Interpreted frame)
 - _nos._flux._flux._FluxPerfView_0Excel__jsp._jspService(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) @bci=930, line=162 (Interpreted frame)
 - com.caucho.jsp.JavaPage.service(javax.servlet.ServletRequest, javax.servlet.ServletResponse) @bci=9, line=75 (Interpreted frame)
 - com.caucho.jsp.Page.subservice(com.caucho.server.http.CauchoRequest, com.caucho.server.http.CauchoResponse) @bci=214, line=506 (Interpreted frame)
 - com.caucho.server.TcpConnection.run() @bci=73, line=139 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=595 (Interpreted frame)

Thread t@52: (state = IN_VM)
 - per.xwnmp.flux.report.RptFluxHisQuery.GetFluxData(java.lang.String[], java.util.HashMap, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String) @bci=435, line=508 (Compiled frame; information may be imprecise)
 - per.xwnmp.flux.report.RptFluxHisQuery.GenFluxReport(java.lang.String, java.lang.String[], java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String) @bci=124, line=82 (Interpreted frame)
 - _nos._flux._flux._FluxPerfView_0Excel__jsp._jspService(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) @bci=930, line=162 (Interpreted frame)
 - com.caucho.jsp.JavaPage.service(javax.servlet.ServletRequest, javax.servlet.ServletResponse) @bci=9, line=75 (Interpreted frame)
 - com.caucho.jsp.Page.subservice(com.caucho.server.http.CauchoRequest, com.caucho.server.http.CauchoResponse) @bci=214, line=506 (Interpreted frame)
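Because `jstack -m` output runs long, it is easier to save it to a file and slice out one thread at a time; a sketch (the output path is arbitrary):

jstack -m 15946 > /tmp/jstack.15946.out            # hypothetical output file
awk '/^Thread t@50:/,/^$/' /tmp/jstack.15946.out   # print only thread t@50's frames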
To summarize the procedure:
1. First use the prstat command to find the process with high CPU usage.
2. Send that process the thread-dump instruction kill -3 PID (kill -3 1792 in this example); the process writes the Thread Dump to its log file. The attached file is the Thread Dump output for this example.
3. Run prstat -L on the same PID to show the CPU usage of each of its threads.
4. Take the LWPID of each high-CPU thread, convert it to hexadecimal, and match it against the nid of each thread in the Thread Dump; in short, Hex(LWPID) = nid (a sketch of this conversion follows the list).
5. Look up the call chain of the thread with that nid, focusing on our own code.
6. From the above, conclude which business operation or piece of code caused the high CPU usage. In this case every high-CPU thread pointed at the same piece of code, which settled the diagnosis.
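A minimal sketch of the LWPID-to-nid conversion in step 4 (the thread-dump file name here is hypothetical):

printf '0x%x\n' 49              # 49 decimal -> 0x31
grep 'nid=0x31' threaddump.log  # locate the matching thread in the dump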
On Linux the same diagnosis can be made as follows:
1. Identify the id of the thread that is consuming the CPU.
Method 1: run ps Hh -eo pid,tid,pcpu | sort -nk3 | tail to get the corresponding process and thread ids directly, then jump straight to step 2.
Method 2: first find which process is using too much CPU (top or ps aux) to get the pid; then change into /proc/<pid>/task and run grep SleepAVG */status | sort -k2,2 | head to pick out the thread ids with high CPU usage (a combined one-liner is sketched below).
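The two methods above can also be rolled into one per-thread listing; a sketch, with <pid> as a placeholder for the real process id:

ps -L -o pid,lwp,pcpu,comm -p <pid> | sort -nk3 | tail   # busiest threads of that process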
2. Run jstack <pid>, or kill -3 <pid>, to print the thread stacks. jstack writes to the standard output of the jstack command itself, while kill -3 makes the target JVM write the dump to its own standard output. Using the thread id obtained in step 1, find the code that thread is executing in the stack. Example output:
Thread 2060: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
Error occurred during stack walking:
Thread 2059: (state = IN_NATIVE)
 - java.net.PlainSocketImpl.socketAccept(java.net.SocketImpl) @bci=0 (Interpreted frame)
 - java.net.PlainSocketImpl.accept(java.net.SocketImpl) @bci=7, line=384 (Interpreted frame)
 - java.net.ServerSocket.implAccept(java.net.Socket) @bci=50, line=450 (Interpreted frame)
 - java.net.ServerSocket.accept() @bci=48, line=421 (Interpreted frame)
 - org.apache.jk.common.ChannelSocket.accept(org.apache.jk.core.MsgContext) @bci=46, line=293 (Interpreted frame)
 - org.apache.jk.common.ChannelSocket.acceptConnections() @bci=68, line=647 (Interpreted frame)
 - org.apache.jk.common.ChannelSocket$SocketAcceptor.runIt(java.lang.Object[]) @bci=4, line=857 (Interpreted frame)
 - org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() @bci=167, line=684 (Interpreted frame)

A remaining question to check: is the database being connected to too frequently?
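If frequent database connections are the suspicion, one quick check is to count sockets to the database port over time; a sketch that assumes the database listens on port 3306 (adjust for the actual DB):

netstat -an | grep ':3306 ' | grep -c ESTABLISHED   # established DB connections right now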