JDK中常用的监控和排错工具
从一个普通的Linux用户到JVM语言的开发者,这样的转变花费了不少的时间才完成。Java作为一门工业级编程语言,具有许多成熟的框架和工具。但是通常初学者并不能很好的利用这些工具,如何快速有效的定位程序的问题,一直是博主心中的痛点。结合最近的经历和学习,便写一篇笔记总结,方便以后查阅。
在JVM语言开发中,常常碰到以下问题:
* 程序假死,响应慢
* 线程死锁
* CPU占用率高
* OOM(OutOfMemoryException)
* 内存泄露
……
这些问题一般可以通过重启(程序)来解决,但是这终究是个“制标不治本”的方法。实际上通过定位、理解并解决这些问题才是“正道”,而JDK本身则提供了许多有用的工具来帮助开发人员。常用的工具有jps,jstat,jstack,jmap,jhat。
1.jps - Java Virtual Machine Process Status Tool
jps是一个最常用的指令,用于查看系统中各JVM进程的信息。
$ jps
29907 HeronInstance
30012 MetricsManager
30117 HeronInstance
30013 MetricsManager
16985 QuorumPeerMain
16943 Kafka
如果想要看进程更详细的信息,jps提供了一些选项:
-q
Suppress the output of the class name, JAR file name, and arguments passed to the main method, producing only a list of local VM identifiers.
-m
Output the arguments passed to the main method. The output may be null for embedded JVMs.
-l
Output the full package name for the application's main class or the full path name to the application's JAR file.
-v
Output the arguments passed to the JVM.
-V
Output the arguments passed to the JVM through the flags file (the .hotspotrc file or the file specified by the -XX:Flags=<filename> argument).
-Joption
Pass option to the java launcher called by jps. For example, -J-Xms48m sets the startup memory to 48 megabytes. It is a common convention for -J to pass options to the underlying VM executing applications written in Java.
其中博主常用的有-m
和-l
。-m可以查看传给程序中Main函数的参数(命令行或JVM参数);
$ jps -m
30012 MetricsManager metricsmgr-0 33191 ExclamationTopology ExclamationTopology760273e3-4dbf-4d26-9f64-4ed589b7c830 ./heron-conf/heron_internals.yaml ./heron-conf/metrics_sinks.yaml
30013 MetricsManager metricsmgr-1 42725 ExclamationTopology ExclamationTopology760273e3-4dbf-4d26-9f64-4ed589b7c830 ./heron-conf/heron_internals.yaml ./heron-conf/metrics_sinks.yaml
16943 Kafka /usr/lib/kafka/bin/../config/server.properties
4650 Jps -m
29907 HeronInstance ExclamationTopology ExclamationTopology760273e3-4dbf-4d26-9f64-4ed589b7c830 container_1_exclaim1_1 exclaim1 1 0 stmgr-1 39645 42725 ./heron-conf/heron_internals.yaml
30117 HeronInstance ExclamationTopology ExclamationTopology760273e3-4dbf-4d26-9f64-4ed589b7c830 container_1_word_2 word 2 0 stmgr-1 39645 42725 ./heron-conf/heron_internals.yaml
4495 ZooKeeperMain -server localhost:2181
16985 QuorumPeerMain /usr/lib/zookeeper/bin/../conf/zoo.cfg
-l则可以查看运行的程序的artifactory,来自于哪个Jar包
$ jps -l
30012 com.twitter.heron.metricsmgr.MetricsManager
30013 com.twitter.heron.metricsmgr.MetricsManager
4719 sun.tools.jps.Jps
16943 kafka.Kafka
29907 com.twitter.heron.instance.HeronInstance
30117 com.twitter.heron.instance.HeronInstance
4495 org.apache.zookeeper.ZooKeeperMain
16985 org.apache.zookeeper.server.quorum.QuorumPeerMain
通过Jps,能快速找到程序的pid,方便进一步定位问题。
2.jstat - Java Virtual Machine Statistics Monitoring Tool
以每秒查看GC,共输出10条记录为例子:
$ jstat -gc 29907 100 10
S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
512.0 512.0 0.0 0.0 998400.0 0.0 1998848.0 62155.4 83968.0 83968.0 11731 36.405 11713 1234.847 1271.252
512.0 512.0 0.0 0.0 998400.0 0.0 1998848.0 62155.2 83968.0 83968.0 11741 36.432 11723 1235.831 1272.263
512.0 512.0 0.0 0.0 998400.0 0.0 1998848.0 62155.5 83968.0 83968.0 11751 36.458 11733 1236.819 1273.277
512.0 512.0 32.0 0.0 998400.0 0.0 1998848.0 62155.4 83968.0 83968.0 11760 36.485 11742 1237.780 1274.265
512.0 512.0 32.0 0.0 998400.0 0.0 1998848.0 62155.4 83968.0 83968.0 11768 36.519 11750 1238.665 1275.184
512.0 512.0 0.0 0.0 998400.0 0.0 1998848.0 62155.4 83968.0 83968.0 11777 36.548 11759 1239.706 1276.254
512.0 512.0 0.0 0.0 998400.0 0.0 1998848.0 62155.4 83968.0 83968.0 11787 36.574 11768 1240.743 1277.317
512.0 512.0 0.0 0.0 998400.0 0.0 1998848.0 62155.4 83968.0 83968.0 11796 36.604 11778 1241.634 1278.238
512.0 512.0 0.0 0.0 998400.0 0.0 1998848.0 62155.2 83968.0 83968.0 11806 36.634 11788 1242.676 1279.310
512.0 512.0 0.0 0.0 998400.0 0.0 1998848.0 62155.3 83968.0 83968.0 11815 36.661 11797 1243.645 1280.306
S0C
, S1C
, S0U
, S1U
代表Survior 0和1区的capacity和usage
EC
, EU
, OC
, OU
, PC
, PU
代表Eden,Old,Permanent区的容量和使用量
YGC
, YGCT
, FGC
, FGCT
代表从JVM启动以来Young区GC和Full GC的次数和耗时(Time, 单位为秒)
GCT
代表GC总耗时,是YGCT和FGCT之和
从上可以看出,该JVM程序卡住的问题很可能由于FGC次数多。如果不懂JVM中对内存的划分和管理,可以参考另一篇文。Troubleshooting Memory Issues in Java Applications
3.jstack - Stack Trace
调试程序时,一个重要的关注点就是stack trace,根据堆栈信息就可以找到具体的代码位置,以免盲目排查问题。jstack可以打印出指定进程、core dump文件或者远程JVM中线程的stack trace信息。
jstack -l 29907
Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.80-b11 mixed mode):
"Attach Listener" daemon prio=10 tid=0x00007f7c24001000 nid=0x171f waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Locked ownable synchronizers:
- None
"DestroyJavaVM" prio=10 tid=0x00007f7c5c00b800 nid=0x74d4 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Locked ownable synchronizers:
- None
"SlaveThread" prio=10 tid=0x00007f7c5c298800 nid=0x74e4 in Object.wait() [0x00007f7c45df8000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000ee1cd530> (a java.lang.Object)
at com.twitter.heron.common.basics.SlaveLooper.doWait(SlaveLooper.java:47)
- locked <0x00000000ee1cd530> (a java.lang.Object)
at com.twitter.heron.common.basics.WakeableLooper.runOnce(WakeableLooper.java:72)
at com.twitter.heron.common.basics.WakeableLooper.loop(WakeableLooper.java:64)
at com.twitter.heron.instance.Slave.run(Slave.java:169)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x00000000ee0407e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"GatewayThread" prio=10 tid=0x00007f7c5c293000 nid=0x74e3 runnable [0x00007f7c45ef9000]
java.lang.Thread.State: RUNNABLE
……
其中关于线程的基础信息如下
"SlaveThread" prio=10 tid=0x00007f7c5c298800 nid=0x74e4 in Object.wait() [0x00007f7c45df8000]
| | | | | + ???
| | | | + 说明在等待Object的锁
| | | + 在OS中线程的id
| | + 在JVM中的线程id,JVM的线程是基于OS的线程
| + 优先级
+ 线程类型
可以看出该线程处于等待状态:- waiting on <0x00000000ee1cd530> (a java.lang.Object)
接着查看进程里哪些线程的CPU占用率高
$ top -Hp 29907
Threads: 107 total, 4 running, 103 sleeping, 0 stopped, 0 zombie
%Cpu(s): 60.9 us, 1.2 sy, 0.1 ni, 37.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 11991800 total, 3699040 free, 6984008 used, 1308752 buff/cache
KiB Swap: 3963900 total, 3637072 free, 326828 used. 4296828 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29916 root 20 0 5505572 709388 29300 S 54.0 5.9 61:20.00 java
29914 root 20 0 5505572 709388 29300 R 42.3 5.9 45:09.51 java
29913 root 20 0 5505572 709388 29300 R 41.7 5.9 45:08.21 java
29915 root 20 0 5505572 709388 29300 R 41.3 5.9 45:07.39 java
29912 root 20 0 5505572 709388 29300 R 39.3 5.9 45:08.47 java
其中有五条线程的CPU占用率高,而29916是最高的,那么先来看看它在做什么。
$ jstack 29907 | grep `printf "%x\n" 29916` # 系统的pid 29916对应jstack中输出的nid,所以需要转换成十六进制
"VM Thread" prio=10 tid=0x00007fd8cc076000 nid=0xe98 runnable
这是一条JVM自己的线程,没有提供太多有用的信息,那么来看看剩下四条线程在干什么
$ jstack 29907 | grep `printf "%x\n" 29914`
"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007fd8cc030000 nid=0xe96 runnable
$ jstack 29907 | grep `printf "%x\n" 29915`
^[[A"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007fd8cc031800 nid=0xe97 runnable
$ jstack 29907 | grep `printf "%x\n" 29913`
"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007fd8cc02e000 nid=0xe95 runnable
$ jstack 29907 | grep `printf "%x\n" 29912`
"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007fd8cc02c000 nid=0xe94 runnable
四条GC task线程,和上文jstat中的结论一致,说明GC占用了大量CPU资源,导致程序卡顿。
4.jmap - Memory Map and jhat - Java Heap Analysis Tool
jmap用来查看堆内存使用状况,一般结合jhat使用。
值得注意的是,使用jmap采集信息时,最好停止该JVM进程的操作,即停止与外部的交互。一方面,这样才能采集到准确的heap信息,另一方面jmap会使JVM进程或者内部线程停止一段时间,直到采集完成。
-histo 和-heap 工作原理是不同的:
- -histo使用了 Dynamic Attach机制,运行在被采集的JVM进程里,会停止JVM里原有的各线程;
- -heap则通过HotSpot Serviceability Agent,运行在采集工具jmap本身的进程里,会停止整个JVM进程;
- 无论那种方式,一旦被采集的JVM进程仍持续产生大量Object,jmap就有OOM的风险。
- 详细信息可以查看jmap vs. jmap -F, as well as jstack vs. jstack -F use completely different mechanisms to communcate with the target JVM.
关于”停止时间”,heap的大小会决定进程/线程停止多久。例如,假设dump一个1GB的heap文件,耗时肯定是以秒来记的。
以下是jmap的参数解释:
-dump:[live,]format=b,file=<filename>
Dumps the Java heap in hprof binary format to filename. The live suboption is optional. If specified, only the live objects in the heap are dumped. To browse the heap dump, you can use jhat (Java Heap Analysis Tool) to read the generated file.
-finalizerinfo
Prints information on objects awaiting finalization.
-heap
Prints a heap summary. GC algorithm used, heap configuration and generation wise heap usage are printed.
-histo[:live]
Prints a histogram of the heap. For each Java class, number of objects, memory size in bytes, and fully qualified class names are printed. VM internal class names are printed with '*' prefix. If the live suboption is specified, only live objects are counted.
-permstat
Prints class loader wise statistics of permanent generation of Java heap. For each class loader, its name, liveness, address, parent class loader, and the number and size of classes it has loaded are printed. In addition, the number and size of interned Strings are printed.
-F
Force. Use with jmap -dump or jmap -histo option if the pid does not respond. The live suboption is not supported in this mode.
-h
Prints a help message.
-help
Prints a help message.
-J<flag>
Passes <flag> to the Java virtual machine on which jmap is run.
$ jmap -J-d64 -heap 29907 # -J-d64代表64位的JVM, -heap查看进程堆内存使用情况,包括使用的GC算法、堆配置参数和各代中堆内存使用情况
Attaching to process ID 29907, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.80-b11
using thread-local object allocation.
Parallel GC with 4 thread(s)
Heap Configuration:
MinHeapFreeRatio = 0
MaxHeapFreeRatio = 100
MaxHeapSize = 3070230528 (2928.0MB)
NewSize = 1310720 (1.25MB)
MaxNewSize = 17592186044415 MB
OldSize = 5439488 (5.1875MB)
NewRatio = 2
SurvivorRatio = 8
PermSize = 21757952 (20.75MB)
MaxPermSize = 85983232 (82.0MB)
G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
PS Young Generation
Eden Space:
capacity = 1022361600 (975.0MB)
used = 0 (0.0MB)
free = 1022361600 (975.0MB)
0.0% used
From Space:
capacity = 524288 (0.5MB)
used = 0 (0.0MB)
free = 524288 (0.5MB)
0.0% used
To Space:
capacity = 524288 (0.5MB)
used = 0 (0.0MB)
free = 524288 (0.5MB)
0.0% used
PS Old Generation
capacity = 2046820352 (1952.0MB)
used = 63744656 (60.79164123535156MB)
free = 1983075696 (1891.2083587646484MB)
3.114325882958584% used
PS Perm Generation
capacity = 85983232 (82.0MB)
used = 85983216 (81.99998474121094MB)
free = 16 (1.52587890625E-5MB)
99.99998139172065% used
21007 interned Strings occupying 3147464 bytes.
$ jmap -J-d64 -permstat 29907 # 统计进程permanent区的信息
Attaching to process ID 29907, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.80-b11
finding class loader instances ..done.
computing per loader stat ..done.
please wait.. computing liveness...............................liveness analysis may be inaccurate ...
class_loader classes bytes parent_loader alive? type
<bootstrap> 972 5868136 null live <internal>
0x00000000ee14a4f8 1 1888 0x00000000ee004388 dead sun/reflect/DelegatingClassLoader@0x00000000f8050c70
0x00000000ee058700 1 1888 0x00000000ee004388 dead sun/reflect/DelegatingClassLoader@0x00000000f8050c70
0x00000000ee004388 751 5438056 0x00000000ee01bfd8 live sun/misc/Launcher$AppClassLoader@0x00000000f821b748
0x00000000ee01bfd8 0 0 null live sun/misc/Launcher$ExtClassLoader@0x00000000f81c0d78
0x00000000ee058740 1 3056 0x00000000ee004388 dead sun/reflect/DelegatingClassLoader@0x00000000f8050c70
0x00000000ee058780 1 3056 0x00000000ee004388 dead sun/reflect/DelegatingClassLoader@0x00000000f8050c70
0x00000000ee058840 1 3056 0x00000000ee004388 dead sun/reflect/DelegatingClassLoader@0x00000000f8050c70
0x00000000ee0587c0 1 1888 0x00000000ee004388 dead sun/reflect/DelegatingClassLoader@0x00000000f8050c70
0x00000000ee06e8c8 0 0 0x00000000ee004388 dead java/util/ResourceBundle$RBClassLoader@0x00000000f849a5b8
0x00000000ee058800 1 1888 0x00000000ee004388 dead sun/reflect/DelegatingClassLoader@0x00000000f8050c70
0x00000000ee0586c0 1 3056 0x00000000ee004388 dead sun/reflect/DelegatingClassLoader@0x00000000f8050c70
total = 12 1731 11325968 N/A alive=3, dead=9 N/A
$ jmap -J-d64 -histo:live 29907 # 统计堆中存活对象的数目
num #instances #bytes class name
----------------------------------------------
1: 124715 44776120 [S
2: 143040 19941440 <constMethodKlass>
3: 143040 18319424 <methodKlass>
4: 12237 16013592 <constantPoolKlass>
5: 12237 11496200 <instanceKlassKlass>
6: 10019 8319232 <constantPoolCacheKlass>
7: 46771 6290840 [C
8: 41115 6217536 [B
9: 1553 4303592 <methodDataKlass>
10: 13045 1518056 java.lang.Class
11: 39 1278576 [Lscala.concurrent.forkjoin.ForkJoinTask;
12: 23396 1206288 [[I
13: 36613 1171616 java.util.Hashtable$Entry
14: 45121 1082904 java.lang.String
15: 31037 993184 java.util.HashMap$Entry
16: 22930 733760 java.util.concurrent.ConcurrentHashMap$HashEntry
对class name
这一列,需要对其含义说明一下:
B byte
C char
D double
F float
I int
J long
Z boolean
[ 数组,如[I表示int[]
[L+类名 其他对象
< > Klass
当jmap的输出信息过多,不方便在命令行查看时,可以先dump成文件,在通过jhat在浏览器中详细查看。
$ jmap -dump:format=b,file=file.dump 29907
Dumping heap to /home/horatio/file.dump ...
Heap dump file created
$ jhat -J-Xmx256m -port 20000 ~/file.dump # -J-Xmx指定最大堆内存,用于加载较大的dump的文件
Reading from /home/horatio/file.dump...
Dump file created Sat Sep 17 17:26:21 HKT 2016
Snapshot read, resolving...
Resolving 483745 objects...
Chasing references, expect 96 dots................................................................................................
Eliminating duplicate references................................................................................................
Snapshot resolved.
Started HTTP server on port 20000
Server is ready.
接着在浏览器中访问http://localhost:20000
, 点击对应的类,可以查看详细的信息