Preface
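While reading the skywalking-agent source code, I noticed that the agent collects thread states. But which scenarios trigger each of these states? And once the thread count exceeds a threshold, how can we locate the problem more quickly? The analysis below starts from these questions.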
Thread state analysis
- The inner enum State of the java.lang.Thread class defines the six states of a Java thread

    public enum State {
        /**
         * Thread state for a thread which has not yet started.
         */
        NEW,

        /**
         * Thread state for a runnable thread.  A thread in the runnable
         * state is executing in the Java virtual machine but it may
         * be waiting for other resources from the operating system
         * such as processor.
         */
        RUNNABLE,

        /**
         * Thread state for a thread blocked waiting for a monitor lock.
         * A thread in the blocked state is waiting for a monitor lock
         * to enter a synchronized block/method or
         * reenter a synchronized block/method after calling
         * {@link Object#wait() Object.wait}.
         */
        BLOCKED,

        /**
         * Thread state for a waiting thread.
         * A thread is in the waiting state due to calling one of the
         * following methods:
         * <ul>
         *   <li>{@link Object#wait() Object.wait} with no timeout</li>
         *   <li>{@link #join() Thread.join} with no timeout</li>
         *   <li>{@link LockSupport#park() LockSupport.park}</li>
         * </ul>
         *
         * <p>A thread in the waiting state is waiting for another thread to
         * perform a particular action.
         *
         * For example, a thread that has called <tt>Object.wait()</tt>
         * on an object is waiting for another thread to call
         * <tt>Object.notify()</tt> or <tt>Object.notifyAll()</tt> on
         * that object. A thread that has called <tt>Thread.join()</tt>
         * is waiting for a specified thread to terminate.
         */
        WAITING,

        /**
         * Thread state for a waiting thread with a specified waiting time.
         * A thread is in the timed waiting state due to calling one of
         * the following methods with a specified positive waiting time:
         * <ul>
         *   <li>{@link #sleep Thread.sleep}</li>
         *   <li>{@link Object#wait(long) Object.wait} with timeout</li>
         *   <li>{@link #join(long) Thread.join} with timeout</li>
         *   <li>{@link LockSupport#parkNanos LockSupport.parkNanos}</li>
         *   <li>{@link LockSupport#parkUntil LockSupport.parkUntil}</li>
         * </ul>
         */
        TIMED_WAITING,

        /**
         * Thread state for a terminated thread.
         * The thread has completed execution.
         */
        TERMINATED;
    }

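The two states that are not simulated later in this post, NEW and TERMINATED, are easy to observe with Thread.getState(). The following is only a minimal sketch; the class name ThreadStateLifecycle and the method observe are made up for illustration and do not come from skywalking or the JDK.

```java
class ThreadStateLifecycle {
    /** Returns the state seen before start() and the state seen after join(). */
    static Thread.State[] observe() {
        Thread worker = new Thread(() -> { }, "worker");
        Thread.State beforeStart = worker.getState(); // thread object exists but was never started
        worker.start();
        try {
            worker.join(); // returns only once the worker has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        Thread.State afterJoin = worker.getState();
        return new Thread.State[] { beforeStart, afterJoin };
    }

    public static void main(String[] args) {
        Thread.State[] states = observe();
        System.out.println(states[0] + " -> " + states[1]); // prints NEW -> TERMINATED
    }
}
```

RUNNABLE is left out on purpose: a thread that is alive and not waiting reports RUNNABLE whether or not it is actually on a CPU at that moment.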
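- At the operating-system level, a thread has five kinds of states (the state transitions are shown in the figure below)
2.1 RUNNABLE in Java covers both RUNNING and READY in the OS
2.2 WAITING, TIMED_WAITING and BLOCKED in Java all correspond to WAITING in the OS
- The sections below simulate the scenarios described in the comments of java.lang.Thread$State, focusing on WAITING, TIMED_WAITING and BLOCKED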
Simulating the BLOCKED state
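- Translation of the BLOCKED Javadoc: thread state for a thread blocked waiting for a monitor lock. A thread in the blocked state is waiting for a monitor lock to enter a synchronized block/method, or to reenter a synchronized block/method after calling Object.wait
- Scenario 1: a thread in the blocked state is waiting for a monitor lock to enter a synchronized block/method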
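2.1 The simulation code is as follows

    @Test
    public void testSynchronizedWaitBlock() {
        Object monitor = new Object();
        Thread thread1 = new Thread(() -> {
            synchronized (monitor) {
                try {
                    // Releases the monitor and waits until another thread notifies
                    monitor.wait();
                } catch (InterruptedException e) {
                    System.out.println("Caught InterruptedException");
                    throw new RuntimeException(e);
                }
                System.out.println("Running thread1");
            }
        }, "thread1");
        Thread thread2 = new Thread(() -> {
            synchronized (monitor) {
                monitor.notifyAll();
                System.out.println("Running thread2");
            }
        }, "thread2");
        thread1.start();
        thread2.start();
        // Put a breakpoint here to pause the main thread and inspect the thread states
        System.out.println("Pause the main thread at a breakpoint");
    }

2.2 Until thread1 executes monitor.wait(), it still owns the monitor (assuming thread1 entered the synchronized block first), so thread2 is BLOCKED while it waits for ownership of the monitor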
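- Scenario 2: reentering a synchronized block/method after calling Object.wait
3.1 After thread1 has executed monitor.wait() and before thread2 executes monitor.notifyAll(), thread1 has released ownership of the monitor and is waiting, so its state is WAITING. The state of thread2 is RUNNABLE
3.2 After thread2 has executed monitor.notifyAll(), thread1 has to reacquire ownership of the monitor before it can resume, so its state changes to BLOCKED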
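The test method above relies on breakpoints to catch each state. As an alternative, the WAITING to BLOCKED transition of scenario 2 can be observed by polling Thread.getState(). This is a sketch under assumptions: the names WaitReentryDemo, awaitState and observe are hypothetical, the main thread plays the role of thread2, and the 5-second polling timeout is an arbitrary choice.

```java
class WaitReentryDemo {
    /** Polls until the thread reports the expected state or the timeout expires; returns the last state seen. */
    static Thread.State awaitState(Thread t, Thread.State expected, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        Thread.State state = t.getState();
        while (state != expected && System.nanoTime() < deadline) {
            Thread.yield();
            state = t.getState();
        }
        return state;
    }

    /** Returns {state while waiting, state after notifyAll while the notifier still owns the monitor}. */
    static Thread.State[] observe() {
        final Object monitor = new Object();
        final boolean[] notified = { false }; // only read and written while holding the monitor
        Thread waiter = new Thread(() -> {
            synchronized (monitor) {
                while (!notified[0]) { // loop guards against spurious wakeups
                    try {
                        monitor.wait();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        }, "waiter");
        waiter.start();

        Thread.State whileWaiting = awaitState(waiter, Thread.State.WAITING, 5_000);
        Thread.State afterNotify;
        synchronized (monitor) {
            notified[0] = true;
            monitor.notifyAll();
            // The waiter has been notified but cannot reenter: this thread still owns the monitor.
            afterNotify = awaitState(waiter, Thread.State.BLOCKED, 5_000);
        }
        try {
            waiter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new Thread.State[] { whileWaiting, afterNotify };
    }

    public static void main(String[] args) {
        Thread.State[] states = observe();
        System.out.println("before notifyAll: " + states[0] + ", after notifyAll: " + states[1]);
    }
}
```

Polling trades the manual breakpoint for a bounded busy-wait, which makes the observation repeatable without a debugger.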
Simulating the WAITING state
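- Translation of the WAITING Javadoc: a thread in the waiting state is waiting for another thread to perform a particular action. A thread enters the waiting state by calling one of the following methods:
1.1 Object.wait with no timeout
1.2 Thread.join with no timeout
1.3 LockSupport.park
- Scenario 1: Thread.join with no timeout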
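2.1 The simulation code is as follows

    // Thread.join waits for this thread to die
    @Test
    public void testJoin() {
        Thread thread1 = new Thread(() -> {
            System.out.println("Running thread1.....");
        }, "thread1");
        Thread thread2 = new Thread(() -> {
            try {
                // thread2 waits here until thread1 has terminated
                thread1.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            System.out.println("Running thread2.....");
        }, "thread2");
        thread1.start();
        thread2.start();
        // Put a breakpoint here to pause the main thread and inspect the thread states
        System.out.println("Pause the main thread at a breakpoint");
    }

2.2 thread2 calls thread1.join() and waits for thread1 to finish; while it waits, the state of thread2 is WAITING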
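In the test above thread1 finishes almost immediately, so thread2 is only WAITING for an instant unless thread1 is paused in a debugger. A possible way to hold the state long enough to read it is to keep the joined thread alive with a CountDownLatch. This is a sketch, not the original test; JoinWaitingDemo, awaitState and observe are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;

class JoinWaitingDemo {
    /** Polls until the thread reports the expected state or the timeout expires; returns the last state seen. */
    static Thread.State awaitState(Thread t, Thread.State expected, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        Thread.State state = t.getState();
        while (state != expected && System.nanoTime() < deadline) {
            Thread.yield();
            state = t.getState();
        }
        return state;
    }

    /** Returns the state of the joining thread while its target is still alive. */
    static Thread.State observe() {
        CountDownLatch release = new CountDownLatch(1);
        Thread target = new Thread(() -> {
            try {
                release.await(); // stay alive until the observer has looked at the joiner
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "target");
        Thread joiner = new Thread(() -> {
            try {
                target.join(); // no timeout
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "joiner");
        target.start();
        joiner.start();

        Thread.State observed = awaitState(joiner, Thread.State.WAITING, 5_000);
        release.countDown(); // let the target finish so the joiner can return
        try {
            joiner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return observed;
    }

    public static void main(String[] args) {
        System.out.println("joiner state: " + observe());
    }
}
```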
- Scenario 2: LockSupport.park
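3.1 The simulation code is as follows

    import java.util.concurrent.locks.LockSupport;

    /**
     * Disables the current thread for thread scheduling purposes unless the permit is available.
     * If the permit is not available, the current thread lies dormant until one of three things happens:
     * 1. Some other thread invokes unpark with the current thread as the target
     * 2. Some other thread interrupts the current thread
     * 3. The call spuriously (that is, for no reason) returns
     */
    @Test
    public void testLockSupport() {
        Thread thread1 = new Thread(() -> {
            System.out.println("do something start");
            LockSupport.park();
            System.out.println("do something end");
        }, "thread1");
        thread1.start();
        System.out.println("Granting a permit to the child thread thread1");
        // If this runs before thread1 parks, the permit is already available and park() returns at once
        LockSupport.unpark(thread1);
        System.out.println("Main thread finished");
    }
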
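3.2 thread1 calls LockSupport.park(), which puts the current thread (thread1) to sleep; its state is WAITING
3.3 The main thread calls LockSupport.unpark(thread1) to wake thread1 up; thread1 continues running and its state is RUNNABLE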
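Whether thread1 is ever seen as WAITING in the test above depends on timing, because unpark may be called before park. The sketch below waits until the thread is actually parked before granting the permit. ParkWaitingDemo, awaitState and observe are hypothetical names, and the AtomicBoolean flag is an addition that is not in the original test; it lets the parked thread re-check its condition after a spurious return.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

class ParkWaitingDemo {
    /** Polls until the thread reports the expected state or the timeout expires; returns the last state seen. */
    static Thread.State awaitState(Thread t, Thread.State expected, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        Thread.State state = t.getState();
        while (state != expected && System.nanoTime() < deadline) {
            Thread.yield();
            state = t.getState();
        }
        return state;
    }

    /** Returns the state of a thread that is parked without a timeout. */
    static Thread.State observe() {
        AtomicBoolean released = new AtomicBoolean(false);
        Thread parked = new Thread(() -> {
            while (!released.get()) {
                LockSupport.park(); // may return spuriously, so the flag is re-checked
            }
        }, "parked");
        parked.start();

        Thread.State observed = awaitState(parked, Thread.State.WAITING, 5_000);
        released.set(true);
        LockSupport.unpark(parked); // hand over the permit so park() returns
        try {
            parked.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return observed;
    }

    public static void main(String[] args) {
        System.out.println("parked thread state: " + observe());
    }
}
```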
Simulating the TIMED_WAITING state
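- Translation of the TIMED_WAITING Javadoc: thread state for a waiting thread with a specified waiting time. A thread enters the timed waiting state by calling one of the following methods with a specified positive waiting time:
1.1 Thread.sleep
1.2 Object.wait with timeout
1.3 Thread.join with timeout
1.4 LockSupport.parkNanos
1.5 LockSupport.parkUntil
- Scenario 1: Thread.sleep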
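2.1 The simulation code is as follows:

    /**
     * Causes the currently executing thread to sleep (temporarily cease execution)
     * for the specified number of milliseconds, subject to the precision and accuracy
     * of system timers and schedulers. The thread does not lose ownership of any monitors.
     */
    @Test
    public void testSleep() {
        Object monitor = new Object();
        Thread thread1 = new Thread(() -> {
            synchronized (monitor) {
                try {
                    // Sleeps for 10 seconds while still owning the monitor
                    Thread.sleep(1000 * 10);
                } catch (InterruptedException e) {
                    System.out.println("Caught InterruptedException");
                    throw new RuntimeException(e);
                }
                System.out.println("Running thread1");
            }
        }, "thread1");
        Thread thread2 = new Thread(() -> {
            synchronized (monitor) {
                System.out.println("Running thread2");
            }
        }, "thread2");
        thread1.start();
        thread2.start();
        // Put a breakpoint here to pause the main thread and inspect the thread states
        System.out.println("Pause the main thread at a breakpoint");
    }

2.2 thread1 acquires ownership of the monitor and calls sleep, which puts the current thread to sleep without giving up ownership of the monitor. The state of thread1 is therefore TIMED_WAITING, and the state of thread2 is BLOCKED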
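The same pair of states can be checked without a debugger by polling Thread.getState(). This is a sketch under assumptions: SleepHoldsMonitorDemo, awaitState and observe are hypothetical names, the second thread is started only after the first one is asleep so that the lock order is fixed, and the sleep is cut short with interrupt() once both states have been read.

```java
class SleepHoldsMonitorDemo {
    /** Polls until the thread reports the expected state or the timeout expires; returns the last state seen. */
    static Thread.State awaitState(Thread t, Thread.State expected, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        Thread.State state = t.getState();
        while (state != expected && System.nanoTime() < deadline) {
            Thread.yield();
            state = t.getState();
        }
        return state;
    }

    /** Returns {state of the sleeping monitor owner, state of the thread contending for the monitor}. */
    static Thread.State[] observe() {
        final Object monitor = new Object();
        Thread sleeper = new Thread(() -> {
            synchronized (monitor) {
                try {
                    Thread.sleep(10_000); // keeps the monitor while sleeping
                } catch (InterruptedException e) {
                    // interrupted by the observer once the states were read; just leave the block
                }
            }
        }, "sleeper");
        Thread contender = new Thread(() -> {
            synchronized (monitor) {
                // nothing to do; entering the block is the point
            }
        }, "contender");

        sleeper.start();
        Thread.State sleeperState = awaitState(sleeper, Thread.State.TIMED_WAITING, 5_000);
        contender.start(); // started only now, so the sleeper is guaranteed to own the monitor
        Thread.State contenderState = awaitState(contender, Thread.State.BLOCKED, 5_000);

        sleeper.interrupt(); // end the sleep early so the demo finishes quickly
        try {
            sleeper.join();
            contender.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new Thread.State[] { sleeperState, contenderState };
    }

    public static void main(String[] args) {
        Thread.State[] states = observe();
        System.out.println("sleeper: " + states[0] + ", contender: " + states[1]);
    }
}
```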
How to troubleshoot after a thread-count alert
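- The maximum number of threads is affected by three factors: the size of the JVM heap (-Xmx, -Xms), the size of each thread's stack (-Xss), and the operating system's limit on the number of threads that can be created. Ignoring the system limit, it can be estimated with this formula:
number of threads = (memory available on the machine - heap memory allocated to the JVM) / value of Xss
- Once the thread count exceeds the threshold, use the thread -b command of arthas to find the thread that is currently blocking other threads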
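To make the estimation formula above concrete, here is a small worked example. The numbers (8 GiB of available memory, a 4 GiB heap, a 1 MiB thread stack) are purely illustrative, and ThreadCountEstimate and estimateMaxThreads are hypothetical names. Like the formula itself, the sketch ignores the operating system limit and any other native memory the JVM uses.

```java
class ThreadCountEstimate {
    /** (available memory - heap memory) / stack size per thread, with all arguments in bytes. */
    static long estimateMaxThreads(long availableBytes, long heapBytes, long stackBytesPerThread) {
        if (stackBytesPerThread <= 0) {
            throw new IllegalArgumentException("stack size must be positive");
        }
        return Math.max(0L, (availableBytes - heapBytes) / stackBytesPerThread);
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024L * 1024L;
        long mib = 1024L * 1024L;
        // Illustrative values only: 8 GiB available, -Xmx4g, -Xss1m
        System.out.println(estimateMaxThreads(8 * gib, 4 * gib, mib)); // prints 4096
    }
}
```

With these values, (8 GiB - 4 GiB) / 1 MiB gives 4096, so a larger heap or a larger -Xss both lower the estimate.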
Closing remarks
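In my view, thread information is collected in order to monitor the health of the system. When the thread count reaches a preset threshold, the thread stacks at that moment should be printed and saved (to preserve the scene) and an alert should be sent promptly, so that the problem can be located more easily afterwards.
(I could not find such a setting in skywalking; pointers from anyone who knows better are welcome.)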
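The idea of saving the stacks once a threshold is reached could look roughly like the following, using the JDK's standard ThreadMXBean. This is not skywalking functionality; ThreadDumpOnThreshold and dumpIfAbove are hypothetical names, and where the dump is written and how the alert is sent are left to the caller.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class ThreadDumpOnThreshold {
    /** Returns a text dump of all live threads if their count exceeds the threshold, otherwise null. */
    static String dumpIfAbove(int threshold) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (threads.getThreadCount() <= threshold) {
            return null;
        }
        StringBuilder dump = new StringBuilder();
        // false, false: skip locked monitor and synchronizer details to keep the dump cheap
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            dump.append('"').append(info.getThreadName()).append("\" ")
                .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                dump.append("    at ").append(frame).append('\n');
            }
            dump.append('\n');
        }
        return dump.toString();
    }

    public static void main(String[] args) {
        String dump = dumpIfAbove(0); // threshold 0 forces a dump for demonstration
        if (dump != null) {
            System.out.print(dump); // a real setup would write this to a file and raise an alert
        }
    }
}
```

The frames are formatted by hand here because ThreadInfo.toString() only prints a limited number of stack frames.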