Linux Load Average

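This article examines the Linux load average in detail: what it is, how it is calculated, and how it relates to performance analysis, including a comparison with other performance metrics.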

Not Your Average Average [1]

Dr. Neil J. Gunther
Performance Dynamics Company SM
Castro Valley, California, USA
www.perfdynamics.com

Originally presented Sep 4, 2002
Updated by NJG Feb 21, 2003



What's This Talk About?

Averages are important for performance analysis and capacity planning. There are many manifestations of averages e.g., arithmetic average (the usual one), moving average (used in financial forecasting), geometric average (used in the SPEC benchmarks), harmonic average (not used enough), and so on.

Other averages are taken over time i.e., time-dependent averages. A particular example of such a time-dependent average is the load average metric that appears in certain UNIX (and therefore Linux) commands. Have you ever wondered how those three little numbers are produced?

In this presentation, I shall start at the surface (the shell) and gradually submerge into the depths of the Linux kernel to find out how the Linux load average gets calculated.

Finally, I'll compare the load average with other averaging techniques used in performance analysis and capacity planning.


What is the Load Average?

Appears in the ASCII output of certain UNIX commands ...
[pax:~]% uptime 
    9:40am  up 9 days, 10:36,  4 users,  load average: 0.02, 0.01, 0.00 

And on Linux systems ...

[pax:~]% procinfo
    Linux 2.0.36 (root@pax) (gcc 2.7.2.3) #1 Wed Jul 25 21:40:16 EST 2001 [pax]
    
    Memory:      Total        Used        Free      Shared     Buffers      Cached
    Mem:         95564       90252        5312       31412       33104       26412
    Swap:        68508           0       68508
    
    Bootup: Sun Jul 21 15:21:15 2002    Load average: 0.15 0.03 0.01 2/58 8557
    ...

Three numbers: the 1-, 5-, and 15-minute averages of ... what?


How the Gurus Define LOAD ...

Man Pages (oops!)
        [pax:~]% man "load average"
        No manual entry for load average
Tim O'Reilly and Crew, p.726

The load average tries to measure the number of active processes at any time. As a measure of CPU utilization, the load average is simplistic, poorly defined, but far from useless.

Adrian Cockcroft, p.229

The load average is the sum of the run queue length and the number of jobs currently running on the CPUs. In Solaris 2.0 and 2.2 the load average did not include the running jobs but this bug was fixed in Solaris 2.3.


Graphical Display of Load Average

Can be displayed as a time series

[Figure: LAdaily.gif, daily load-average time series]

like that produced by ORCA.


What is an "Average" Load?

Tim O'Reilly and Crew

What's high? ... Ideally, you'd like a load average under, say, 3, ... Ultimately, 'high' means high enough so that you don't need uptime to tell you that the system is overloaded.

... different systems will behave differently under the same load average. ... running a single cpu-bound background job .... can bring response to a crawl even though the load avg remains quite low.

Blair Zajac (ORCA Author)

If long term trends indicate increasing figures, more or faster CPUs will eventually be necessary unless load can be displaced. For ideal utilization of your CPU, the maximum value here should be equal to the number of CPUs in the box.

Hence the hedging: the load average is not your average kind of average. It's a time-dependent average ... a damped time-dependent average.

But you're a Linux expert and you knew this already. Right?
Let's find out ...


"The LA Triplets" Quiz

Random Samples
In each of these samples:

        A. load average:  6.85,  7.37, 7.83
        B. load average:  8.50, 10.93, 8.61
        C. load average: 37.34,  9.47, 3.30

is the load:

  1. Increasing
  2. Decreasing
  3. Stationary
  4. Can't decide

Sequential Samples
Here are some load averages monitored in sequence by sampling them over a 5-hour period (e.g., using the uptime command) at each of the times shown in the left-most column.

         8:00am  load average: 1.21   0.81  0.13
         8:10am  load average: 37.34  9.47  3.30
         8:50am  load average: 19.21 16.02  7.40
         9:15am  load average: 13.92 15.13  8.18
         9:40am  load average: 10.51 13.50  8.47
        10:30am  load average:  8.50 10.93  8.61
        11:00am  load average:  8.15  9.84  8.55
        11:20am  load average:  7.72  9.20  8.44
         1:00pm  load average:  6.85  7.37  7.83
Imagine a sysadmin running the uptime command at those wall-clock times.

In which LA sample does maximum load occur?

  1. LA sample taken at 9:15am
  2. LA sample taken at 8:50am
  3. LA sample taken at 11:00am
  4. LA sample taken at 10:30am

Excluding the first LA sample at 8am, in which sample does the least load occur?

  1. LA sample taken at 8:10am
  2. LA sample taken at 11:20am
  3. LA sample taken at 1:00pm

Visual Hints
Numeric triples are convenient for computers but hard on sysadmins.
The following diagram shows graphically the load-average triplet sampled 10 minutes into the sequence above (8:10am).

[Figure: trip010.gif, one LA triplet plotted as three points]

  
The 3 dots correspond to the 3 numeric LA values. The y-axis shows the load values and the x-axis shows a range of time between 1 and 15 minutes. The left-most point represents the 1-minute load average, the middle point represents the 5-minute load average and the right-most the 15-minute load average.

Here is an animation of the above sequence.

[Figure: tripani.gif, animation of the LA triplet sequence]

End of Quiz


Simple Experiment

Two hot loops were started in the background on a single-CPU Linux box. The test had two phases over the course of 1 hour:
  • CPU pegged for 2100 seconds, then the processes were killed.
  • CPU quiescent for the remaining 1500 seconds.

A Perl script sampled the load average every 5 minutes using uptime.


Experimental Results [2]

[Figure: LALinuxTest.gif, measured 1-, 5-, and 15-minute load averages during the 1-hour test]
  • 1-minute LA reaches 2.0 about 300 seconds into the test
  • 5-minute LA reaches 2.0 around 1200 seconds
  • 15-minute LA would reach 2.0 at 4500 seconds (but the processes were killed at 2100 seconds)

(Resembles the charging/discharging of an RC circuit)


Into the Depths ...

http://lxr.linux.no/source/kernel/...
        
        unsigned long avenrun[3];               /* 1-, 5-, 15-minute averages (fixed-point) */

        static inline void calc_load(unsigned long ticks)
        {
                unsigned long active_tasks; /* fixed-point */
                static int count = LOAD_FREQ;   /* ticks remaining until the next sample */

                count -= ticks;
                if (count < 0) {                /* a LOAD_FREQ period has elapsed */
                        count += LOAD_FREQ;
                        active_tasks = count_active_tasks();
                        CALC_LOAD(avenrun[0], EXP_1, active_tasks);
                        CALC_LOAD(avenrun[1], EXP_5, active_tasks);
                        CALC_LOAD(avenrun[2], EXP_15, active_tasks);
                }
        }

LOAD_FREQ is defined as 5*HZ ticks, so a new sample is taken once every 5*HZ ticks. How often is that?


LA Sampling Interval

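Recall that HZ is the number of timer ticks per second (100 on x86 kernels of this era):
        HZ      =   100 ticks  (1 second)
        5*HZ    =   500 ticks
Therefore:

        1 tick  =    10 milliseconds
      500 ticks =  5000 milliseconds (or 5 seconds)

So LOAD_FREQ = 5*HZ means that CALC_LOAD is applied once every 5 seconds.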

Don't confuse this period with the reporting periods {1-, 5-, 15-} minutes.
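The arithmetic above can be sketched in a few lines. This assumes only the definition LOAD_FREQ = 5*HZ ticks shown in the kernel fragment and that HZ ticks make one second. Both helper names are hypothetical. Under that definition, the 5-second interval does not depend on the value of HZ.

```c
/* Hypothetical helpers illustrating LOAD_FREQ = 5*HZ ticks,
 * where HZ is the number of timer ticks per second. */

/* Length of one timer tick in milliseconds. */
unsigned long tick_ms(unsigned long hz)
{
    return 1000UL / hz;
}

/* Milliseconds between CALC_LOAD updates: LOAD_FREQ ticks converted to ms. */
unsigned long load_update_interval_ms(unsigned long hz)
{
    unsigned long load_freq_ticks = 5UL * hz;   /* LOAD_FREQ = 5*HZ */
    return load_freq_ticks * 1000UL / hz;       /* ticks -> milliseconds */
}
```

With HZ = 100, tick_ms(100) gives 10 and load_update_interval_ms(100) gives 5000. Under the same definition, HZ = 1000 would also give 5000, because LOAD_FREQ scales with HZ.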


LA Calculations

CALC_LOAD is a C macro defined in this code fragment:
        extern unsigned long avenrun[];         /* Load averages */

        #define FSHIFT          11              /* nr of bits of precision */
        #define FIXED_1         (1<<FSHIFT)     /* 1.0 as fixed-point */
        #define LOAD_FREQ       (5*HZ)          /* 5 sec intervals */
        #define EXP_1           1884            /* 1/exp(5sec/1min) as fixed-point */
        #define EXP_5           2014            /* 1/exp(5sec/5min) */
        #define EXP_15          2037            /* 1/exp(5sec/15min) */

        #define CALC_LOAD(load,exp,n) \
                load *= exp; \
                load += n*(FIXED_1-exp); \
                load >>= FSHIFT;

There are two points of interest here:

  1. What does CALC_LOAD actually do?
  2. What are the magic numbers: 1884, 2014, 2037?


Fixed Point Factors

Take the 1-minute average as an example. Converting exp(5/60) into base 2 with 11 bits of precision (i.e., scaling by FIXED_1 = 2^11) can be calculated as:

$$ e^{5/60} \;\Rightarrow\; 2^{11}\, e^{5/60} $$

But EXP_R represents the reciprocal, exp(-5/(60R)), in the same fixed-point form.

Calculate the magic numbers directly from the formula:

$$ \mathrm{EXP\_R} = \frac{2^{11}}{2^{\,5\log_2(e)/(60R)}} = 2^{11}\, e^{-5/(60R)} $$

where R = {1, 5, 15} minute reporting periods.


Magic Numbers

Magic numbers for the 5-second sampling interval:

        R (min)    EXP_R      Rounded
        1          1884.25    1884
        5          2014.15    2014
        15         2036.65    2037
which agree with the kernel comments ...
        #define EXP_1           1884            /* 1/exp(5sec/1min)  */
        #define EXP_5           2014            /* 1/exp(5sec/5min)  */
        #define EXP_15          2037            /* 1/exp(5sec/15min) */

If the sampling interval were shortened to 2 seconds ...

        R (min)    EXP_R      Rounded
        1          1980.86    1981
        5          2034.39    2034
        15         2043.45    2043


What does CALC_LOAD do?

Consider the 1-minute CALC_LOAD macro:
        #define CALC_LOAD(load,exp,n) \
                load *= exp; \
                load += n*(FIXED_1-exp); \
                load >>= FSHIFT;
It's the fixed-point arithmetic version of:

$$ \mathrm{load}(t) = \mathrm{load}(t-1)\, e^{-5/(60R)} + n(t)\,\bigl(1 - e^{-5/(60R)}\bigr) \qquad (1) $$

where n(t) is the number of active processes at sample t and R is the reporting period in minutes (R = 1 here).


Special Case: n(t) = 0

Substituting n(t) = 0 into Eqn. (1):

$$ \mathrm{load}(t) = \mathrm{load}(t-1)\, e^{-5/(60R)} \qquad (2) $$

[Figure: LAFall.gif, exponential decay of the load average]

Applied at every 5-second sample, Eqn. (2) produces exponential decay of the type we saw in the experiment after 2100 seconds.


Special Case: n(t) = 2


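Starting from an idle system, load(0) = 0, the second term of Eqn. (1) dominates. Iterating over t sampling intervals (5t seconds) gives:

$$ \mathrm{load}(t) = 2\,\bigl(1 - e^{-5t/(60R)}\bigr) \qquad (3) $$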

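[Figure: LARise.gif, exponential rise of the load average toward 2]

Eqn. (3) increases monotonically toward 2. Its time constant is 60R seconds, so for the 1-minute average τ₁ = 1 minute. The rise time is therefore ≈ 5τ₁ = 5 minutes (300 seconds), as observed in the experiment.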


Exponential Smoothing/Filtering


A general-purpose way of preparing highly variable data.
Available in tools like Excel, R/S+, and Mathematica.

The general form of the smoothed data is:

$$ \underbrace{Y(t)}_{\text{smoothed}} = Y(t-1) + \underbrace{\alpha}_{\text{damping}}\,\bigl[\,\underbrace{X(t)}_{\text{raw}} - Y(t-1)\,\bigr] \qquad (4) $$

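By comparison, rearranging Eqn. (1) gives the LA form:

$$ \mathrm{load}(t) = \mathrm{load}(t-1) + \bigl(1 - \mathrm{EXP\_R}\bigr)\,\bigl[\,n(t) - \mathrm{load}(t-1)\,\bigr] \qquad (5) $$

Here EXP_R means the normalized factor e^{-5/(60R)}, i.e., the kernel constant divided by FIXED_1. Eqn. (5) is equivalent to (4) if EXP_R = 1 - α.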
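A minimal sketch of this equivalence follows. Generic exponential smoothing, Eqn. (4), with α = 1 - e gives the same numbers as the load-average update of Eqn. (1), where e = exp(-5/(60R)). The function names are illustrative only.

```c
/* Generic exponential smoothing, Eqn. (4): Y(t) = Y(t-1) + alpha*[X(t) - Y(t-1)]. */
double exp_smooth(double y_prev, double x_raw, double alpha)
{
    return y_prev + alpha * (x_raw - y_prev);
}

/* Load-average update, Eqn. (1): load(t) = load(t-1)*e + n(t)*(1 - e). */
double la_update(double load_prev, double n, double e)
{
    return load_prev * e + n * (1.0 - e);
}
```

Feeding both functions the same input sequence, with e = 1884/2048 and alpha = 1 - e, produces the same values up to floating-point rounding.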


Relation to Other Averages


So EXP_R plays the role of a damping factor in the UNIX LA, through α = 1 - EXP_R.

Moving Average (MA): arithmetic average with lag k (see shortly).

Load Average (LA): exponentially damped MA (Exp-MA).

        EXP_R     α_R (damping)     1 - α_R
        EXP_1     0.0800 (≈ 8%)     0.9200
        EXP_5     0.0165 (≈ 2%)     0.9835
        EXP_15    0.0055 (≈ 0.5%)   0.9945

where α_R = 1 - exp(-5/(60R)).


Steady-State Averages


[Figure: LAdaily.gif, daily load-average time series]

Look at the load over a long time (t → ∞) and break the time series into a set of columns.
  • Δt: column width
  • Q(Δt) × Δt: sub-area
  • Σ Q(Δt) × Δt: total area

The time-averaged queue length over the total observation time T is:

$$ \bar{Q} = \frac{1}{T} \sum Q(\Delta t)\,\Delta t $$
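The column-area construction is just a time-weighted average. Below is a minimal sketch. It assumes the load has been recorded as piecewise-constant values q[i], each held for dt[i] seconds; this data layout is hypothetical.

```c
/* Time-averaged queue length: Qbar = (sum of q[i] * dt[i]) / T,
 * where T is the sum of the column widths dt[i]. */
double time_averaged_queue(const double *q, const double *dt, int n)
{
    double area = 0.0, total_t = 0.0;
    for (int i = 0; i < n; i++) {
        area += q[i] * dt[i];   /* sub-area of column i  */
        total_t += dt[i];       /* accumulates T         */
    }
    return (total_t > 0.0) ? area / total_t : 0.0;
}
```

Take a queue length of 1 held for 30 seconds followed by 3 held for 10 seconds. This gives (30 + 30) / 40 = 1.5.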


Model of Run-Queue


Steady-state averages:
  • N: running processes
  • Z: sleeping processes (average sleep time)
  • X: throughput
  • D: CPU service time (in ticks)
  • R: total execution time

[Figure: RunQueue.jpg, run-queue model]

$$ R = \frac{N}{X} - Z \qquad \text{(response time)} $$

$$ Q = X R \qquad \text{(Little's law)} $$
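Both laws are simple enough to check by hand; this sketch only restates them in C. The input values in the example are made up for illustration. N = 10 processes, X = 2 completions per second, and Z = 3 seconds of sleep give R = 10/2 - 3 = 2 seconds. Then Q = 2 × 2 = 4 processes at the CPU (running or waiting) on average.

```c
/* Response-time law for a closed system: R = N/X - Z. */
double response_time(double n_procs, double throughput, double z_sleep)
{
    return n_procs / throughput - z_sleep;
}

/* Little's law: Q = X * R. */
double littles_law_q(double throughput, double r_resp)
{
    return throughput * r_resp;
}
```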

This is the kind of model I used in my previous LUV talk (July 11, 2000), in which I analyzed the average performance metrics associated with a fair-share scheduler.

The same kinds of averages are used in my performance analyzer tool, Pretty Damn Quick (PDQ).


Hyper-growth Website Planning


The problem:
  1. What is the growth rate?
  2. Forecast back-end capacity requirements

Published in: Performance Engineering: State of the Art and Current Trends, Springer Lecture Notes in Computer Science, 2001.

Download a copy from www.perfdynamics.com/papers.html


The General Approach


  • Sample time series data
  • Apply MA (or Exp-MA) to remove variance effects
  • Forecast using nonlinear regression
  • Scalability projections (See Refs. 1 & 2)


Sample Time Series


Total CPU utilization on the back-end server (E10K).

[Figure: eBayDaily.gif, daily total CPU utilization on the back-end server]

  
Data was collected using the SE Toolkit's percollator, which is similar to ORCA's orcallator.


Apply Moving Averages


[Figure: eBayMAs.gif, moving averages applied to the CPU utilization data]

Projected Growth Rate


[Figure: eBayGrowth.gif, projected growth rate]
  
Doubling time ≈ 6 months!

Week 20 was Y2K.


Quiz Solutions


Here are the solutions to the quiz given earlier.

Time series: this is the original time series during the 300 minutes in which the samples were collected.

[Figure: laRawData.gif, raw load over the 300 minutes]

Load averages: a plot of the load averages over the 300 minutes.

[Figure: laSeries.gif, 1-, 5-, and 15-minute load averages over the 300 minutes]

An Easier Way?
Just reverse the time axis. As described in the Visual Hints section of the quiz, the 3 dots correspond to the 3 numeric LA values and the y-axis shows the load values. But here, the x-axis shows a range of time between -15 and 0 minutes. The left-most point now represents the 15-minute load average, the middle point the 5-minute load average, and the right-most the 1-minute load average. This layout more closely reflects the trend in time.

[Figure: trip-all.gif, all LA triplets plotted with the time axis reversed]

  1. Random Samples
    1. Sample A: Decreasing
    2. Sample B: Stationary
    3. Sample C: Increasing

  2. Sequential Samples
    1. Maximum: 8:50 am
    2. Minimum: 1:00 pm
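The "reverse the time axis" reading can be turned into a rough rule of thumb: compare the 1-minute value with the 15-minute value. This heuristic and its 5% tolerance are assumptions for illustration. It ignores the 5-minute value and cannot tell where a peak occurred. It does, however, reproduce the answers for samples A, B, and C.

```c
/* Hypothetical trend heuristic for an LA triplet, judged from its oldest
 * (15-minute) and newest (1-minute) values. Returns +1 increasing,
 * -1 decreasing, 0 stationary (difference within rel_tol of la15). */
int la_trend(double la1, double la15, double rel_tol)
{
    double scale = (la15 > 0.0) ? la15 : 1.0;   /* absolute threshold rel_tol if la15 is 0 */
    double diff = la1 - la15;
    if (diff > rel_tol * scale)
        return +1;
    if (diff < -rel_tol * scale)
        return -1;
    return 0;
}
```

la_trend(6.85, 7.83, 0.05) returns -1 (sample A), la_trend(8.50, 8.61, 0.05) returns 0 (sample B), and la_trend(37.34, 3.30, 0.05) returns +1 (sample C).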


Further Reading

  1. N. J. Gunther, The Practical Performance Analyst, Print-On-Demand, iUniverse.com, Inc., Lincoln, Nebraska, 2000.
  2. N. J. Gunther, "Performance and Scalability Models for a Hypergrowth e-Commerce Web Site," in Performance Engineering: State of the Art and Current Trends, R. Dumke, C. Rautenstrauch, A. Schmietendorf, and A. Scholz (eds.), LNCS 2047, pp. 267-282, Springer-Verlag, Heidelberg, 2001.
  3. J. Peek, T. O'Reilly, and M. Loukides, UNIX Power Tools, 2nd edn., O'Reilly & Assoc. Inc., Sebastopol, California, 1997.
  4. D. P. Bovet and M. Cesati, Understanding the Linux Kernel, O'Reilly & Assoc. Inc., Sebastopol, California, 2001.
  5. A. Cockcroft and R. Pettit, Sun Performance and Tuning, 2nd edn., SunSoft Press, Mountain View, California, 1998.


Want to Know More?

Guerrilla Capacity Planning
May, August 2003

Guerrilla Capacity Tools
November 2003

Then ... Go forth and Kong-ka!


Footnotes:

1 Copyright © 2002 - 2003 Performance Dynamics Company. All Rights Reserved.

2 Thanks to Mirko Fluher for letting me use pax.apana.org.au


