Isolating Linux High System Load

This article covers an approach to diagnosing high load on unresponsive servers: tools such as uptime, dmesg, and vmstat help identify CPU and disc utilization problems, and iostat, iotop, and top help track down the source of heavy I/O activity.

by tummy.com

There are typically two reasons that a server will show high load and become unresponsive: CPU and disc utilization. On rare occasions it's something like a hardware error causing a disc to become unresponsive. There are some great tools for tracking and isolating these issues, as long as you know how to interpret the results.

Uptime

The first thing I'll do when I can get logged into a non-responsive system is run "uptime". If the load is high, I know I need to start digging into things with the other tools. Uptime gives you 3 numbers which indicate the 1, 5, and 15 minute load averages. From this you can tell if the load is trending up, holding steady, or going down:

guin:~$ uptime
 13:26:32 up 1 day, 16:52, 21 users,  load average: 0.00, 0.14, 0.15
guin:~$

On my laptop, the load is fairly low and trending down (1 minute average of 0.00, 5 minute of 0.14, and over 15 minutes it's 0.15).

database:~ # uptime
 12:29pm  up 1 day 13:29,  1 user,  load average: 0.84, 0.82, 0.80
database:~ #

On this database server the load is somewhat low (it's a quad CPU box, so I wouldn't consider it saturated until it was around 4).
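If you don't know off-hand how many CPUs a box has, a quick check (the grep works anywhere /proc is mounted; nproc needs a reasonably recent coreutils) is:

# Both report the number of CPUs the kernel sees
grep -c ^processor /proc/cpuinfo
nproc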

Dmesg

It's also useful to look at the bottom of the "dmesg" output. Usually it isn't particularly revealing, but in the case of hardware errors or the out-of-memory (OOM) killer it can very quickly reveal a problem.
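A quick way to look at just the recent messages (the tail count and grep pattern below are just starting points; the exact wording of OOM-killer and I/O error messages varies by kernel version):

# Show the last 30 kernel messages
dmesg | tail -30
# Hunt for likely trouble
dmesg | egrep -i 'error|out of memory'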

Vmstat

Next I will often run "vmstat 1", which prints out statistics every second on the system utilization. The first line is the average since the system was last booted:

denver-database:~ # vmstat 1
procs ---------memory---------- --swap-- --io--- -system-- -----cpu-----
 r  b swpd   free   buff  cache  si  so  bi  bo   in   cs  us sy id wa st
 0  0  116 158096 259308 3083748   0   0  47  39   30   58 11  8 76  5  0
 2  0  116 158220 259308 3083748   0   0   0   0 1706 4899 22 14 64  0  0
 1  0  116 158220 259308 3083748   0   0   0 276 1435 1490  4  2 93  0  0
 0  0  116 158220 259308 3083748   0   0   0   0 1502 1569  5  3 92  0  0
 0  0  116 158220 259308 3083748   0   0   0 892 1394 1529  2  1 97  0  0
 1  0  116 158592 259308 3083748   0   0   0 216 1702 1825  8  7 84  1  0
 0  0  116 158344 259308 3083748   0   0   0 368 1465 1461  8  7 84  0  0
 0  0  116 158344 259308 3083748   0   0   0 940 1992 2115  2  2 95  0  0
 0  0  116 158344 259308 3083748   0   0   0 240 1906 1982  6  7 87  0  0

The first thing I'll look at here is the "wa" column: the amount of CPU time spent waiting on I/O. If this is high you almost certainly have something hitting the disc hard.

If the "wa" is high, the next thing I'd look at is the "swap" columns "si" and "so". If these are much above 0 on a regular basis, it probably means you're out of memory and the system is swapping. Since RAM is around a million times faster than a hard drive (10ns instead of 10ms), heavy swapping can cause the system to really grind to a halt. Note, however, that some swapping, particularly swapping out, is normal.
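To see how much swap is actually in use, and how it is spread across swap devices, a couple of quick checks:

# The "Swap:" line shows total/used/free, in megabytes
free -m
# Per-device swap usage
swapon -s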

Next I'd look at the "id" column under "cpu" for the amount of idle CPU time. If this is around 0, the CPU is heavily used. If so, the "sy" and "us" columns tell us how much time is being used by the kernel and by user-space processes.

If CPU "sy" time is high, this can often indicate that there are some large directories (say, a user's "spam" mail directory) with hundreds of thousands or millions of entries, or other large directory trees. Another common cause of high "sy" CPU time is the system firewall: iptables. There are other causes, of course, but these seem to be the primary ones.
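One rough way to hunt for oversized directories is to look for directory inodes that have grown huge; a directory with hundreds of thousands of entries is typically megabytes in size itself. The paths and size threshold here are just examples to adjust for your systems:

# Directories whose own inode is over ~1MB usually have an enormous number of entries
find /var/spool /home -type d -size +1024k -exec ls -ldh {} \; 2>/dev/null
# A very long iptables ruleset can also burn a lot of system CPU time
iptables -L -n | wc -l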

If CPU "us" is high, that's easy to track down with "top".

ps awwlx --sort=vsz

If there is swapping going on I like to look at the big processes via "ps awwlx --sort=vsz". This shows processes sorted by virtual size (which includes shared libraries, but also counts blocks swapped out to disc).
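Since the sort is ascending, the biggest processes end up at the bottom of the output, so I usually just look at the last few lines:

ps awwlx --sort=vsz | tail -15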

Iostat

For systems where there is a lot of I/O activity (shown by "bi" and "bo" being high while "si" and "so" are low), iostat can tell you more about which hard drives the activity is happening on and how heavily they are utilized. Normally I will run "iostat -x 5", which prints out updated stats every 5 seconds:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.64    0.00    3.95    0.30    0.00   90.11

Device: rrqm/s wrqm/s   r/s   w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda       0.00   9.60  0.60  2.40   6.40  97.60    34.67     0.01  4.80  4.80  1.44

I'll first look at the "%util" column; if it's approaching 100%, then that device is being hit hard. In this case we only have one device, so I can't use this to isolate where the heavy activity might be happening, but if the database were on its own partition that could help track it down.

"await" is a very useful column: it tells us how long the device takes to service a request. If this gets high, it probably indicates that the device is saturated.

Other information iostat gives can tell us whether the activity is reads or writes, and whether the requests are small or large (based on the sectors-per-second rates and the number of reads and writes per second).
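As a rough worked example from the sample output above: (6.40 + 97.60) sectors per second divided by (0.60 + 2.40) requests per second gives the avgrq-sz of about 34.67 sectors, and at 512 bytes per sector that's roughly 17KB per request, i.e. mostly smallish writes.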

Iotop

This requires a fairly recent kernel (2.6.20 or newer), so it isn't something I tend to run very often: most of the systems I maintain are enterprise distros, which have older kernels. RHEL/CentOS 3/4/5 are too old, and Ubuntu Hardy doesn't ship iotop, but Lucid does.

iotop is like top, but it shows the processes that are doing heavy I/O. However, that is often a kernel process, so you still may not be able to tell exactly which user process is causing the I/O load. It's much better than what we had in the past, though.
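When iotop is available, the most useful invocation is usually to hide idle processes, and batch mode is handy for capturing a few samples to a log:

# -o/--only hides processes that aren't currently doing I/O
iotop -o
# Batch mode, three samples
iotop -b -n 3 -o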

Top

In the case of high user CPU time, "top" is a great tool for telling you what is using the most CPU.
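By default top sorts by CPU usage, so the offenders are right at the top. For a one-shot snapshot that's easy to paste into a ticket or email, batch mode works well:

# One iteration in batch mode, busiest processes first
top -b -n 1 | head -25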

Munin

Munin is a great tool that tracks long-term system performance trends. However, it's not something you can start using when you have a performance problem. It's the sort of thing you should set up on all your systems so that you can build up the historic usage and have it available when you need it.

It will give you extensive stats about CPU, disc, RAM, network, and other resources, and lets you see trends so you can determine that an upgrade will be needed in the coming months, rather than discovering that you needed one a few months ago. :-)

Conclusions

When performance problems hit, there are many great tools for helping to isolate and resolve them. Using these techniques I've been able to quickly and accurately identify and mitigate performance issues.
