Linux 性能诊断 perf使用指南

perf是Linux 2.6+内核中的一个性能诊断工具,利用Linux的trace特性进行实时跟踪和统计。它支持多种事件跟踪,包括内核、用户空间、块设备、网络等。perf工作原理涉及符号表、stack traces和动态跟踪。文章介绍了perf的使用,如perf list、perf record、perf report、perf annotate等,并提到了perf在没有适当配置(如CONFIG_KALLSYMS、CONFIG_FRAME_POINTER等)时可能遇到的问题。此外,文章还讨论了如何绘制perf火焰图和热力图以帮助分析性能瓶颈。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

标签

Linux , perf , 性能诊断 , stap , systemtap , strace , dtrace , dwarf , profiler , perf_events


背景

Linux在服务端已占据非常大的比例,很多业务很多服务都跑在Linux上面。

软件运行在Linux下,软件本身、以及Linux系统的性能诊断也成为热门的话题。

例如,你要如何回答这些问题

Why is the kernel on-CPU so much? What code-paths?  

Which code-paths are causing CPU level 2 cache misses?  

Are the CPUs stalled on memory I/O?  

Which code-paths are allocating memory, and how much?  

What is triggering TCP retransmits?  

Is a certain kernel function being called, and how often?  

What reasons are threads leaving the CPU?  

又或者你是一名DBA或者开发人员,想知道数据库在跑某些benchmark时,性能瓶颈在哪里,是IO,是等待,还是网络,代码瓶颈在哪里?

在Linux下诊断的工具比较多,比如systemtap, dtrace, perf。

本文将介绍一下perf的用法,网上很多叫法如perf_events , perf profiler , Performance Counters for Linux。叫法不同,都指perf。

什么是perf

perf是Linux 2.6+内核中的一个工具,在内核源码包中的位置 tools/perf。

perf利用Linux的trace特性,可以用于实时跟踪,统计event计数(perf stat);或者使用采样(perf record),报告(perf report|script|annotate)的使用方式进行诊断。

perf命令行接口并不能利用所有的Linux trace特性,有些trace需要通过ftrace接口得到。

参考 https://github.com/brendangregg/perf-tools

perf工作原理

pic

这张图大致列出了perf支持的跟踪事件,从kernerl到user space,支持块设备、网络、CPU、文件系统、内存等,同时还支持系统调用,用户库的事件跟踪。

你可以使用perf list输出当前内核perf 支持的预置events

perf list

List of pre-defined events (to be used in -e):

  ref-cycles                                         [Hardware event]

  alignment-faults                                   [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

.....略.......

  writeback:writeback_pages_written                  [Tracepoint event]
  writeback:writeback_queue                          [Tracepoint event]
  writeback:writeback_task_start                     [Tracepoint event]
  writeback:writeback_task_stop                      [Tracepoint event]

perf background

我们看到perf支持这么多的事件和trace,它依赖了很多的接口来干这件事情。

1. Symbols

没有符号表,无法将内存地址翻译成函数和变量名。

例如,无符号表的跟踪显示如下

    57.14%     sshd  libc-2.15.so        [.] connect           
               |
               --- connect
                  |          
                  |--25.00%-- 0x7ff3c1cddf29
                  |          
                  |--25.00%-- 0x7ff3bfe82761
                  |          0x7ff3bfe82b7c
                  |          
                  |--25.00%-- 0x7ff3bfe82dfc
                   --25.00%-- [...]

有符号表的跟踪显示如下

    57.14%     sshd  libc-2.15.so        [.] __GI___connect_internal
               |
               --- __GI___connect_internal
                  |          
                  |--25.00%-- add_one_listen_addr.isra.0
                  |          
                  |--25.00%-- __nscd_get_mapping
                  |          __nscd_get_map_ref
                  |          
                  |--25.00%-- __nscd_open_socket
                   --25.00%-- [...]

如何安装符号表?

对于内核代码的符号表,在编译内核时,使用CONFIG_KALLSYMS=y。 检查如下

# cat /boot/config-`uname -r` |grep CONFIG_KALLSYMS
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_EXTRA_PASS=y

对于用户安装软件的符号表,如果是yum安装的,可以安装对应的debuginfo包。

如果是用户自己编译的,例如使用GCC编译时加上-g选项。

2. perf annotate

perf annotate can generate sourcecode level information if the application is compiled with -ggdb.

3. Stack Traces (使用perf record -g收集stack traces)

要跟踪完整的stack,编译时需要注意几个东西。

Always compile with frame pointers. 
Omitting frame pointers is an evil compiler optimization that breaks debuggers, and sadly, is often the default. 

Without them, you may see incomplete stacks from perf_events, like seen in the earlier sshd symbols example. 

There are two ways to fix this: 
either using dwarf data to unwind the stack, or returning the frame pointers.

3.1 编译perf时包含libunwind和-g dwarf,需要3.9以上的内核版本。

Since about the 3.9 kernel, perf_events has supported a workaround for missing frame pointers in user-level stacks: libunwind, which uses dwarf. 

This can be enabled using "-g dwarf".

3.2 有些编译优化项会忽略frame pointer,所以编译软件时必须指定 -fno-omit-frame-pointer ,才能跟踪完整的stack trace.

The earlier sshd example was a default build of OpenSSH, which uses compiler optimizations (-O2), which in this case has omitted the frame pointer. 

Here's how it looks after recompiling OpenSSH with -fno-omit-frame-pointer

3.3 编译内核时包含 CONFIG_FRAME_POINTER=y

总结一下,要愉快的跟踪更完备的信息,就要在编译软件时打开符号表的支持(gcc -g),开启annotate的支持(gcc -ggdb),以及Stack trace的支持(gcc -fno-omit-frame-pointer)。

perf pre-defined event说明

Hardware [Cache] Events: 
  CPU相关计数器
  CPU周期、指令重试,内存间隔周期、L2CACHE miss等
  These instrument low-level processor activity based on CPU performance counters. 
  For example, CPU cycles, instructions retired, memory stall cycles, level 2 cache misses, etc. 
  Some will be listed as Hardware Cache Events.

Software Events: 
  内核相关计数器
  These are low level events based on kernel counters. 
  For example, CPU migrations, minor faults, major faults, etc.

Tracepoint Events: 
  内核ftrace框架相关,例如系统调用,TCP事件,文件系统IO事件,块设备事件等。  
  根据LIBRARY归类。如sock表示socket事件。  
  This are kernel-level events based on the ftrace framework. These tracepoints are placed in interesting and logical locations of the kernel, so that higher-level behavior can be easily traced. 
  For example, system calls, TCP events, file system I/O, disk I/O, etc. 
  These are grouped into libraries of tracepoints; 
    eg, "sock:" for socket events, "sched:" for CPU scheduler events.

Dynamic Tracing: 
  动态跟踪,可以在代码中的任何位置创建事件跟踪节点。很好很强大。  
  内核跟踪使用kprobe,user-level跟踪使用uprobe。  
  Software can be dynamically instrumented, creating events in any location. 
  For kernel software, this uses the kprobes framework. 
  For user-level software, uprobes.

Timed Profiling: 
  采样频度,按指定频率采样,被用于perf record。  
  Snapshots can be collected at an arbitrary frequency, using perf record -FHz. 
  This is commonly used for CPU usage profiling, and works by creating custom timed interrupt events.

了解了perf event后,我们可以更精细的,有针对性的对事件进行跟踪,采样,报告。

当然,你也可以不指定事件,全面采样。

build perf

例如centos你可以使用yum安装,也可以使用源码安装。

perf在内核源码的tools/perf中,所以下载与你的内核大版本一致的内核源码即可

uname -a 

wget https://cdn.kernel.org/pub/linux/kernel/v3.x/linux-3.10.104.tar.xz

tar -xvf linux-3.10.104.tar.xz

cd linux-3.10.104/tools/perf/

安装依赖库,有一个小窍门可以找到依赖的库

$cat Makefile |grep found

                msg := $(warning No libelf found, disables 'probe' tool, please install elfutils-libelf-devel/libelf-dev);
                msg := $(error No gnu/libc-version.h found, please install glibc-dev[el]/glibc-static);
                msg := $(warning No libdw.h found or old libdw.h found or elfutils is older than 0.138, disables dwarf support. Please install new elfutils-devel/libdw-dev);
        msg := $(warning No libunwind found, disabling post unwind support. Please install libunwind-dev[el] >= 0.99);
                msg := $(warning No libaudit.h found, disables 'trace' tool, please install audit-libs-devel or libaudit-dev);
                msg := $(warning slang not found, disables TUI support. Please install slang-devel or libslang-dev);
                msg := $(warning GTK2 not found, disables GTK2 support. Please install gtk2-devel or libgtk2.0-dev);
  $(if $(1),$(warning No $(1) was found))
                                                msg := $(warning No bfd.h/libbfd found, install binutils-dev[el]/zlib-static to gain symbol demangling)
                msg := $(warning No numa.h found, disables 'perf bench numa mem' benchmark, please install numa-libs-devel or libnuma-dev);

通常依赖 gcc make bison flex elfutils libelf-dev libdw-dev libaudit-dev python-dev binutils-dev

perf依赖的kernel宏

并不是每个开关都需要,但是有些没有就不方便或者功能缺失,例如没有打开符号表的话,看到的是一堆内存地址。

# for perf_events:
CONFIG_PERF_EVENTS=y
# for stack traces:
CONFIG_FRAME_POINTER=y
# kernel symbols:
CONFIG_KALLSYMS=y
# tracepoints:
CONFIG_TRACEPOINTS=y
# kernel function trace:
CONFIG_FTRACE=y
# kernel-level dynamic tracing:
CONFIG_KPROBES=y
CONFIG_KPROBE_EVENTS=y
# user-level dynamic tracing:
CONFIG_UPROBES=y
CONFIG_UPROBE_EVENTS=y
# full kernel debug info:
CONFIG_DEBUG_INFO=y
# kernel lock tracing:
CONFIG_LOCKDEP=y
# kernel lock tracing:
CONFIG_LOCK_STAT=y
# kernel dynamic tracepoint variables:
CONFIG_DEBUG_INFO=y

一些开关的用途介绍

1. Kernel-level symbols are in the kernel debuginfo package, or when the kernel is compiled with CONFIG_KALLSYMS.

2. The kernel stack traces are incomplete. Now a similar profile with CONFIG_FRAME_POINTER=y

3. 当我们使用perf record [stack traces (-g)]时,可以跟踪stack,但是如果内核编译时没有指定CONFIG_FRAME_POINTER=y,perf report时就会看到缺失的信息。

不包含CONFIG_FRAME_POINTER=y时

    99.97%  swapper  [kernel.kallsyms]  [k] default_idle
            |
            --- default_idle

     0.03%     sshd  [kernel.kallsyms]  [k] iowrite16   
               |
               --- iowrite16
                   __write_nocancel
                   (nil)

包含CONFIG_FRAME_POINTER=y时 (Much better -- the entire path from the write() syscall (__write_nocancel) to iowrite16() can be seen.)

    99.97%  swapper  [kernel.kallsyms]  [k] default_idle
            |
            --- default_idle
                cpu_idle
               |          
               |--87.50%-- start_secondary
               |          
                --12.50%-- rest_init
                          start_kernel
                          x86_64_start_reservations
                          x86_64_start_kernel

     0.03%     sshd  [kernel.kallsyms]  [k] iowrite16
               |
               --- iowrite16
                   vp_notify
                   virtqueue_kick
                   start_xmit
                   dev_hard_start_xmit
                   sch_direct_xmit
                   dev_queue_xmit
                   ip_finish_output
                   ip_output
                   ip_local_out
                   ip_queue_xmit
                   tcp_transmit_skb
                   tcp_write_xmit
                   __tcp_push_pending_frames
                   tcp_sendmsg
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值