Preface
Tracepoint programs are a typical class of eBPF program.
The book Linux Observability with BPF introduces tracepoint programs as follows:
This type of program allows you to attach BPF programs to the tracepoint handler provided by the kernel. Tracepoint programs are defined with the type BPF_PROG_TYPE_TRACEPOINT. As you’ll see in Chapter 4, tracepoints are static marks in the kernel’s codebase that allow you to inject arbitrary code for tracing and debugging purposes. They are less flexible than kprobes, because they need to be defined by the kernel beforehand, but they are guaranteed to be stable after their introduction in the kernel. This gives you a much higher level of predictability when you want to debug your system.
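In short, tracepoint programs are less flexible than kprobes, but the tracepoints they attach to are guaranteed to stay stable once they have been introduced into the kernel. Every tracepoint has to be defined by the kernel ahead of time, and the available ones are listed under /sys/kernel/debug/tracing/events.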
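To see which tracepoints a given kernel exposes, the events directory can be walked with a few lines of Python. This is a minimal sketch: the function name `list_tracepoints` is made up for illustration, and it assumes that each tracepoint appears as events/<category>/<event>/ with a `format` file inside, as in the paths used throughout this article. Reading the real directory usually requires root.

```python
import os


def list_tracepoints(events_dir="/sys/kernel/debug/tracing/events"):
    """Return sorted 'category:event' names found under events_dir.

    Assumes the layout <events_dir>/<category>/<event>/format.
    """
    found = []
    for category in sorted(os.listdir(events_dir)):
        category_path = os.path.join(events_dir, category)
        if not os.path.isdir(category_path):
            continue  # skip plain files that sit next to the categories
        for event in sorted(os.listdir(category_path)):
            # Only directories that contain a 'format' file are counted.
            if os.path.isfile(os.path.join(category_path, event, "format")):
                found.append("%s:%s" % (category, event))
    return found
```

Calling `list_tracepoints()` as root returns names such as `random:urandom_read` on kernels that provide that tracepoint.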
Next, we use BCC to build a tracepoint program.
The BCC program and a brief walkthrough
The program comes from the official bcc examples: examples/tracing/urandomread.py
from __future__ import print_function
from bcc import BPF

# load BPF program
b = BPF(text="""
TRACEPOINT_PROBE(random, urandom_read) {
    // args is from /sys/kernel/debug/tracing/events/random/urandom_read/format
    bpf_trace_printk("%d\\n", args->got_bits);
    return 0;
}
""")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "GOTBITS"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue
    print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))
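TRACEPOINT_PROBE(random, urandom_read)
: Instruments a kernel tracepoint. The two arguments are the category and the event name, matching the directory random/urandom_read under /sys/kernel/debug/tracing/events.
args->got_bits
: args is generated automatically from the tracepoint's format. The available fields differ from event to event; see below.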
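The mapping between the macro arguments and the format path can be made explicit with a small helper. The sketch below only builds strings; `format_path` and `printk_probe` are hypothetical names introduced here, not part of BCC. The generated text has the same shape as the embedded C in the example above and could be passed to `BPF(text=...)`.

```python
EVENTS_DIR = "/sys/kernel/debug/tracing/events"


def format_path(category, event, events_dir=EVENTS_DIR):
    """Path of the format file that describes args for category:event."""
    return "%s/%s/%s/format" % (events_dir, category, event)


def printk_probe(category, event, field, spec="%d"):
    """Return embedded C that prints a single args field with bpf_trace_printk."""
    template = (
        "TRACEPOINT_PROBE(%s, %s) {\n"
        "    // args is from %s\n"
        '    bpf_trace_printk("%s\\n", args->%s);\n'
        "    return 0;\n"
        "}\n"
    )
    return template % (category, event, format_path(category, event), spec, field)
```

For example, `printk_probe("random", "urandom_read", "got_bits")` reproduces the probe body of urandomread.py.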
Sample output:
The fields of the urandom_read tracepoint are defined as follows.
Running
cat /sys/kernel/debug/tracing/events/random/urandom_read/format
prints:
name: urandom_read
ID: 1239
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int got_bits; offset:8; size:4; signed:1;
field:int pool_left; offset:12; size:4; signed:1;
field:int input_left; offset:16; size:4; signed:1;
print fmt: "got_bits %d nonblocking_pool_entropy_left %d input_entropy_left %d", REC->got_bits, REC->pool_left, REC->input_left
As the last line shows, three fields can be printed: got_bits, pool_left and input_left.
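Reading the format file by eye works, but the field list is also easy to extract programmatically. The following sketch parses the `field:...; offset:...; size:...; signed:...;` lines shown above. The names `parse_format` and `event_fields` are illustrative, and the parser assumes every field line follows exactly that layout.

```python
import re

# Matches one line such as:
#   field:int got_bits; offset:8; size:4; signed:1;
FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*"
    r"size:(?P<size>\d+);\s*signed:(?P<signed>[01]);"
)


def parse_format(text):
    """Return one dict per field declared in a tracepoint format file."""
    fields = []
    for line in text.splitlines():
        match = FIELD_RE.search(line)
        if match is None:
            continue  # 'name:', 'ID:', 'format:' and 'print fmt:' lines
        decl = match.group("decl").strip()
        # The field name is the last token of the C declaration.
        ctype, _, name = decl.rpartition(" ")
        fields.append({
            "name": name.split("[")[0],  # drop an array suffix such as [4]
            "ctype": ctype,
            "offset": int(match.group("offset")),
            "size": int(match.group("size")),
            "signed": match.group("signed") == "1",
        })
    return fields


def event_fields(text):
    """Names of the event-specific fields, without the common_* header."""
    return [f["name"] for f in parse_format(text)
            if not f["name"].startswith("common_")]
```

Applied to the urandom_read format above, `event_fields` yields the three names that are reachable through `args`.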
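Applying the pattern to other tracepoints
The official example suggests a workflow for building a tracepoint program:
First, inspect the format of the event you want to trace. Here we use the tcp/tcp_receive_reset tracepoint as an example: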
cat /sys/kernel/debug/tracing/events/tcp/tcp_receive_reset/format
which prints:
name: tcp_receive_reset
ID: 1467
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:const void * skaddr; offset:8; size:8; signed:0;
field:__u16 sport; offset:16; size:2; signed:0;
field:__u16 dport; offset:18; size:2; signed:0;
field:__u8 saddr[4]; offset:20; size:4; signed:0;
field:__u8 daddr[4]; offset:24; size:4; signed:0;
field:__u8 saddr_v6[16]; offset:28; size:16; signed:0;
field:__u8 daddr_v6[16]; offset:44; size:16; signed:0;
field:__u64 sock_cookie; offset:64; size:8; signed:0;
print fmt: "sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c sock_cookie=%llx", REC->sport, REC->dport, REC->saddr, REC->daddr, REC->saddr_v6, REC->daddr_v6, REC->sock_cookie
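Here we choose to print the sport field. To emit several fields at once you can use a perf output buffer (BPF_PERF_OUTPUT); for simplicity, this article sticks with bpf_trace_printk.
Second, build the corresponding bcc program: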
#!/usr/bin/python
#
# tcp_receive_reset  Example of instrumenting the tcp:tcp_receive_reset
#                    kernel tracepoint. For Linux, uses BCC, BPF. Embedded C.
#
# Adapted from the bcc example urandomread.py.
#
# REQUIRES: Linux 4.7+ (BPF_PROG_TYPE_TRACEPOINT support) and a kernel that
# provides the tcp:tcp_receive_reset tracepoint.
#
# Test by running this, then cause this host to receive a TCP RST.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")

from __future__ import print_function
from bcc import BPF
from bcc.utils import printb

# load BPF program
b = BPF(text="""
TRACEPOINT_PROBE(tcp, tcp_receive_reset) {
    // args is from /sys/kernel/debug/tracing/events/tcp/tcp_receive_reset/format
    bpf_trace_printk("%u\\n", args->sport);
    return 0;
}
""")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "sport"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue
    except KeyboardInterrupt:
        exit()
    printb(b"%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))
Finally, run the program:
It runs successfully.
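As noted earlier, a perf output buffer is the usual way to emit more than one field. The sketch below sends both sport and dport to user space with BCC's BPF_PERF_OUTPUT. The struct name `reset_event_t`, the table name `events`, and the Python helpers `format_event` and `run` are illustrative choices, not part of the original example. It assumes bcc is installed, the script runs as root, and the kernel provides tcp:tcp_receive_reset with the fields listed above. The bcc import is deferred into `run()` so the module can be loaded on machines without bcc.

```python
BPF_TEXT = r"""
struct reset_event_t {
    u16 sport;
    u16 dport;
};

BPF_PERF_OUTPUT(events);

TRACEPOINT_PROBE(tcp, tcp_receive_reset) {
    struct reset_event_t ev = {};
    ev.sport = args->sport;
    ev.dport = args->dport;
    // hand the record to user space through the perf buffer
    events.perf_submit(args, &ev, sizeof(ev));
    return 0;
}
"""


def format_event(sport, dport):
    """Render one record as a single output line."""
    return "sport=%d dport=%d" % (sport, dport)


def run():
    """Load the program and print events until interrupted (needs root)."""
    from bcc import BPF  # imported lazily; requires the bcc Python bindings

    b = BPF(text=BPF_TEXT)

    def handle_event(cpu, data, size):
        # BCC builds a ctypes view of reset_event_t from the C definition.
        ev = b["events"].event(data)
        print(format_event(ev.sport, ev.dport))

    b["events"].open_perf_buffer(handle_event)
    while True:
        try:
            b.perf_buffer_poll()
        except KeyboardInterrupt:
            return
```

Calling `run()` from a root shell starts the tracer. Compared with bpf_trace_printk, each record arrives as a structured event, so nothing has to be parsed back out of a text line.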