Background
This post introduces the congestion control parameters that the open-source Linux driver exposes for Mellanox NICs.
Location
ls /sys/kernel/debug/mlx5/0000:41:00.0/cc_params
Each congestion control parameter appears as one file in this directory.
What the parameters do
The commonly used parameters are explained below, next to the values printed by this loop:
(base) [root@one cc_params]# for f in *; do echo -n "$f "; cat "$f"; done
np_cnp_dscp 48
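The DSCP (Differentiated Services Code Point) value carried by CNPs. The NIC maps the DSCP to a priority and places the packet in the matching queue, and switches use the same value to decide which queue the packet goes through.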
np_cnp_prio 6
The priority used for CNPs. Here they go through priority queue 6.
np_cnp_prio_mode 1
np_min_time_between_cnps 4
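The minimum interval between CNPs. With the value shown here, the NP sends at most one CNP every 4 microseconds. The NP builds the CNP itself after it receives ECN-marked packets.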
Minimum time between sending CNPs from the port. Unit = microseconds.
Default = 0 (no min wait time; generated based on incoming ECN marked packets).
rp_ai_rate 5
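The rate step for the AI (additive increase) stage of the CC algorithm, which is one of several rate-increase stages. Its counterpart is multiplicative decrease, which changes the rate much faster than an additive step does.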
rp_byte_reset 32767
rp_clamp_tgt_rate 0
rp_clamp_tgt_rate_ati 1
rp_dce_tcp_g 1019
rp_dce_tcp_rtt 1
rp_gd 11
rp_hai_rate 50
The rate step for the HAI (hyper additive increase) stage of the CC algorithm. The RP raises its rate according to the feedback it receives.
rp_initial_alpha_value 1023
rp_max_rate 0
Maximum rate at which reaction point node can transmit.
Once this limit is reached, RP is no longer rate limited.
Unit = Mbits/sec
Default = 0 (full speed)
rp_min_dec_fac 50
rp_min_rate 1
rp_rate_reduce_monitor_period 4
rp_rate_to_set_on_first_cnp 0
rp_threshold 1
rp_time_reset 300
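Because each parameter is a plain file, changing one is a write followed by a read-back. The function below is a minimal sketch of that pattern; `set_cc_param` is a hypothetical helper name, and the directory is passed as an argument so the function can also be tried against a scratch directory. Whether the firmware accepts a given value is not something this sketch can know, which is why it prints what it reads back instead of assuming success.

```shell
# set_cc_param DIR NAME VALUE
# Write VALUE to the parameter file DIR/NAME, then print the value read back.
# DIR is the cc_params directory of one device.
set_cc_param() {
    local dir="$1" name="$2" value="$3"
    local file="$dir/$name"

    # Refuse unknown names instead of creating stray files.
    if [ ! -f "$file" ]; then
        echo "no such parameter: $name" >&2
        return 1
    fi

    # This sketch only accepts unsigned decimal integers.
    case "$value" in
        ''|*[!0-9]*)
            echo "value must be a non-negative integer: $value" >&2
            return 1
            ;;
    esac

    printf '%s\n' "$value" > "$file" || return 1

    # Read back: the write may be rejected or adjusted, so report
    # what the file actually contains afterwards.
    cat "$file"
}
```

On a real system this would be run as root with debugfs mounted, for example `set_cc_param /sys/kernel/debug/mlx5/0000:41:00.0/cc_params np_min_time_between_cnps 8`.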
Official description of the parameters:
This patch adds debug control parameters for congestion control which
can be read or written through debugfs. They are for reaction point and
notification point nodes.
These control parameters are as below:
+------------------------------+-----------------------------------------+
| Name                         | Description                             |
|------------------------------+-----------------------------------------|
|rp_clamp_tgt_rate             | When set target rate is updated to      |
|                              | current rate                            |
|------------------------------+-----------------------------------------|
|rp_clamp_tgt_rate_ati         | When set update target rate based on    |
|                              | timer as well                           |
|------------------------------+-----------------------------------------|
|rp_time_reset                 | Time between rate increase if no        |
|                              | CNP is received, unit in usec           |
|------------------------------+-----------------------------------------|
|rp_byte_reset                 | Number of bytes between rate increase if|
|                              | no CNP is received                      |
|------------------------------+-----------------------------------------|
|rp_threshold                  | Threshold for reaction point rate       |
|                              | control                                 |
|------------------------------+-----------------------------------------|
|rp_ai_rate                    | Rate for target rate, unit in Mbps      |
|------------------------------+-----------------------------------------|
|rp_hai_rate                   | Rate for hyper increase state,          |
|                              | unit in Mbps                            |
|------------------------------+-----------------------------------------|
|rp_min_dec_fac                | Minimum factor by which the current     |
|                              | transmit rate can be changed when       |
|                              | processing a CNP, unit is percentage    |
|------------------------------+-----------------------------------------|
|rp_min_rate                   | Minimum value for rate limit,           |
|                              | unit in Mbps                            |
|------------------------------+-----------------------------------------|
|rp_rate_to_set_on_first_cnp   | Rate that is set when first CNP is      |
|                              | received, unit is Mbps                  |
|------------------------------+-----------------------------------------|
|rp_dce_tcp_g                  | Used to calculate alpha                 |
|------------------------------+-----------------------------------------|
|rp_dce_tcp_rtt                | Time between updates of alpha value,    |
|                              | unit is usec                            |
|------------------------------+-----------------------------------------|
|rp_rate_reduce_monitor_period | Minimum time between consecutive rate   |
|                              | reductions                              |
|------------------------------+-----------------------------------------|
|rp_initial_alpha_value        | Initial value of alpha                  |
|------------------------------+-----------------------------------------|
|rp_gd                         | When CNP is received, flow rate is      |
|                              | reduced based on gd, rp_gd is given as  |
|                              | log2(gd)                                |
|------------------------------+-----------------------------------------|
|np_cnp_dscp                   | DSCP code point for generated CNP       |
|------------------------------+-----------------------------------------|
|np_cnp_prio_mode              | CNP priority mode                       |
|------------------------------+-----------------------------------------|
|np_cnp_prio                   | 802.1p priority for generated CNP       |
+------------------------------+-----------------------------------------+
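To make the rp_gd and rp_ai_rate rows more concrete, here is an integer-only sketch of how the two values could enter a rate update. It rests on assumptions that are not in the patch text: alpha is treated as a 10-bit fixed-point value (0..1023), the cut is taken to be rate * (1 - alpha / 2^rp_gd), and the increase step follows the published DCQCN scheme (raise the target, then move halfway toward it). The firmware may behave differently, and both function names are hypothetical.

```shell
# rate_after_cnp RATE_MBPS ALPHA RP_GD
# Assumed cut on CNP: rate * (1 - alpha / 2^rp_gd), in integer arithmetic.
rate_after_cnp() {
    local rate="$1" alpha="$2" gd="$3"
    echo $(( rate - rate * alpha / (1 << gd) ))
}

# rate_after_ai_step CURRENT_MBPS TARGET_MBPS RP_AI_RATE
# Assumed additive-increase step: raise the target by rp_ai_rate, then move
# the current rate halfway toward the new target.
# Prints "current target".
rate_after_ai_step() {
    local current="$1" target="$2" ai="$3"
    target=$(( target + ai ))
    current=$(( (current + target) / 2 ))
    echo "$current $target"
}
```

With the values from the listing above (rp_gd 11, rp_initial_alpha_value 1023, rp_ai_rate 5) and a flow at 100000 Mbps, `rate_after_cnp 100000 1023 11` prints 50049, roughly a halving, and `rate_after_ai_step 50049 100000 5` prints `75027 100005`.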
How setting and getting work on Linux
Mellanox implements this in cong.c in the Linux kernel. The name cong is presumably short for congestion.
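Linux exposes the parameters through the kernel's debugfs so that users can read and configure them. On a write, the driver packs the value into a layout agreed with the firmware and sends it to the NIC as a firmware command. The layout of each firmware-side object is defined in mlx5_ifc.h in the kernel source.
Take cnp_dscp, the DSCP of CNP packets, as an example. The mlx5 driver uses the MLX5_SET macro, which looks up the field's offset in the layout struct mlx5_ifc_cong_control_r_roce_ecn_np_bits and writes the value there. That struct is the firmware-side object mentioned above, and it is the structure that finally gets set.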
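Postscript
The NIC's congestion control parameters are exposed flexibly through Linux debugfs, which is more convenient and direct than the mlxreg command and covers more parameters. Whether you use mlxreg, mlxconfig, or another tool, the settings all end up on the NIC. Keep that main thread in mind and the variety of commands stops being confusing.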
References
Companion post: 【微知】RDMA和拥塞控制领域中的NP和RP是什么? (What are NP and RP in RDMA and congestion control?)
☆☆☆☆☆ Official description:
https://patchwork.kernel.org/project/linux-rdma/patch/20200227125246.99472-1-leon@kernel.org/
The patch that introduced these parameters: https://patchwork.kernel.org/project/linux-rdma/patch/20170530070515.6836-1-leon@kernel.org/