RoCE QoS configuration - Priority mapping

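This article describes in detail how to configure QoS for RoCEv1 and RoCEv2 traffic using L2 PCP and L3 DSCP, including how to set the priority through VPI verbs and RDMA_CM, and gives concrete configuration examples on a Redhat 7.4 system with a ConnectX-5 adapter and an MLNX ONYX switch.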


After thinking it over for a long time, I decided to write this in English so that the wording is more precise.

1. Priority configuration for RoCE

Both VPI verbs and RDMA_CM provide APIs for RoCE QoS configuration. With mlnx_ofed installed on the system, an application can use several methods to set the priority of RoCE traffic. Details are provided in this section.

1.1. Trust L2 - PCP based QoS

PCP based QoS can be configured for RoCEv1 and RoCEv2 traffic.

1.1.1. With VPI verbs

QPs can be created with a specific SL (Service Level, 4 bits, range 0-15) by the application, e.g.:

struct ibv_qp_attr attr = {0};
attr.qp_state = IBV_QPS_RTR;
attr.ah_attr.sl = xxx;            /* Service Level, 0-15 */
ibv_modify_qp(qp, &attr, flags);  /* flags must include IBV_QP_AV for ah_attr to apply */

The device driver maps the SL to the PCP field (3 bits) of the VLAN header. PCP carries the UP (User Priority) and takes the 3 LSBs of the SL:

UP = SL & 0x7

The UP is then mapped to a HW Traffic Class (TC); the UP-to-TC mapping can be checked and modified with “mlnx_qos” and its “--prio_tc=LIST” option.

The mapping from SL to TC in mode of PCP based QoS:

Service Level | User Priority | HW Traffic Class
0, 8 | 0 | default 0, configurable with “mlnx_qos --prio_tc”
1, 9 | 1 | default 1, configurable with “mlnx_qos --prio_tc”
2, 10 | 2 | default 2, configurable with “mlnx_qos --prio_tc”
3, 11 | 3 | default 3, configurable with “mlnx_qos --prio_tc”
4, 12 | 4 | default 4, configurable with “mlnx_qos --prio_tc”
5, 13 | 5 | default 5, configurable with “mlnx_qos --prio_tc”
6, 14 | 6 | default 6, configurable with “mlnx_qos --prio_tc”
7, 15 | 7 | default 7, configurable with “mlnx_qos --prio_tc”
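
The SL-to-UP rule can be checked with a few lines of Python. This is only a sketch of the arithmetic, not driver code: the names `sl_to_up` and `DEFAULT_PRIO_TC` are made up for illustration, and the identity UP-to-TC mapping is simply the default listed in the table.

```python
def sl_to_up(sl: int) -> int:
    """Return the User Priority (PCP) for a Service Level: the 3 LSBs of SL."""
    if not 0 <= sl <= 15:
        raise ValueError("SL is a 4-bit value (0-15)")
    return sl & 0x7


# Default UP -> TC mapping taken from the table (identity).
# "mlnx_qos --prio_tc" can change it on a real system.
DEFAULT_PRIO_TC = list(range(8))

if __name__ == "__main__":
    for sl in range(16):
        up = sl_to_up(sl)
        print(f"SL {sl:2d} -> UP {up} -> TC {DEFAULT_PRIO_TC[up]} (default)")
```

Because only the low 3 bits are used, SL 3 and SL 11 end up on the same User Priority.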

“ib_write_bw -S [SL value]” can be used to check whether RoCE traffic is transmitted via the expected priority.

1.1.2. With RDMA_CM

ToS can be configured on RDMA_CM QPs via the rdma_set_option() API with option RDMA_OPTION_ID_TOS.
The Linux command “cma_roce_tos” can be used to set the default ToS.

In the kernel, ToS {0, 8, 24, 16} is mapped to SKB Priority {0, 2, 4, 6} respectively; the reference kernel code is include/net/route.h::rt_tos2priority(u8 tos).

The device driver maps the SKB Priority to a User Priority. This mapping is defined by the user: “ip link set dev [vlan interface] type vlan egress-qos-map [sk_prio2egress_prio mapping]” sets it from the Linux command line.

UP would be mapped to HW Traffic Class, “mlnx_qos -i [interface] --prio_tc=LIST” is for user setting.

The mapping from ToS to TC in mode of PCP based QoS:

Type of Service | SKB Priority | User Priority | HW Traffic Class
0…7, 32…39, 64…71, 96…103, 128…135, 160…167, 192…199, 224…231 | 0 | User-defined with “egress-qos-map” | Configurable with “mlnx_qos --prio_tc”
8…15, 40…47, 72…79, 104…111, 136…143, 168…175, 200…207, 232…239 | 2 | User-defined with “egress-qos-map” | Configurable with “mlnx_qos --prio_tc”
16…23, 48…55, 80…87, 112…119, 144…151, 176…183, 208…215, 240…247 | 6 | User-defined with “egress-qos-map” | Configurable with “mlnx_qos --prio_tc”
24…31, 56…63, 88…95, 120…127, 152…159, 184…191, 216…223, 248…255 | 4 | User-defined with “egress-qos-map” | Configurable with “mlnx_qos --prio_tc”
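
To make the ToS-to-SKB-priority step concrete, here is a small Python transcription of what rt_tos2priority() does. It is a sketch written from the kernel's `ip_tos2prio` table and the `TC_PRIO_*` constants as I understand them, so verify it against the sources of the kernel you actually run. It reproduces the {0, 8, 24, 16} to {0, 2, 4, 6} mapping stated above: ToS 24 gives SKB priority 4 and ToS 16 gives SKB priority 6.

```python
# TC_PRIO_* values as defined in the kernel's pkt_sched.h
TC_PRIO_BESTEFFORT = 0
TC_PRIO_BULK = 2
TC_PRIO_INTERACTIVE_BULK = 4
TC_PRIO_INTERACTIVE = 6

IPTOS_TOS_MASK = 0x1E  # only bits 1..4 of the ToS byte select the priority

# Transcription of the kernel's 16-entry ip_tos2prio[] table.
IP_TOS2PRIO = (
    [TC_PRIO_BESTEFFORT] * 4
    + [TC_PRIO_BULK] * 4
    + [TC_PRIO_INTERACTIVE] * 4
    + [TC_PRIO_INTERACTIVE_BULK] * 4
)


def rt_tos2priority(tos: int) -> int:
    """Map an IPv4 ToS byte (0-255) to an SKB priority."""
    return IP_TOS2PRIO[(tos & IPTOS_TOS_MASK) >> 1]


if __name__ == "__main__":
    for tos in (0, 8, 16, 24):
        print(f"ToS {tos:3d} -> SKB priority {rt_tos2priority(tos)}")
```

The three most significant bits of ToS are masked out, which is why every ToS range in the table repeats with a period of 32.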

“ib_write_bw -R --tos [ToS value]” can be used to check whether RoCE traffic is transmitted via expected priority.

1.2. Trust L3 - DSCP based QoS

DSCP based QoS can be configured for RoCEv2 traffic; it is not applicable to RoCEv1.

1.2.1. With VPI Verbs

QPs can be created with specific TClass (Traffic Class, 8 bits, range 0-255) from application.
The device driver maps TClass to the DSCP field (6 bits) in the IP header. DSCP takes the 6 MSBs of TClass:

DSCP = TClass >> 2

The device driver maps DSCP to a User Priority; the mapping can be configured by the user with the Linux command “mlnx_qos -i [interface] --dscp2prio=DSCP2PRIO”.

UP would be mapped to HW Traffic Class, “mlnx_qos -i [interface] --prio_tc=LIST” is for user to configure the mapping.

The mapping from TClass to TC in mode of DSCP based QoS:

TClass | DSCP | User Priority | HW Traffic Class
0…31 | 0…7 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc”
32…63 | 8…15 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc”
64…95 | 16…23 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc”
96…127 | 24…31 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc”
128…159 | 32…39 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc”
160…191 | 40…47 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc”
192…223 | 48…55 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc”
224…255 | 56…63 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc”
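
The TClass/DSCP relationship is plain bit arithmetic, sketched below in Python. The helper names are illustrative only. The reverse helper assumes the standard layout of the IP traffic-class byte, where the two low bits are the ECN field.

```python
def tclass_to_dscp(tclass: int) -> int:
    """DSCP is the 6 most significant bits of the 8-bit TClass."""
    if not 0 <= tclass <= 255:
        raise ValueError("TClass is an 8-bit value (0-255)")
    return tclass >> 2


def dscp_to_tclass(dscp: int, ecn: int = 0) -> int:
    """Build a TClass byte from a DSCP value and the 2 ECN bits."""
    if not 0 <= dscp <= 63 or not 0 <= ecn <= 3:
        raise ValueError("DSCP is 6 bits (0-63), ECN is 2 bits (0-3)")
    return (dscp << 2) | ecn


if __name__ == "__main__":
    print(tclass_to_dscp(106))    # DSCP carried by TClass 106
    print(dscp_to_tclass(26, 2))  # TClass for DSCP 26 with ECN bits 0b10
```

Four consecutive TClass values share one DSCP, so TClass 104 to 107 all carry DSCP 26.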

“ib_write_bw --tclass [TClass value]” can be used to check whether RoCE traffic is transmitted via expected priority.

1.2.2. With RDMA_CM

In this mode, ToS (Type of Service) has the range 0-255 and is treated by the device driver as the TClass value.
ToS can be configured on RDMA_CM QPs via the rdma_set_option() API with option RDMA_OPTION_ID_TOS.
The Linux command “cma_roce_tos” can be used to set the default ToS.
The mappings “ToS (TClass) – DSCP – User Priority – HW Traffic Class” are the same as described for the VPI verbs method in DSCP based QoS mode:

ToS | DSCP | User Priority | HW Traffic Class
0…31 | 0…7 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc”
32…63 | 8…15 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc”
64…95 | 16…23 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc”
96…127 | 24…31 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc”
128…159 | 32…39 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc”
160…191 | 40…47 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc”
192…223 | 48…55 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc”
224…255 | 56…63 | Configurable with “mlnx_qos --dscp2prio” | Configurable with “mlnx_qos --prio_tc”

“ib_write_bw -R --tos [ToS value]” can be used to check whether RoCE traffic is transmitted via expected priority.

1.3. Summary

The table below summarizes the priority configuration for RoCE traffic described above, including:

  • traffic type - RoCEv1 and RoCEv2
  • QoS type - L2 PCP and L3 DSCP
  • Configuration method - VPI verbs and RDMA_CM
  • Verification with ib_write_bw - with options to check if it is the expected configuration

RoCE Traffic | QoS | Configuration | Verification
RoCEv1 | L2 PCP | VPI verbs, SL [0…15] | ib_write_bw -S
RoCEv1 | L2 PCP | RDMA_CM, ToS [0, 8, 24, 16] | ib_write_bw -R --tos
RoCEv2 | L2 PCP | VPI verbs, SL [0…15] | ib_write_bw -S
RoCEv2 | L2 PCP | RDMA_CM, ToS [0, 8, 24, 16] | ib_write_bw -R --tos
RoCEv2 | L3 DSCP | VPI verbs, TClass [0…255] | ib_write_bw --tclass
RoCEv2 | L3 DSCP | RDMA_CM, ToS [0…255] | ib_write_bw -R --tos

2. Configuration example

This section provides examples of the configuration described above.

2.1. RoCEv2 over Lossless Fabric (ECN+PFC) with Trust L2 QoS

2.1.1. Host configuration

This example provides configuration on host with Redhat 7.4. Dual-port ConnectX-5 is used, and ethernet port ens1f1 is used in the setup.

Enable the 8021q Linux kernel module on hosts (needed for traffic that passes through the kernel; not required for RoCE, which bypasses the kernel)

[root]# modprobe 8021q
[root]# lsmod | grep 8021q
8021q 33208 0
garp 14384 1 8021q
mrp 18542 1 8021q

Set VLAN on hosts

[root]# ip link add link ens1f1 name ens1f1.100 type vlan id 100
[root]# ip link show
23: ens1f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4200 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:ae:11:dd brd ff:ff:ff:ff:ff:ff
24: ens1f1.100@ens1f1: <BROADCAST,MULTICAST> mtu 4200 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:ae:11:dd brd ff:ff:ff:ff:ff:ff

Set IP for VLAN interface on hosts

[root]# ifconfig ens1f1.100 10.0.0.10

Configure PFC on hosts (enable PFC for priority#3)

[root]# mlnx_qos -i ens1f1 --pfc=0,0,0,1,0,0,0,0
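
The --pfc argument is a list of eight 0/1 flags, one per priority from 0 to 7. As a convenience, the following Python sketch builds that string from a set of priorities; `pfc_arg` is a hypothetical helper, not part of mlnx_qos.

```python
def pfc_arg(enabled_priorities) -> str:
    """Build the 8-entry 0/1 list for "mlnx_qos --pfc" (position = priority)."""
    enabled = set(enabled_priorities)
    if not enabled <= set(range(8)):
        raise ValueError("priorities must be in the range 0-7")
    return ",".join("1" if prio in enabled else "0" for prio in range(8))


if __name__ == "__main__":
    # PFC on priority 3 only, as in this example
    print(f"mlnx_qos -i ens1f1 --pfc={pfc_arg([3])}")
```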

Enable ECN on priority 3 (optional, ECN is enabled by default)

[root]# echo 1 > /sys/class/net/ens1f1/ecn/roce_np/enable/3
[root]# echo 1 > /sys/class/net/ens1f1/ecn/roce_rp/enable/3

Set CNP L2 egress priority on 6 (optional, it is the default value)

[root]# echo 6 > /sys/class/net/ens1f1/ecn/roce_np/cnp_802p_prio

Enable ECN on the TCP traffic (optional, only when it is required for kernel TCP traffic)

[root]# sysctl -w net.ipv4.tcp_ecn=1
net.ipv4.tcp_ecn = 1

Set RoCE mode to V2 for RDMA CM traffic

[root]# cma_roce_mode -d mlx5_1 -p 1 -m 2
RoCE v2

Set default ToS to 24, it is mapped to skprio 4:

[root]# cma_roce_tos -d mlx5_1 -t 24
24

Set the egress priority map (skprio#4 mapped to user priority#3)

[root]# ip link set dev ens1f1.100 type vlan egress-qos-map 4:3
[root]# cat /proc/net/vlan/ens1f1.100
ens1f1.100 VID: 100 REORDER_HDR: 1 dev->priv_flags: 1
total frames received 0
total bytes received 0
Broadcast/Multicast Rcvd 0
total frames transmitted 0
total bytes transmitted 0
Device: ens1f1
INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
EGRESS priority mappings: 4:3

With this configuration, RoCEv2 traffic gets the following default priority mapping, which takes effect on RDMA_CM QPs:

ToS | SKB Priority | User Priority | HW Traffic Class
24 | 4 | 3 | 3
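
The whole chain of this example (ToS, SKB priority, User Priority, Traffic Class) can be traced with a short Python sketch. Several things here are assumptions for illustration: the function names are made up, the ToS table is a transcription of the kernel's `ip_tos2prio`, SKB priorities with no egress-qos-map entry are taken to fall back to UP 0, and the UP-to-TC mapping is the identity default.

```python
def tos_to_skprio(tos: int) -> int:
    """SKB priority for a ToS byte (transcription of the kernel table)."""
    table = [0] * 4 + [2] * 4 + [6] * 4 + [4] * 4
    return table[(tos & 0x1E) >> 1]


def parse_egress_qos_map(spec: str) -> dict:
    """Parse an egress-qos-map string such as "4:3" or "0:0 4:3"."""
    mapping = {}
    for pair in spec.split():
        skprio, up = pair.split(":")
        mapping[int(skprio)] = int(up)
    return mapping


def tos_to_tc(tos: int, egress_map: dict, prio_tc) -> tuple:
    """Return (SKB priority, User Priority, Traffic Class) for a ToS."""
    skprio = tos_to_skprio(tos)
    up = egress_map.get(skprio, 0)  # assumption: unmapped skprio -> UP 0
    return skprio, up, prio_tc[up]


if __name__ == "__main__":
    egress = parse_egress_qos_map("4:3")
    print(tos_to_tc(24, egress, list(range(8))))
```

With the values configured above (ToS 24, egress-qos-map 4:3, default prio_tc) this gives SKB priority 4, User Priority 3 and Traffic Class 3.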

2.1.2. Switch configuration

This example provides configuration on switch with MLNX ONYX, it has port 1/3 and 1/4 used in the setup.

Enable VLAN on switch ethernet interfaces

(config) # vlan 100
(config) # interface ethernet 1/3 switchport mode hybrid
(config) # interface ethernet 1/4 switchport mode hybrid
(config) # interface ethernet 1/3 switchport hybrid allowed-vlan 100
(config) # interface ethernet 1/4 switchport hybrid allowed-vlan 100

Set trust mode on switch ethernet interfaces

(config) # interface ethernet 1/3 qos trust L2
(config) # interface ethernet 1/4 qos trust L2

Create lossless buffer pool on switch

(config) # traffic pool roce_lossless type lossless
(config) # traffic pool roce_lossless memory percent 50.00

Map switch priority 3 to lossless buffer pool

(config) # traffic pool roce_lossless map switch-priority 3

2.1.3. Verification

To verify the settings, generate RoCEv2 traffic on hosts

[server] # ib_write_bw -d [mlx device] -x [GID index] --run_infinitely -R
[client] # ib_write_bw -d [mlx device] -x [GID index] --run_infinitely -R [IP address]

Observe priority-based counters on host:

  • RoCEv2 traffic is expected to be transmitted via priority#3
  • (if any) PFC is expected to be transmitted via priority#3
  • (if any) CNP is expected to be transmitted via priority#6

[root]# ethtool -S ens1f1 | grep prio
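
When the counter list is long, a small filter helps to see which priorities actually carry traffic. The Python sketch below assumes per-priority counters named like `rx_prio3_bytes` or `tx_prio3_packets`; check the exact names on your driver. The sample text is made up purely to exercise the parser and is not real output.

```python
import re
from collections import defaultdict

# Assumed counter naming: <rx|tx>_prio<N>_<name>: <value>
PRIO_COUNTER = re.compile(r"^\s*(rx|tx)_prio(\d)_(\w+):\s*(\d+)\s*$")


def nonzero_prio_counters(ethtool_output: str) -> dict:
    """Group non-zero per-priority counters by priority number."""
    result = defaultdict(dict)
    for line in ethtool_output.splitlines():
        match = PRIO_COUNTER.match(line)
        if match and int(match.group(4)) > 0:
            direction, prio, name, value = match.groups()
            result[int(prio)][f"{direction}_{name}"] = int(value)
    return dict(result)


# Hypothetical sample, NOT captured from a real system
SAMPLE = """\
     rx_prio3_bytes: 1000
     tx_prio3_packets: 10
     tx_prio6_packets: 0
     rx_prio0_bytes: 0
"""

if __name__ == "__main__":
    print(nonzero_prio_counters(SAMPLE))
```

On a host, the text could come from `subprocess.run(["ethtool", "-S", "ens1f1"], capture_output=True, text=True).stdout`.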

Observe priority-based counters on switch ports:

  • RoCEv2 traffic is expected to be transmitted via tc 3
  • (if any) CNP is expected to be transmitted via tc 6

(config) # show interfaces ethernet [switch port] counters tc all

Observe traffic pool-based counters on switch ports, RoCEv2 traffic is expected to be transmitted via pg 1

(config) # show interfaces ethernet [switch port] counters pg all

Observe PFC counters on switch ports, (if any) PFC is expected to be observed in PFC 3

(config) # show interface ethernet [switch port] counters pfc prio all

2.2. RoCEv2 over Lossless Fabric (ECN+PFC) with Trust L3 QoS

2.2.1. Host configuration

This example provides configuration on host with Redhat 7.4. Dual-port ConnectX-5 is used, and ethernet port ens1f1 is used in the setup.

Set trust mode to DSCP on hosts

[root]# mlnx_qos -i ens1f1 --trust=dscp

Set the default ToS to 106 (DSCP 26) for all RoCE traffic on the port. It takes effect on QPs created with RDMA_CM when the application does not call rdma_set_option() with option RDMA_OPTION_ID_TOS:

[root]# cma_roce_tos -d [mlx-device] -t 106

Set the global TClass to 106 (DSCP 26) for all RoCE traffic on the port. Note that this is a globally forced value: it is applied to all QPs and takes precedence over both the cma_roce_tos setting and any value specified by the user application:

[root]# echo 106 > /sys/class/infiniband/[mlx-device]/tc/1/traffic_class

Rules can be added to QPs for the global forced value. In the example, TClass will be set on loopback traffic with IP address 11.7.156.133:

[root]# echo "src_ip=11.7.156.133/32,dst_ip=11.7.156.133/32,tclass=106" > /sys/class/infiniband/mlx5_0/tc/1/traffic_class

Configure PFC on hosts (enable PFC for priority#3)

[root]# mlnx_qos -i ens1f1 --pfc=0,0,0,1,0,0,0,0

Enable ECN on priority 3 (optional, ECN is enabled by default)

[root]# echo 1 > /sys/class/net/ens1f1/ecn/roce_np/enable/3
[root]# echo 1 > /sys/class/net/ens1f1/ecn/roce_rp/enable/3

Set the CNP L3 egress DSCP to 48, which corresponds to priority 6 in this setup (optional, it is the default value)

[root]# echo 48 > /sys/class/net/ens1f1/ecn/roce_np/cnp_dscp

Enable ECN on the TCP traffic (optional, only when it is required for kernel TCP traffic)

[root]# sysctl -w net.ipv4.tcp_ecn=1
net.ipv4.tcp_ecn = 1

Set RoCE mode to V2 for RDMA CM traffic

[root]# cma_roce_mode -d mlx5_1 -p 1 -m 2
RoCE v2

With this configuration, RoCEv2 traffic gets the following default priority mapping, which takes effect on QPs created with VPI verbs and with RDMA_CM:

TClass | ToS | DSCP | User Priority | HW Traffic Class
106 | 106 | 26 | 3 | 3
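
The Trust L3 chain can be traced the same way. The Python sketch below is illustrative only, and it rests on one loudly stated assumption: that the dscp2prio mapping in effect assigns eight consecutive DSCP values to each priority (UP = DSCP >> 3). That assumption matches this example, where DSCP 26 lands on priority 3 and the CNP DSCP 48 lands on priority 6, but the real mapping should be read back with mlnx_qos on your system.

```python
def tclass_to_dscp(tclass: int) -> int:
    """DSCP is the 6 most significant bits of TClass."""
    return tclass >> 2


def assumed_dscp_to_up(dscp: int) -> int:
    """ASSUMPTION: eight consecutive DSCP values per User Priority."""
    return dscp >> 3


def tclass_to_tc(tclass: int, dscp2prio=assumed_dscp_to_up,
                 prio_tc=tuple(range(8))) -> tuple:
    """Return (DSCP, User Priority, Traffic Class) for a TClass/ToS value."""
    dscp = tclass_to_dscp(tclass)
    up = dscp2prio(dscp)
    return dscp, up, prio_tc[up]


if __name__ == "__main__":
    print(tclass_to_tc(106))       # RoCEv2 data traffic in this example
    print(assumed_dscp_to_up(48))  # priority used by CNPs with cnp_dscp = 48
```

Passing a different `dscp2prio` callable or `prio_tc` sequence models a non-default configuration.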

2.2.2. Switch configuration

This example provides configuration on switch with MLNX ONYX, it has port 1/3 and 1/4 used in the setup.

Set trust mode on switch ethernet interfaces

(config) # interface ethernet 1/3 qos trust L3
(config) # interface ethernet 1/4 qos trust L3

Create lossless buffer pool on switch

(config) # traffic pool roce_lossless type lossless
(config) # traffic pool roce_lossless memory percent 50.00

Map switch priority 3 to lossless buffer pool

(config) # traffic pool roce_lossless map switch-priority 3

2.2.3. Verification

To verify the settings, generate RoCEv2 traffic on hosts with RDMA_CM QPs

[server] # ib_write_bw -d [mlx device] -x [GID index] --run_infinitely -R
[client] # ib_write_bw -d [mlx device] -x [GID index] --run_infinitely -R [IP address]

Or generate RoCEv2 traffic on hosts with QPs created by VPI verbs

[server] # ib_write_bw -d [mlx device] -x [GID index] --run_infinitely
[client] # ib_write_bw -d [mlx device] -x [GID index] --run_infinitely [IP address]

Observe priority-based counters on host:

  • RoCEv2 traffic is expected to be transmitted via priority 3
  • (if any) PFC is expected to be transmitted via priority 3
  • (if any) CNP is expected to be transmitted via priority 6

[root] # ethtool -S ens1f1 | grep prio

Observe priority-based counters on switch ports:

  • RoCEv2 traffic is expected to be transmitted via tc 3
  • (if any) CNP is expected to be transmitted via tc 6

(config) # show interfaces ethernet [switch port] counters tc all

Observe traffic pool-based counters on switch ports, RoCEv2 traffic is expected to be transmitted via pg 1

(config) # show interfaces ethernet [switch port] counters pg all

Observe PFC counters on switch ports, (if any) PFC is expected to be observed in PFC 3

(config) # show interface ethernet [switch port] counters pfc prio all
