Mbuf Library

The mbuf library provides the ability to allocate and free buffers (mbufs) that may be used by the DPDK application to store message buffers. The message buffers are stored in a mempool, using the Mempool Library.

An rte_mbuf struct generally carries network packet buffers, but it can actually carry any data (control data, events, …). The rte_mbuf header structure is kept as small as possible and currently uses just two cache lines, with the most frequently used fields on the first of the two.

  • 1、Design of Packet Buffers

    For the storage of the packet data (including protocol headers), two approaches were considered:

    • Embed the metadata within a single memory buffer: the structure is followed by a fixed-size area for the packet data.
    • Use separate memory buffers for the metadata structure and for the packet data.

    The advantage of the first method is that it only needs one operation to allocate/free the whole memory representation of a packet. On the other hand, the second method is more flexible and allows the complete separation of the allocation of metadata structures from the allocation of packet data buffers.

    The first method was chosen for the DPDK. The metadata contains control information such as message type, length, offset to the start of the data and a pointer for additional mbuf structures allowing buffer chaining.

    Message buffers that are used to carry network packets can handle buffer chaining where multiple buffers are required to hold the complete packet. This is the case for jumbo frames that are composed of many mbufs linked together through their next field.
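
    As a small illustration of chaining, the following sketch (the helper name walk_mbuf_chain is ours, not a DPDK API) walks a multi-segment packet through the next field and checks the per-segment lengths against pkt_len:

        #include <stdio.h>
        #include <rte_mbuf.h>

        /* Walk every segment of a (possibly jumbo) packet chained
         * through the next field. */
        static void
        walk_mbuf_chain(const struct rte_mbuf *m)
        {
                uint32_t total = 0;
                unsigned int seg = 0;
                const struct rte_mbuf *s;

                for (s = m; s != NULL; s = s->next) {
                        printf("segment %u: data_len=%u\n", seg++, s->data_len);
                        total += s->data_len;
                }
                /* In the first mbuf of a chain, pkt_len is the sum of all
                 * segments' data_len and nb_segs is the segment count. */
                printf("pkt_len=%u (sum=%u), nb_segs=%u\n",
                       m->pkt_len, total, m->nb_segs);
        }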

    For a newly allocated mbuf, the area at which the data begins in the message buffer is RTE_PKTMBUF_HEADROOM bytes after the beginning of the buffer, which is cache aligned. Message buffers may be used to carry control information, packets, events, and so on between different entities in the system. Message buffers may also use their buffer pointers to point to other message buffer data sections or other structures.
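
    A minimal sketch of how this headroom is visible through the API, assuming m is a freshly allocated packet mbuf:

        #include <stdio.h>
        #include <rte_mbuf.h>

        static void
        show_headroom(const struct rte_mbuf *m)
        {
                /* rte_pktmbuf_mtod() points at the start of the packet data,
                 * RTE_PKTMBUF_HEADROOM bytes past buf_addr on a fresh mbuf. */
                printf("headroom=%u, data offset=%u\n",
                       rte_pktmbuf_headroom(m),
                       (unsigned)(rte_pktmbuf_mtod(m, const char *) -
                                  (const char *)m->buf_addr));
        }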

  • 2、Buffers Stored in Memory Pools

    The Buffer Manager uses the Mempool Library to allocate buffers. Therefore, it ensures that the packet header is interleaved optimally across the channels and ranks for L3 processing. An mbuf contains a field indicating the pool that it originated from. When calling rte_pktmbuf_free(m), the mbuf returns to its original pool.
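
    A sketch of that round trip, assuming mp is an existing packet mempool: whichever core frees the mbuf, it goes back to the pool recorded in its pool field.

        #include <rte_debug.h>
        #include <rte_mbuf.h>
        #include <rte_mempool.h>

        static void
        pool_round_trip(struct rte_mempool *mp)
        {
                struct rte_mbuf *m = rte_pktmbuf_alloc(mp);

                if (m == NULL)
                        return;                /* pool exhausted */
                RTE_ASSERT(m->pool == mp);     /* origin pool recorded in the mbuf */
                rte_pktmbuf_free(m);           /* back to mp (or a per-core cache) */
        }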

  • 3、Constructors

    Packet mbuf constructors are provided by the API. The rte_pktmbuf_init() function initializes some fields in the mbuf structure that are not modified by the user once created (mbuf type, origin pool, buffer start address, and so on). This function is given as a callback function to the rte_mempool_create() function at pool creation time.
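
    Most applications use the rte_pktmbuf_pool_create() helper, which wraps this: it creates the mempool and registers rte_pktmbuf_init() (and the pool constructor rte_pktmbuf_pool_init()) for it. A minimal sketch; the pool name and sizes are illustrative:

        #include <stdlib.h>
        #include <rte_debug.h>
        #include <rte_errno.h>
        #include <rte_mbuf.h>
        #include <rte_mempool.h>

        static struct rte_mempool *
        create_pktmbuf_pool(int socket_id)
        {
                struct rte_mempool *mp;

                mp = rte_pktmbuf_pool_create("pktmbuf_pool",
                        8192,                      /* number of mbufs */
                        256,                       /* per-lcore cache size */
                        0,                         /* application private area */
                        RTE_MBUF_DEFAULT_BUF_SIZE, /* headroom + dataroom */
                        socket_id);
                if (mp == NULL)
                        rte_exit(EXIT_FAILURE, "mbuf pool: %s\n",
                                 rte_strerror(rte_errno));
                return mp;
        }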

  • 4、Allocating and Freeing mbufs

    Allocating a new mbuf requires the user to specify the mempool from which the mbuf should be taken. A newly allocated mbuf contains one segment with a length of 0. The offset to the data is initialized to leave some bytes of headroom in the buffer (RTE_PKTMBUF_HEADROOM).

    Freeing an mbuf means returning it to its original mempool. The content of an mbuf is not modified while it is stored in a pool (as a free mbuf), so fields initialized by the constructor do not need to be re-initialized at allocation.

    When freeing a packet mbuf that contains several segments, all of them are freed and returned to their original mempool.
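
    A sketch of the full cycle under the rules above, assuming mp is an existing packet pool:

        #include <string.h>
        #include <rte_debug.h>
        #include <rte_mbuf.h>

        static void
        alloc_fill_free(struct rte_mempool *mp)
        {
                struct rte_mbuf *m = rte_pktmbuf_alloc(mp);
                char *p;

                if (m == NULL)
                        return;
                /* Fresh mbuf: one zero-length segment, headroom reserved. */
                RTE_ASSERT(m->nb_segs == 1 && m->data_len == 0);

                /* Grow the data area by 64 bytes and fill it. */
                p = rte_pktmbuf_append(m, 64);
                if (p != NULL)
                        memset(p, 0xab, 64);

                /* Frees every segment back to its original mempool. */
                rte_pktmbuf_free(m);
        }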

  • 5、Meta Information

    An mbuf also contains the input port (where it comes from), and the number of segment mbufs in the chain.

    For chained buffers, only the first mbuf of the chain stores this meta information.

    For instance, on the RX side this is the case for the IEEE 1588 packet timestamp mechanism, VLAN tagging, and IP checksum computation.

    On the TX side, an application can also delegate some processing to the hardware if it supports it. For instance, the RTE_MBUF_F_TX_IP_CKSUM flag allows the application to offload computation of the IPv4 checksum.
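
    A sketch of requesting that offload on a packet; it assumes the port was configured with the corresponding TX offload capability (RTE_ETH_TX_OFFLOAD_IPV4_CKSUM) and that the frame starts with a plain Ethernet header:

        #include <rte_ether.h>
        #include <rte_ip.h>
        #include <rte_mbuf.h>

        static void
        request_ipv4_cksum_offload(struct rte_mbuf *m)
        {
                struct rte_ipv4_hdr *ip;

                /* Tell the NIC where the IPv4 header sits. */
                m->l2_len = sizeof(struct rte_ether_hdr);
                m->l3_len = sizeof(struct rte_ipv4_hdr);
                m->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CKSUM;

                /* Drivers expect the checksum field zeroed beforehand. */
                ip = rte_pktmbuf_mtod_offset(m, struct rte_ipv4_hdr *,
                                             m->l2_len);
                ip->hdr_checksum = 0;
        }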

  • 6、mbuf Flow in a DPDK Program

    When a program running on a core wants an mbuf, it requests one from the ring that backs the mempool, and at the same time requests a few extra to place in a per-core cache for fast allocation next time. The mbufs are put onto this ring-backed queue when the pktmbuf pool is created. When NIC reception is enabled, one mbuf is allocated for each receive descriptor and the bus address of the mbuf's dataroom area is written into the corresponding descriptor field, so that during DMA the NIC can deposit received packets directly into host memory.

    mbuf flow when the NIC receives packets:

    When the interface comes up, DPDK allocates an mbuf for every descriptor on each receive queue and DMA-maps the bus address of its dataroom. Since the base address and length of the descriptor ring are written into NIC registers, the hardware can operate on the descriptors.

    After the hardware receives a valid packet, it copies the packet to the DMA address configured in an available descriptor and writes back various fields of that descriptor.

    When software polls for packets, it first checks whether any descriptor's bound DMA address has been filled with a packet. For Intel NICs, this is generally done by checking whether the descriptor's DD (descriptor done) bit is set to 1.

    When a completed descriptor is available, the receive function parses its contents, fetches the mbuf bound to it, and fills in several mbuf fields from the descriptor fields, preserving the result of parsing the descriptor.

    Before returning this mbuf to the upper layer, the software must allocate a new mbuf and write the bus address of its dataroom start into the descriptor. The logic is a swap trick (the Chinese idiom is "swapping a leopard cat for the crown prince"): the mbuf already filled with a packet is handed up while an empty mbuf takes its place on the descriptor.

    If the mbuf allocation fails, there is no fresh mbuf to refill the descriptor and reception stops. DPDK keeps an internal counter of mbuf allocation failures that is incremented in this case; when an interface stops receiving, this counter can be checked to confirm whether an mbuf leak caused the allocation failures and hence the RX stall.
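
    From the application's side, all of this descriptor handling is hidden behind rte_eth_rx_burst(). A sketch of a receive loop that also watches the allocation-failure counter described above (rx_nombuf in struct rte_eth_stats); real code would process the packets instead of freeing them:

        #include <inttypes.h>
        #include <stdio.h>
        #include <rte_ethdev.h>
        #include <rte_mbuf.h>

        #define BURST_SIZE 32

        static void
        rx_loop(uint16_t port_id, uint16_t queue_id)
        {
                struct rte_mbuf *bufs[BURST_SIZE];
                struct rte_eth_stats stats;
                uint64_t prev_nombuf = 0;

                for (;;) {
                        /* The PMD parses ready descriptors, returns the filled
                         * mbufs, and rebinds fresh mbufs to the descriptors. */
                        uint16_t i, nb = rte_eth_rx_burst(port_id, queue_id,
                                                          bufs, BURST_SIZE);
                        for (i = 0; i < nb; i++)
                                rte_pktmbuf_free(bufs[i]);

                        /* A growing rx_nombuf with no packets arriving hints
                         * at an mbuf leak starving the RX descriptors. */
                        if (rte_eth_stats_get(port_id, &stats) == 0 &&
                            stats.rx_nombuf != prev_nombuf) {
                                printf("rx_nombuf=%" PRIu64 "\n",
                                       stats.rx_nombuf);
                                prev_nombuf = stats.rx_nombuf;
                        }
                }
        }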

    mbuf flow when the NIC transmits packets:

    When the NIC transmits, the upper layer passes an array of pointers to the mbufs to be sent into the transmit function. The transmit function allocates a free TX descriptor for each outgoing packet; as on the RX side, the bus address of the mbuf's dataroom start is written into the descriptor, and some mbuf fields are also used to fill in the TX descriptor.

    This raises a question: after writing the bus address of the mbuf's dataroom into the descriptor, we do not wait for the hardware to finish transmitting before freeing the mbuf. So where is the mbuf freed? Is it never freed?

    Checking inside the transmit function whether each packet has completed and then freeing the mbuf immediately would work, but the extra waiting costs performance.

    The transmit functions of Intel NIC drivers apply the following optimization:

    When a free TX descriptor is obtained, the driver checks whether an mbuf is already bound to it. If so, that packet has already been transmitted, and the mbuf is freed. Thus the mbuf bound to a descriptor on one transmit is freed the next time that descriptor becomes free and is allocated by software again; this preserves correctness while improving performance.

    Some drivers also use the tx_free_thresh threshold: when the number of free descriptors falls below this value, the driver rescans the descriptor ring to reclaim descriptors that have completed.
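
    A deliberately simplified sketch of this deferred free; the tx_slot structure is hypothetical and stands in for a driver's software ring entry, not any real NIC's descriptor layout:

        #include <rte_mbuf.h>

        /* Hypothetical software bookkeeping for one TX descriptor slot. */
        struct tx_slot {
                struct rte_mbuf *mbuf;  /* mbuf bound on the previous send */
                /* ... hardware fields: DMA address, length, flags ... */
        };

        /* Called when this slot is free again (e.g. its done bit is set)
         * and is being reused for a new packet. */
        static void
        reuse_tx_slot(struct tx_slot *slot, struct rte_mbuf *new_m)
        {
                /* Free the mbuf transmitted from this slot last time;
                 * the hot TX path never waits on the hardware. */
                if (slot->mbuf != NULL)
                        rte_pktmbuf_free(slot->mbuf);
                slot->mbuf = new_m;
                /* ... write new_m's dataroom bus address into the
                 * hardware descriptor ... */
        }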

    mbuf flow between multiple programs:

    A forwarding engine built on DPDK can actively allocate mbufs, fill in packets, and call the transmit function to send them. On reception, it can drop packets into a ring and use the ring to deliver them to a designated consumer, enabling cooperation with, for example, a security engine. The process is bidirectional: the security engine may likewise send processed mbufs back to the forwarding engine through a ring.

    The ring here is just one possible implementation. DPDK's lockless ring is simplest and fastest in its single-producer/single-consumer configuration, so when designing a DPDK multi-process scheme, a dedicated ring can be created for each direction of mbuf transfer to avoid having to arbitrate access to a shared ring.
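
    A sketch of such a hand-off over a dedicated single-producer/single-consumer ring; the ring name fwd_to_sec and the helper names are illustrative:

        #include <rte_mbuf.h>
        #include <rte_ring.h>

        /* One ring per transfer direction, created SP/SC because each
         * side has exactly one producer or one consumer. */
        static struct rte_ring *
        create_direction_ring(int socket_id)
        {
                return rte_ring_create("fwd_to_sec", 1024, socket_id,
                                       RING_F_SP_ENQ | RING_F_SC_DEQ);
        }

        /* Forwarding engine: hand a packet to the security engine. */
        static void
        hand_off(struct rte_ring *r, struct rte_mbuf *m)
        {
                if (rte_ring_enqueue(r, m) != 0)
                        rte_pktmbuf_free(m);    /* ring full: drop, don't leak */
        }

        /* Security engine: take the next packet, or NULL if none. */
        static struct rte_mbuf *
        take_packet(struct rte_ring *r)
        {
                void *obj;

                return rte_ring_dequeue(r, &obj) == 0 ?
                       (struct rte_mbuf *)obj : NULL;
        }

    Since named rings live in hugepage shared memory, a secondary process can attach to an existing ring with rte_ring_lookup() instead of creating it.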
