How SKBs work (illustrated sk_buff data operations)

This article walks through the layout and use of the sk_buff data structure in the Linux kernel, including how to reserve header space, add user data, and build protocol headers. It also covers how the non-linear (paged) data area is implemented and how to work with paged data.

http://vger.kernel.org/~davem/skb_data.html

Layout of SKB data area

This first diagram illustrates the layout of the SKB data area and where in that area the various pointers in 'struct sk_buff' point.

The rest of this page will walk through what the SKB data area looks like in a newly allocated SKB, and how to modify those pointers to add headers, add user data, and pop headers.

Also, we will discuss how paged, non-linear data areas are implemented and how to work with them.

 


	skb = alloc_skb(len, GFP_KERNEL);

Layout of freshly allocated SKB

This is what a new SKB looks like right after you allocate it using alloc_skb()

As you can see, the head, data, and tail pointers all point to the beginning of the data buffer. And the end pointer points to the end of it. Note that all of the data area is considered tail room.
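As a minimal sketch (assuming the 2.6-era helpers skb_headroom() and skb_tailroom(), which this page uses throughout), the invariants of a freshly allocated SKB can be expressed as:

	BUG_ON(skb->len != 0);
	BUG_ON(skb_headroom(skb) != 0);
	BUG_ON(skb_tailroom(skb) < len);	/* alloc_skb() may round the size up */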

The length of this SKB is zero; it isn't very interesting since it doesn't contain any packet data at all. Let's reserve some space for protocol headers using skb_reserve()

 


	skb_reserve(skb, header_len);

Layout of SKB after skb_reserve()

This is what a new SKB looks like right after the skb_reserve() call.

Typically, when building output packets, we reserve enough bytes for the maximum amount of header space we think we'll need. Most IPV4 protocols can do this by using the socket value sk->sk_prot->max_header.
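For example, a hedged sketch of that output-path pattern ('payload_len' is an illustrative name, not from the original):

	skb = alloc_skb(sk->sk_prot->max_header + payload_len, GFP_KERNEL);
	if (!skb)
		return -ENOBUFS;
	skb_reserve(skb, sk->sk_prot->max_header);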

When setting up receive packets that an ethernet device will DMA into, we typically call skb_reserve(skb, NET_IP_ALIGN). By default NET_IP_ALIGN is defined to '2'. This makes it so that, after the ethernet header, the protocol header will be aligned on at least a 4-byte boundary. Nearly all of the IPV4 and IPV6 protocol processing assumes that the headers are properly aligned.
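A hedged sketch of the receive-side pattern, as seen in ethernet drivers of that era ('rx_buf_len' is an illustrative name):

	skb = dev_alloc_skb(rx_buf_len + NET_IP_ALIGN);
	if (skb)
		skb_reserve(skb, NET_IP_ALIGN);	/* 2 bytes of padding so the IP header
						 * lands on a 4-byte boundary after the
						 * 14-byte ethernet header */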

Let's now add some user data to the packet.

 


	unsigned char *data = skb_put(skb, user_data_len);
	int err = 0;
	skb->csum = csum_and_copy_from_user(user_pointer, data,
					    user_data_len, 0, &err);
	if (err)
		goto user_fault;

Layout of SKB after skb_put()

This is what a new SKB looks like right after the user data is added.

skb_put() advances 'skb->tail' by the specified number of bytes and increments 'skb->len' by the same amount. This routine must not be called on an SKB that has any paged data. You must also be sure that there is enough tail room in the SKB for the number of bytes you are trying to put. Both of these conditions are checked by skb_put(), and an assertion failure will trigger if either rule is violated.
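Conceptually, skb_put() boils down to something like the following (a simplified sketch of the 2.6-era inline; the real helper uses skb_over_panic() rather than BUG_ON()):

unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
{
	unsigned char *tmp = skb->tail;

	BUG_ON(skb_is_nonlinear(skb));		/* no paged data allowed */
	skb->tail += len;
	skb->len  += len;
	BUG_ON(skb->tail > skb->end);		/* enough tail room? */
	return tmp;				/* start of the newly added area */
}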

The computed checksum is remembered in 'skb->csum'. Now, it's time to build the protocol headers. We'll build a UDP header, then one for IPV4.

 


	struct inet_sock *inet = inet_sk(sk);
	struct flowi *fl = &inet->cork.fl;
	struct udphdr *uh;

	skb->h.raw = skb_push(skb, sizeof(struct udphdr));
	uh = skb->h.uh;
	uh->source = fl->fl_ip_sport;
	uh->dest = fl->fl_ip_dport;
	uh->len = htons(user_data_len);
	uh->check = 0;
	skb->csum = csum_partial((char *)uh,
				 sizeof(struct udphdr), skb->csum);
	uh->check = csum_tcpudp_magic(fl->fl4_src, fl->fl4_dst,
				      user_data_len, IPPROTO_UDP, skb->csum);
	if (uh->check == 0)
		uh->check = -1;

Layout of SKB after pushing UDP headers

This is what a new SKB looks like after we push the UDP header to the front of the SKB.

skb_push() will decrement the 'skb->data' pointer by the specified number of bytes. It will also increment 'skb->len' by that number of bytes as well. The caller must make sure there is enough head room for the push being performed. This condition is checked for by skb_push() and an assertion failure will trigger if this rule is violated.
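Conceptually, skb_push() is the mirror image (again a simplified sketch of the 2.6-era inline; the real helper uses skb_under_panic() rather than BUG_ON()):

unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
{
	skb->data -= len;
	skb->len  += len;
	BUG_ON(skb->data < skb->head);		/* enough head room? */
	return skb->data;			/* start of the new header space */
}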

Now, it's time to tack on an IPV4 header.

 


	struct rtable *rt = inet->cork.rt;
	struct iphdr *iph;

	skb->nh.raw = skb_push(skb, sizeof(struct iphdr));
	iph = skb->nh.iph;
	iph->version = 4;
	iph->ihl = 5;
	iph->tos = inet->tos;
	iph->tot_len = htons(skb->len);
	iph->frag_off = 0;
	iph->id = htons(inet->id++);
	iph->ttl = ip_select_ttl(inet, &rt->u.dst);
	iph->protocol = sk->sk_protocol; /* IPPROTO_UDP in this case */
	iph->saddr = rt->rt_src;
	iph->daddr = rt->rt_dst;
	ip_send_check(iph);

	skb->priority = sk->sk_priority;
	skb->dst = dst_clone(&rt->u.dst);

Layout of SKB after pushing IP header

This is what a new SKB looks like after we push the IPv4 header to the front of the SKB.

 

Just as above for UDP, skb_push() decrements 'skb->data' and increments 'skb->len'. We update the 'skb->nh.raw' pointer to the beginning of the new space, and build the IPv4 header.

This packet is basically ready to be pushed out to the device once we have the necessary information to build the ethernet header (from the generic neighbour layer and ARP).
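As a hedged sketch of that final hand-off (the real UDP/IP output path goes through ip_push_pending_frames() and the netfilter hooks, which are omitted here), the packet ends up being passed down via the route's output method:

	err = dst_output(skb);	/* skb->dst was set above; this eventually
				 * resolves the neighbour, prepends the
				 * ethernet header and calls dev_queue_xmit() */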

 


Things start to get a little bit more complicated once paged data begins to be used. For the most part, the ability to use [page, offset, len] tuples for SKB data came about so that file system file contents could be sent directly over a socket. But, as it turns out, it is sometimes beneficial to use this for normal buffering of process sendmsg() data.

It must be understood that once paged data starts to be used on an SKB, this puts a specific restriction on all future SKB data area operations. In particular, it is no longer possible to do skb_put() operations.

We will now mention that there are actually two length variables associated with an SKB: len and data_len. The latter only comes into play when there is paged data in the SKB. skb->data_len tells how many bytes of paged data there are in the SKB. From this we can derive a few more things:

  • The existence of paged data in an SKB is indicated by skb->data_len being non-zero. This is codified in the helper routine skb_is_nonlinear(), so that is the function you should use to test this.
  • The amount of non-paged data at skb->data can be calculated as skb->len - skb->data_len. Again, there is a helper routine already defined for this called skb_headlen() so please use that.

The main abstraction is that, when there is paged data, the packet begins at skb->data for skb_headlen(skb) bytes, then continues on into the paged data area for skb->data_len bytes. That is why it is illogical to try to do an skb_put() when there is paged data. You have to add data onto the end of the paged data area instead.

 

Each chunk of paged data in an SKB is described by the following structure:

struct skb_frag_struct {
	struct page *page;
	__u16 page_offset;
	__u16 size;
};

There is a pointer to the page (which you must hold a proper reference to), the offset within the page where this chunk of paged data starts, and how many bytes are there.

 

The paged frags are organized into an array in the shared SKB area, defined by this structure:

#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 2)

struct skb_shared_info {
	atomic_t dataref;
	unsigned int	nr_frags;
	unsigned short	tso_size;
	unsigned short	tso_segs;
	struct sk_buff	*frag_list;
	skb_frag_t	frags[MAX_SKB_FRAGS];
};

The nr_frags member states how many frags are active in the frags[] array. The tso_size and tso_segs members are used to convey information to the device driver for TCP segmentation offload. The frag_list is used to maintain a chain of SKBs organized for fragmentation purposes; it is _not_ used for maintaining paged data. Finally, frags[] holds the frag descriptors themselves.
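To make the layout concrete, here is a hedged sketch that walks the linear area and then the frags[] array; it assumes the SKB has no frag_list chaining, so the pieces add up to skb->len:

	struct skb_shared_info *shinfo = skb_shinfo(skb);
	unsigned int total = skb_headlen(skb);
	int i;

	for (i = 0; i < shinfo->nr_frags; i++) {
		skb_frag_t *frag = &shinfo->frags[i];

		/* frag->size bytes of packet data live in frag->page,
		 * starting at frag->page_offset; kmap() the page if you
		 * need a kernel virtual address for them */
		total += frag->size;
	}
	BUG_ON(total != skb->len);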

 

A helper routine is available to help you fill in page descriptors.


void skb_fill_page_desc(struct sk_buff *skb, int i,
			struct page *page,
			int off, int size)

This fills in the i'th page descriptor to point to 'page' at offset 'off' with 'size' bytes of data. It also updates the nr_frags member to be one past 'i'.

 

If you wish to simply extend an existing frag entry by some number of bytes, increment the size member by that amount.
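Putting these together, a hedged sketch of appending one chunk of page data to an SKB might look like this ('page', 'off' and 'chunk' are illustrative names; the caller is responsible for holding the page reference):

	int i = skb_shinfo(skb)->nr_frags;

	get_page(page);
	skb_fill_page_desc(skb, i, page, off, chunk);
	skb->len      += chunk;
	skb->data_len += chunk;
	skb->truesize += chunk;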


With all of the complications imposed by non-linear SKBs, it may seem difficult to inspect areas of a packet in a straightforward way, or to copy data out from a packet into another buffer. This is not the case. There are two helper routines available which make this pretty easy.

First, we have:


 

void *skb_header_pointer(const struct sk_buff *skb, int offset, int len, void *buffer)


You give it the SKB, the offset (in bytes) to the piece of data you are interested in, the number of bytes you want, and a local buffer which is to be used _only_ if the data you are interested in resides in the non-linear data area.

 

You are returned a pointer to the data item, or NULL if you asked for an invalid offset and len parameter. This pointer could be one of two things. First, if what you asked for is directly in the skb->data linear data area, you are given a direct pointer into there. Else, you are given the buffer pointer you passed in.

Code inspecting packet headers on the output path, especially, should use this routine to read and interpret protocol headers. The netfilter layer uses this function heavily.
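A hedged usage sketch, in the style of that netfilter code ('thoff' marks the transport header offset and is an illustrative name):

	struct udphdr _udph, *uh;

	uh = skb_header_pointer(skb, thoff, sizeof(_udph), &_udph);
	if (uh == NULL)
		return 0;	/* the requested bytes are not in the packet */
	/* 'uh' now points either into skb->data or at the on-stack copy */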

For larger pieces of data other than protocol headers, it may be more appropriate to use the following helper routine instead.


int skb_copy_bits(const struct sk_buff *skb, int offset,
		  void *to, int len);

This will copy the specified number of bytes, starting at the specified offset, from the given SKB into the 'to' buffer. This is used for copies of SKB data into kernel buffers, and therefore it is not to be used for copying SKB data into userspace. There is another helper routine for that:


int skb_copy_datagram_iovec(const struct sk_buff *from,
			    int offset, struct iovec *to,
			    int size);

Here, the user's data area is described by the given IOVEC. The other parameters are nearly identical to those passed in to skb_copy_bits() above.
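A hedged sketch of the classic recvmsg()-side use, skipping past the UDP header ('msg' and 'copied' are illustrative names):

	err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr),
				      msg->msg_iov, copied);
	if (err)
		goto out_free;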
