How SKBs work (illustrated sk_buff data operations)

This article walks through the layout and use of the Linux kernel's sk_buff data structure: how head room is reserved, how user data is added, and how protocol headers are built. It also covers how the non-linear (paged) data area is implemented and how to work with paged data.

http://vger.kernel.org/~davem/skb_data.html

Layout of SKB data area

This first diagram illustrates the layout of the SKB data area and where in that area the various pointers in 'struct sk_buff' point.

The rest of this page will walk through what the SKB data area looks like in a newly allocated SKB, and how those pointers are modified to add headers, add user data, and pop headers.

Also, we will discuss how paged (non-linear) data areas are implemented and how to work with them.

 


	skb = alloc_skb(len, GFP_KERNEL);

Layout of freshly allocated SKB

This is what a new SKB looks like right after you allocate it using alloc_skb()

As you can see, the head, data, and tail pointers all point to the beginning of the data buffer. And the end pointer points to the end of it. Note that all of the data area is considered tail room.

The length of this SKB is zero; it isn't very interesting since it doesn't contain any packet data at all. Let's reserve some space for protocol headers using skb_reserve().

 


	skb_reserve(skb, header_len);

Layout of SKB after skb_reserve()

This is what a new SKB looks like right after the skb_reserve() call.

Typically, when building output packets, we reserve enough bytes for the maximum amount of header space we think we'll need. Most IPV4 protocols can do this by using the socket value sk->sk_prot->max_header.

When setting up receive packets that an ethernet device will DMA into, we typically call skb_reserve(skb, NET_IP_ALIGN). By default NET_IP_ALIGN is defined to '2'. This makes it so that, after the ethernet header, the protocol header will be aligned on at least a 4-byte boundary. Nearly all of the IPV4 and IPV6 protocol processing assumes that the headers are properly aligned.

Let's now add some user data to the packet.

 


	unsigned char *data = skb_put(skb, user_data_len);
	int err = 0;
	skb->csum = csum_and_copy_from_user(user_pointer, data,
					    user_data_len, 0, &err);
	if (err)
		goto user_fault;

Layout of SKB after skb_put()

This is what a new SKB looks like right after the user data is added.

skb_put() advances 'skb->tail' by the specified number of bytes and increments 'skb->len' by the same amount. This routine must not be called on an SKB that has any paged data. You must also be sure that there is enough tail room in the SKB for the number of bytes you are trying to put. Both of these conditions are checked by skb_put(), and an assertion failure will trigger if either rule is violated.

The computed checksum is remembered in 'skb->csum'. Now, it's time to build the protocol headers. We'll build a UDP header, then one for IPV4.

 


	struct inet_sock *inet = inet_sk(sk);
	struct flowi *fl = &inet->cork.fl;
	struct udphdr *uh;

	skb->h.raw = skb_push(skb, sizeof(struct udphdr));
	uh = skb->h.uh;
	uh->source = fl->fl_ip_sport;
	uh->dest = fl->fl_ip_dport;
	/* The UDP length field covers the header plus the data. */
	uh->len = htons(user_data_len + sizeof(struct udphdr));
	uh->check = 0;
	skb->csum = csum_partial((char *)uh,
				 sizeof(struct udphdr), skb->csum);
	uh->check = csum_tcpudp_magic(fl->fl4_src, fl->fl4_dst,
				      user_data_len + sizeof(struct udphdr),
				      IPPROTO_UDP, skb->csum);
	if (uh->check == 0)
		uh->check = -1;

Layout of SKB after pushing UDP headers

This is what a new SKB looks like after we push the UDP header to the front of the SKB.

skb_push() will decrement the 'skb->data' pointer by the specified number of bytes and increment 'skb->len' by the same amount. The caller must make sure there is enough head room for the push being performed. This condition is checked by skb_push(), and an assertion failure will trigger if this rule is violated.

Now, it's time to tack on an IPV4 header.

 


	struct rtable *rt = inet->cork.rt;
	struct iphdr *iph;

	skb->nh.raw = skb_push(skb, sizeof(struct iphdr));
	iph = skb->nh.iph;
	iph->version = 4;
	iph->ihl = 5;
	iph->tos = inet->tos;
	iph->tot_len = htons(skb->len);
	iph->frag_off = 0;
	iph->id = htons(inet->id++);
	iph->ttl = ip_select_ttl(inet, &rt->u.dst);
	iph->protocol = sk->sk_protocol; /* IPPROTO_UDP in this case */
	iph->saddr = rt->rt_src;
	iph->daddr = rt->rt_dst;
	ip_send_check(iph);

	skb->priority = sk->sk_priority;
	skb->dst = dst_clone(&rt->u.dst);

Layout of SKB after pushing IP header

This is what a new SKB looks like after we push the IPv4 header to the front of the SKB.

 

Just as above for UDP, skb_push() decrements 'skb->data' and increments 'skb->len'. We update the 'skb->nh.raw' pointer to the beginning of the new space, and build the IPv4 header.

This packet is basically ready to be pushed out to the device once we have the necessary information to build the ethernet header (from the generic neighbour layer and ARP).

 


Things start to get a little bit more complicated once paged data begins to be used. For the most part the ability to use [page, offset, len] tuples for SKB data came about so that file system file contents could be directly sent over a socket. But, as it turns out, it is sometimes beneficial to use this for normal buffering of process sendmsg() data.

It must be understood that once paged data starts to be used on an SKB, this puts a specific restriction on all future SKB data area operations. In particular, it is no longer possible to do skb_put() operations.

We will now mention that there are actually two length variables associated with an SKB: len and data_len. The latter only comes into play when there is paged data in the SKB. skb->data_len tells how many bytes of paged data there are in the SKB. From this we can derive a few more things:

  • The existence of paged data in an SKB is indicated by skb->data_len being non-zero. This is codified in the helper routine skb_is_nonlinear(), so that is the function you should use to test this.
  • The amount of non-paged data at skb->data can be calculated as skb->len - skb->data_len. Again, there is a helper routine already defined for this called skb_headlen() so please use that.

The main abstraction is that, when there is paged data, the packet begins at skb->data for skb_headlen(skb) bytes, then continues on into the paged data area for skb->data_len bytes. That is why it is illogical to try to do an skb_put() when there is paged data. You have to add data onto the end of the paged data area instead.

 

Each chunk of paged data in an SKB is described by the following structure:

struct skb_frag_struct {
	struct page *page;
	__u16 page_offset;
	__u16 size;
};

There is a pointer to the page (which you must hold a proper reference to), the offset within the page where this chunk of paged data starts, and how many bytes are there.

 

The paged frags are organized into an array in the shared SKB area, defined by this structure:

#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 2)

struct skb_shared_info {
	atomic_t dataref;
	unsigned int	nr_frags;
	unsigned short	tso_size;
	unsigned short	tso_segs;
	struct sk_buff	*frag_list;
	skb_frag_t	frags[MAX_SKB_FRAGS];
};

The nr_frags member states how many frags are active in the frags[] array. The tso_size and tso_segs members are used to convey information to the device driver for TCP segmentation offload. The frag_list is used to maintain a chain of SKBs organized for fragmentation purposes; it is _not_ used for maintaining paged data. And finally, frags[] holds the frag descriptors themselves.

 

A helper routine is available to help you fill in page descriptors.


void skb_fill_page_desc(struct sk_buff *skb, int i,
			struct page *page,
			int off, int size)

This fills in the i'th frag descriptor to point to 'page' at offset 'off' with length 'size'. It also updates the nr_frags member to be one past i.

 

If you wish to simply extend an existing frag entry by some number of bytes, increment the size member by that amount.


With all of the complications imposed by non-linear SKBs, it may seem difficult to inspect areas of a packet in a straightforward way, or to copy data out from a packet into another buffer. This is not the case. There are two helper routines available which make this pretty easy.

First, we have:


 

void *skb_header_pointer(const struct sk_buff *skb, int offset, int len, void *buffer)


You give it the SKB, the offset (in bytes) to the piece of data you are interested in, the number of bytes you want, and a local buffer which is to be used _only_ if the data you are interested in resides in the non-linear data area.

 

You are returned a pointer to the data item, or NULL if the offset and len you asked for fall outside the packet data. The returned pointer could be one of two things. First, if what you asked for resides entirely in the skb->data linear data area, you are given a direct pointer into it. Otherwise, you are given the buffer pointer you passed in.

Code inspecting packet headers on the output path, especially, should use this routine to read and interpret protocol headers. The netfilter layer uses this function heavily.

For larger pieces of data other than protocol headers, it may be more appropriate to use the following helper routine instead.


int skb_copy_bits(const struct sk_buff *skb, int offset,
		  void *to, int len);

This will copy the specified number of bytes, starting at the specified offset, of the given SKB into the 'to' buffer. This is used for copies of SKB data into kernel buffers; it is not to be used for copying SKB data into userspace. There is another helper routine for that:


int skb_copy_datagram_iovec(const struct sk_buff *from,
			    int offset, struct iovec *to,
			    int size);

Here, the user's data area is described by the given IOVEC. The other parameters are nearly identical to those passed in to skb_copy_bits() above.
