The TCP SYN Queue (Half-Open Connections) and Accept Queue (Established Connections)

weixin_33810006

License: CC 4.0 BY-SA

Original post: https://my.oschina.net/HY1024/blog/1632386

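Notes

This article comes from the web. I translated it according to my own understanding and added some reading notes of my own.

Original article: https://veithen.github.io/2014/01/01/how-tcp-backlog-works-in-linux.html

For how a TCP connection is established, see "TCP connection establishment and teardown".

Key point: the state transitions on the passive (listening) side are LISTEN (receives the peer's SYN, sends SYN/ACK) -> SYN-RCVD -> (receives the peer's ACK) -> ESTABLISHED.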

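Main text

Because of the three-way handshake, a connection passes through SYN-RCVD on its way from LISTEN to ESTABLISHED, and many requests may arrive in the meantime. There are two ways to implement the backlog queue:

1. Use a single queue whose size is set by the backlog argument of the listen call. When a SYN arrives, the stack replies with SYN/ACK and puts the connection on the queue. When the confirming ACK arrives, the connection is switched to ESTABLISHED and becomes eligible for handover to the application. The queue therefore holds connections in two states, SYN-RCVD and ESTABLISHED, and accept only returns those in ESTABLISHED.
2. Use two queues: a SYN queue (also called the half-open or incomplete connection queue) and an accept queue (also called the completed connection queue). The SYN queue only holds connections in SYN-RCVD. When the confirming ACK arrives, the connection is switched to ESTABLISHED and moved to the accept queue. The accept call consumes connections waiting in the accept queue. In this design the backlog argument of listen only controls the size of the accept queue.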
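To make the difference concrete, here is a minimal sketch of when a new SYN would still be answered under each design. The function names and the numbers are made up for illustration, and real stacks apply more nuanced rules than a single comparison.

```python
def syn_admitted_single_queue(syn_rcvd: int, established: int, backlog: int) -> bool:
    """Option 1: one limit covers SYN-RCVD and ESTABLISHED connections together."""
    return syn_rcvd + established < backlog


def syn_admitted_two_queues(syn_rcvd: int, syn_queue_limit: int) -> bool:
    """Option 2: the SYN queue has its own limit; backlog only bounds the accept queue."""
    return syn_rcvd < syn_queue_limit


# Hypothetical situation: backlog = 4, three connections waiting for accept(),
# one handshake still in flight.
print(syn_admitted_single_queue(syn_rcvd=1, established=3, backlog=4))  # False
print(syn_admitted_two_queues(syn_rcvd=1, syn_queue_limit=128))         # True
```

With a single limit, connections the application has not accepted yet take slots away from new handshakes. With two queues they do not, at least in this simplified model.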

BSD-derived systems use the first approach. That is, once the queue (which holds both SYN-RCVD and ESTABLISHED connections) reaches the backlog limit, the system stops answering new SYNs with SYN/ACK. The TCP implementation usually just drops the SYN packet (rather than replying with a RST), so the client, having received no SYN/ACK, retransmits after a timeout. This is what is described in section 14.5, listen Backlog Queue, in W. Richard Stevens’ classic textbook TCP/IP Illustrated, Volume 3.

Note that Stevens actually explains that BSD does use two queues, but their combined size is bounded by a limit derived from the backlog argument (not necessarily exactly equal to it). Logically, BSD behaves as described in option 1:

The queue limit applies to the sum of […] the number of entries on the incomplete connection queue […] and […] the number of entries on the completed connection queue […].

Things are different on Linux. The man page for the listen call says:

The behavior of the backlog argument on TCP sockets changed with Linux 2.2. Now it specifies the queue length for completely established sockets waiting to be accepted, instead of the number of incomplete connection requests. The maximum length of the queue for incomplete sockets can be set using /proc/sys/net/ipv4/tcp_max_syn_backlog.

Since Linux 2.2, the backlog argument sets the size limit of the accept queue, while the size of the SYN queue is set by /proc/sys/net/ipv4/tcp_max_syn_backlog.
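To check these limits on a given machine, the files under /proc/sys can be read directly. The helper below is a small sketch of my own (read_sysctl_int is not a standard function). It returns None when a file is missing, for example on a non-Linux system.

```python
from pathlib import Path
from typing import Optional


def read_sysctl_int(name: str, root: str = "/proc/sys") -> Optional[int]:
    """Return the first integer stored in a sysctl file, or None if it cannot be read."""
    path = Path(root) / name.replace(".", "/")
    try:
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


# Settings that are mentioned in this article.
for key in (
    "net.ipv4.tcp_max_syn_backlog",
    "net.core.somaxconn",
    "net.ipv4.tcp_abort_on_overflow",
    "net.ipv4.tcp_synack_retries",
):
    print(key, "=", read_sysctl_int(key))
```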

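So here is the question: what happens when the accept queue is full and a connection receives its ACK and needs to move from the SYN queue to the accept queue? The relevant code is in the tcp_check_req function in net/ipv4/tcp_minisocks.c: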

child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
if (child == NULL)
    goto listen_overflow;

The code above actually calls tcp_v4_syn_recv_sock in net/ipv4/tcp_ipv4.c, which contains, among other things:

if (sk_acceptq_is_full(sk))
    goto exit_overflow;

This is where the accept queue is checked. The code after the exit_overflow label does some cleanup, updates the ListenOverflows and ListenDrops counters in /proc/net/netstat, and returns NULL. That in turn triggers the listen_overflow code in tcp_check_req:

listen_overflow:
        if (!sysctl_tcp_abort_on_overflow) {
                inet_rsk(req)->acked = 1;
                return NULL;
        }

This means that unless /proc/sys/net/ipv4/tcp_abort_on_overflow is set to 1 (in which case a RST packet is sent), the implementation does nothing at all: it simply ignores the ACK.
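The three possible outcomes for the final ACK can be summarised in a toy decision function. This is only a model of the behaviour described above, not kernel code, and the names are my own.

```python
def on_final_ack(accept_queue_full: bool, abort_on_overflow: bool) -> str:
    """Model what happens to the handshake-completing ACK on the listening side."""
    if not accept_queue_full:
        return "move to accept queue (ESTABLISHED)"
    if abort_on_overflow:
        return "send RST"
    return "ignore ACK, stay in SYN-RCVD"


for full in (False, True):
    for abort in (False, True):
        print(f"full={full!s:5} abort_on_overflow={abort!s:5} -> {on_final_ack(full, abort)}")
```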

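In short, on Linux, when an ACK arrives while the accept queue is full, the packet is ignored. This sounds odd, but remember that a timer is associated with the SYN-RCVD state: if the ACK is not received (an ignored ACK counts as not received), the passive side retransmits the SYN/ACK after a timeout. The number of retries is set by /proc/sys/net/ipv4/tcp_synack_retries, and the interval between retries follows an exponential backoff algorithm.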
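As a rough illustration of the backoff, the sketch below computes when each SYN/ACK retransmission would go out. It assumes an initial timeout of 1 second that doubles on every retry, which is my simplification. The kernel's actual timer handling is more involved.

```python
def synack_retransmit_times(retries: int, initial_timeout: float = 1.0) -> list:
    """Seconds after the first SYN/ACK at which each retransmission is sent."""
    times, elapsed, timeout = [], 0.0, initial_timeout
    for _ in range(retries):
        elapsed += timeout      # wait out the current timeout
        times.append(elapsed)
        timeout *= 2            # exponential backoff: double the wait each time
    return times


print(synack_retransmit_times(5))  # [1.0, 3.0, 7.0, 15.0, 31.0]
```

These values are close to the SYN/ACK timestamps in the packet trace in this article (about 1.2, 3.4, 7.6, 15.6 and 31.6 seconds).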

The following packet trace shows a client connecting to a socket that has reached its backlog limit:

  0.000  127.0.0.1 -> 127.0.0.1  TCP 74 53302 > 9999 [SYN] Seq=0 Len=0
  0.000  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  0.000  127.0.0.1 -> 127.0.0.1  TCP 66 53302 > 9999 [ACK] Seq=1 Ack=1 Len=0
  0.000  127.0.0.1 -> 127.0.0.1  TCP 71 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  0.207  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  0.623  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  1.199  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  1.199  127.0.0.1 -> 127.0.0.1  TCP 66 [TCP Dup ACK 6#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
  1.455  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  3.123  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  3.399  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  3.399  127.0.0.1 -> 127.0.0.1  TCP 66 [TCP Dup ACK 10#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
  6.459  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  7.599  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  7.599  127.0.0.1 -> 127.0.0.1  TCP 66 [TCP Dup ACK 13#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
 13.131  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
 15.599  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
 15.599  127.0.0.1 -> 127.0.0.1  TCP 66 [TCP Dup ACK 16#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
 26.491  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
 31.599  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
 31.599  127.0.0.1 -> 127.0.0.1  TCP 66 [TCP Dup ACK 19#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
 53.179  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
106.491  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
106.491  127.0.0.1 -> 127.0.0.1  TCP 54 9999 > 53302 [RST] Seq=1 Len=0

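Because the client (the active side) receives multiple SYN/ACK packets, it assumes that its earlier ACK was lost and resends it (the lines marked TCP Dup ACK above). If the server application drains the backlog before the SYN/ACK retry limit is reached (for example by accepting a connection and freeing a slot in the accept queue), the stack will process one of the resent ACKs instead of dropping or rejecting it, move the connection from SYN-RCVD to ESTABLISHED, and add it to the accept queue. Otherwise the client eventually receives a RST (the last line of the trace above).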
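The retransmission pattern can be pulled out of the trace with a few lines of Python. The excerpt below copies only some lines of the trace above, and the parsing assumes the column layout shown there.

```python
# Excerpt of the trace above: the initial handshake packets plus every SYN/ACK.
TRACE = """\
  0.000  127.0.0.1 -> 127.0.0.1  TCP 74 53302 > 9999 [SYN] Seq=0 Len=0
  0.000  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  0.000  127.0.0.1 -> 127.0.0.1  TCP 66 53302 > 9999 [ACK] Seq=1 Ack=1 Len=0
  1.199  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  3.399  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  7.599  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
 15.599  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
 31.599  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
"""


def synack_times(trace: str) -> list:
    """Timestamps (first column) of all lines flagged [SYN, ACK]."""
    return [float(line.split()[0]) for line in trace.splitlines() if "[SYN, ACK]" in line]


times = synack_times(TRACE)
gaps = [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]
print("SYN/ACK sent at:", times)
print("gaps between them:", gaps)
```

After the first retransmission the gaps roughly double each time, which is the exponential backoff at work.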

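The trace also shows something else of interest. From the client's point of view, the connection is established as soon as it receives the first SYN/ACK: it replies with an ACK and switches to ESTABLISHED. It can already send data at this point, but on the server side the connection is still in the SYN queue, so the data cannot be acknowledged and will have to be retransmitted. Isn't that wasteful? Fortunately, TCP slow start limits the number of segments sent during this phase, which keeps the waste down.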

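Consider another case: the client is in ESTABLISHED, but the server has, for whatever reason, closed the connection without notifying the client. This is a half-open connection, and cleaning it up is the job of TCP's keepalive timer. The keepalive parameters can be set in /etc/sysctl.conf:

net.ipv4.tcp_keepalive_time = 7200   # idle time before the first keepalive probe (seconds)
net.ipv4.tcp_keepalive_probes = 9    # number of probes sent before giving up
net.ipv4.tcp_keepalive_intvl = 75    # interval between probes (seconds)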
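With those values, a dead peer is only noticed after roughly idle time + probes × interval. The sketch below does that arithmetic and also shows how a program might switch keepalive on for one socket. The per-socket option names (TCP_KEEPIDLE and friends) are not available on every platform, so the helper skips any that are missing. Both function names are my own.

```python
import socket


def keepalive_detection_seconds(idle: int, probes: int, interval: int) -> int:
    """Approximate time until a silent peer is declared dead."""
    return idle + probes * interval


def enable_keepalive(sock: socket.socket, idle: int = 7200, probes: int = 9, interval: int = 75) -> None:
    """Turn keepalive on for one socket and override the timers where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for opt_name, value in (("TCP_KEEPIDLE", idle), ("TCP_KEEPCNT", probes), ("TCP_KEEPINTVL", interval)):
        opt = getattr(socket, opt_name, None)   # None on platforms without this option
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


print(keepalive_detection_seconds(7200, 9, 75))  # 7875
```

Keepalive probes are only sent on sockets that have SO_KEEPALIVE enabled, so the sysctl values alone do not protect an application that never sets that option.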

NOTE: The original is more specific here, but since a half-open connection can arise in various ways I phrased it more generally. The original reads: On the other hand, if the client first waits for data from the server and the server never reduces the backlog, then the end result is that on the client side, the connection is in state ESTABLISHED, while on the server side, the connection is considered CLOSED. This means that we end up with a half-open connection!

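There is one more aspect we have not discussed yet. According to the original article, the listen man page suggests that every SYN packet results in the connection being added to the SYN queue, unless that queue is full. (I could not find this passage myself.) That is not quite accurate. Look at this piece of code from the tcp_v4_conn_request function, which handles SYN packets, in net/ipv4/tcp_ipv4.c: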

        /* Accept backlog is full. If we have already queued enough
         * of warm entries in syn queue, drop request. It is better than
         * clogging syn queue with openreqs with exponentially increasing
         * timeout.
         */
        if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
                NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
                goto drop;
        }

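This means that when the accept queue is full, the kernel imposes a limit on the rate at which SYN packets are accepted. If too many SYNs arrive, some of them are dropped, and it is up to the client to resend the SYN. Since this only happens once the accept queue is full, the end result is similar to BSD, where backlog bounds the combined length of the SYN queue and the accept queue.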
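The ListenOverflows and ListenDrops counters mentioned earlier are a convenient way to see whether a machine is hitting these paths. Below is a sketch of a parser for /proc/net/netstat, which stores each group of counters as a line of names followed by a line of values. The sample text is made up for illustration.

```python
def parse_netstat(text: str) -> dict:
    """Parse /proc/net/netstat-style text into {"TcpExt": {"ListenDrops": 9, ...}, ...}."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    stats = {}
    # Lines come in pairs: a header line with counter names, then a line with values.
    for names, values in zip(lines[0::2], lines[1::2]):
        prefix = names[0].rstrip(":")
        stats[prefix] = {name: int(value) for name, value in zip(names[1:], values[1:])}
    return stats


def listen_counters(path: str = "/proc/net/netstat") -> dict:
    """Read the two listen-queue counters from a Linux system (None if absent)."""
    with open(path) as fh:
        tcp_ext = parse_netstat(fh.read()).get("TcpExt", {})
    return {name: tcp_ext.get(name) for name in ("ListenOverflows", "ListenDrops")}


# Hypothetical sample; a real file has many more columns.
SAMPLE = (
    "TcpExt: SyncookiesSent ListenOverflows ListenDrops\n"
    "TcpExt: 0 7 9\n"
    "IpExt: InOctets\n"
    "IpExt: 100\n"
)
print(parse_netstat(SAMPLE)["TcpExt"])
```

On a Linux host, calling listen_counters() twice a few seconds apart and comparing the numbers shows whether the queues are overflowing right now.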

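Overall, the Linux design is superior to the BSD one. Here is an interesting point made by Stevens, the author of TCP/IP Illustrated:

The backlog limit can be reached when the completed connection queue fills up (for example, when the server process or the server host is so busy that the process cannot call accept fast enough to empty the queue), or when the incomplete connection queue fills up. The latter is the problem a server faces when the round-trip time between client and server is long compared to the arrival rate of new connection requests, because a new SYN occupies an entry on this queue for one round-trip time. […]

The completed connection queue is empty most of the time, because as soon as a connection is placed on it, the server's accept call returns and takes the connection off the queue.

Stevens' solution is simply to increase the backlog. The problem is that after raising backlog, our application may come under pressure from handling more connections, and network conditions such as the round-trip time can also cause the request queue to build up. The Linux implementation effectively separates these two concerns. The application only needs to care about backlog: it tunes backlog to its own processing capacity and calls accept fast enough to keep the accept queue from filling up. The system administrator looks after /proc/sys/net/ipv4/tcp_max_syn_backlog and tunes it according to network conditions to reduce the buildup in the SYN queue.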

Note: the above is my own reading of the passage. The original says:

The solution suggested by Stevens is simply to increase the backlog. The problem with this is that it assumes that an application is expected to tune the backlog not only taking into account how it intents to process newly established incoming connections, but also in function of traffic characteristics such as the round-trip time. The implementation in Linux effectively separates these two concerns: the application is only responsible for tuning the backlog such that it can call accept fast enough to avoid filling the accept queue; a system administrator can then tune /proc/sys/net/ipv4/tcp_max_syn_backlog based on traffic characteristics.

Reading notes

On a server (CentOS Linux release 7.3.1611 - Linux version 3.10.0-514.6.1.el7.x86_64) I checked the man page for listen. It contains the following:

The behavior of the backlog argument on TCP sockets changed with Linux 2.2. Now it specifies the queue length for completely established sockets waiting to be accepted, instead of the number of incomplete connection requests. The maximum length of the queue for incomplete sockets can be set using /proc/sys/net/ipv4/tcp_max_syn_backlog. When syncookies are enabled there is no logical maximum length and this setting is ignored. See tcp(7) for more information.

If the backlog argument is greater than the value in /proc/sys/net/core/somaxconn, then it is silently truncated to that value; the default value in this file is 128.

In kernels before 2.4.25, this limit was a hard coded value, SOMAXCONN, with the value 128.
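The truncation rule quoted above amounts to taking a minimum. A tiny sketch, with 128 as the default taken from the man page text and a function name of my own:

```python
def effective_backlog(requested: int, somaxconn: int = 128) -> int:
    """Backlog actually applied after the silent truncation to somaxconn."""
    return min(requested, somaxconn)


print(effective_backlog(1024))  # 128
print(effective_backlog(64))    # 64
```

So passing a large value to listen() has no effect unless /proc/sys/net/core/somaxconn is raised as well.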

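I noticed an extra sentence: "When syncookies are enabled there is no logical maximum length and this setting is ignored." In other words, with syncookies on, the SYN queue has no fixed maximum length and the tcp_max_syn_backlog setting no longer applies.

Then I looked up the description of syncookies in the tcp man page: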

tcp_syncookies (Boolean; since Linux 2.2)

Enable TCP syncookies. The kernel must be compiled with CONFIG_SYN_COOKIES. Send out syncookies when the syn backlog queue of a socket overflows. The syncookies feature attempts to protect a socket from a SYN flood attack. This should be used as a last resort, if at all. This is a violation of the TCP protocol, and conflicts with other areas of TCP such as TCP extensions. It can cause problems for clients and relays. It is not recommended as a tuning mechanism for heavily loaded servers to help with overloaded or misconfigured conditions. For recommended alternatives see tcp_max_syn_backlog, tcp_synack_retries, and tcp_abort_on_overflow.

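This seems to be about defending against SYN flood attacks, so I will leave it aside for now.

Then on my own machine (macOS 10.13.3 - Darwin bogon 17.4.0) I checked the man page for listen. It contains the following: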

The backlog parameter defines the maximum length for the queue of pending connections. If a connection request arrives with the queue full, the client may receive an error with an indication of ECONNREFUSED. Alternatively, if the underlying protocol supports retransmission, the request may be ignored so that retries may succeed.

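Hmm... here is my own understanding:

"queue of pending connections" refers to the SYN queue and the accept queue together. As noted above, on BSD backlog bounds the combined size of the two queues, which "behave as a single queue", so the man page uses the vague term "queue of pending connections".
"the client may receive an error with an indication of ECONNREFUSED" means the client may receive a RST packet that rejects the connection.
"if the underlying protocol supports retransmission, the request may be ignored so that retries may succeed." means that if the underlying protocol supports retransmission, the request is ignored and a retransmitted request may then succeed.

As for how retransmission and RST are actually handled on BSD (macOS), even without reading the code one can run an experiment and capture the packets.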
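From the client side, the two outcomes the man page describes look different: a RST surfaces as ECONNREFUSED, while a silently ignored request eventually surfaces as a timeout. A small probe along these lines can tell them apart. classify_connect is a name I made up, and the timeout value is arbitrary.

```python
import socket


def classify_connect(host: str, port: int, timeout: float = 1.0) -> str:
    """Try one TCP connect and report how it ended: connected, refused or timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "connected"
    except ConnectionRefusedError:      # the peer answered with a RST
        return "refused"
    except socket.timeout:              # no answer before the timeout expired
        return "timeout"
    finally:
        sock.close()
```

Keep in mind that "connected" only reflects the client's view. As the trace earlier shows, the client can consider the connection established while the server still has it in the SYN queue.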

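Two questions

Readers raised some questions in the comments of the original article, and two of them seem worth discussing.

Question 1

One commenter described an experiment of his own: he wrote a simple server in Python with the backlog argument of listen() set to 1, ran it on both macOS and Linux, and tested it with nc. On Linux, two connections were fully established and the third kept retransmitting its ACK. On macOS, only one was established and the second kept retransmitting its ACK. Why is that?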
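I do not have the commenter's script, but an experiment in the same spirit might look like the sketch below: a listener that never calls accept, and a few clients connecting to it one after another. The function name, the number of attempts and the timeout are my own choices.

```python
import socket


def count_completed_connects(backlog: int, attempts: int = 4, timeout: float = 0.5) -> int:
    """Listen without ever accepting and count how many client connect() calls return."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))       # port 0: let the OS pick a free port
    server.listen(backlog)              # accept() is never called, so the queue only fills
    port = server.getsockname()[1]
    clients, completed = [], 0
    try:
        for _ in range(attempts):
            client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            client.settimeout(timeout)
            clients.append(client)      # keep it open so it keeps occupying its queue slot
            try:
                client.connect(("127.0.0.1", port))
                completed += 1
            except OSError:             # timeout or refusal
                pass
    finally:
        for client in clients:
            client.close()
        server.close()
    return completed


if __name__ == "__main__":
    print(count_completed_connects(backlog=1))
```

The count is the client-side view only, and the result depends on the operating system and kernel version. To see which connections the server really considers established, and which packets are retransmitted, run a packet capture alongside it as the commenter did.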

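Another reader was also interested in this question, wrote it up and posted it on StackOverflow.

Question link: https://stackoverflow.com/questions/44237026/tcp-backlog-works-not-as-expected-in-linux

I will leave this one for now and discuss it later.

Question 2

Further down, another commenter raised a very specific question: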

Is it just me, or the new implementation using two queues completely breaks the backpressure?

Imagine your service is under heavy load and cannot keep up with incoming requests. For me the most natural thing to do in this case is to start refusing some of the connections, allowing your load balancer to redispatch the requests to another server in the pool.

Implementation with two queues essentially takes away this ability from applications. Kernel SYN-ACKs all connections for you unconditionally, even if your listen backlog is already full.

The way tcp_abort_on_overflow option works makes no sense to me. When it's on, kernel resets the connection only after it moves to ESTABLISHED state on the client side. So by the time RST is sent client may have already started transmitting the data. I think this is a huge problem. Load balancers such as HAProxy do not even allow redispatching requests to another server if connection to the first server was successfully established.

With tcp_abort_on_overflow option off the situation is even worse. Unconditional SYN-ACK means that client almost never will hit the connect timeout. So all the time spent by connection waiting for a free slot in listen backlog while kernel just ignores it is accounted as "server timeout", which can be really high depending on application.

Tuning tcp_max_syn_backlog to achieve this kind of behaviour doesn't seem like a good idea.

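The blog author replied that you should consider a balancing policy on the load balancer itself, such as least connections: if a machine is heavily loaded, stop sending requests to it.

The blog author hit the nail on the head, but I think this commenter's understanding is also off in a few places.

This is exactly the problem Stevens described above. Increase backlog? The machine does not have the capacity, which is why there is a load balancer in the first place.

tcp_abort_on_overflow controls what happens when an ACK arrives while the accept queue is full. Now suppose the server is under extreme load and both the SYN queue and the accept queue are full. Clients whose connections sit in the SYN queue keep resending their ACK and, with the option on, are rejected with a RST. But what about those that cannot even get into the SYN queue? The server cannot handle their SYNs and simply drops them. Because these clients never receive a SYN/ACK, they retransmit the SYN after a timeout, and what happens next depends on the client's retransmission policy.

Tuning tcp_max_syn_backlog helps even less. Lowering it achieves nothing except that client SYNs are dropped earlier. Raising it does not get connections into the accept queue either: they just die a little later in the SYN queue.

Finally, the commenter wrote an iptables rule: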

iptables -A INPUT -p tcp --syn --dport 8080 \
-m connlimit --connlimit-mask 0 --connlimit-above 100 \
-j REJECT --reject-with tcp-reset

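Beyond 100 connections, new ones are answered with tcp-reset, that is, a RST packet. That is one kind of solution, I suppose...

Still, I think the blog author's approach is better. What is a load balancer for, after all? To spread requests, so getting the balancing policy right saves a lot of trouble. The iptables rule has to be configured on every machine, and even with a script something is bound to be missed: one day a node gets added without the rule, and there is your problem. The rule does allow some freedom, since each machine can get its own connection limit according to its capacity, but a weighted balancing algorithm on the load balancer achieves the same thing.

I will leave these questions flagged for later and come back to them with more reading and experiments (low priority, probably a "someday" item), and set them aside here. With this article, the TCP SYN queue, the accept queue and the backlog argument are more or less sorted out. Okay then.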
