Notes on the TIME_WAIT Problem (repost)

This article takes a close look at the TIME_WAIT state of TCP connections: why it exists, the problems it can cause (such as client-side port exhaustion), and common remedies involving port reuse and connection management, together with practical tuning advice.
Reposted from http://wiki.apache.org/HttpComponents/FrequentlyAskedConnectionManagementQuestions
1. Connections in TIME_WAIT State
After running your HTTP application, you use the netstat command and detect a lot of connections in state TIME_WAIT. Now you wonder why these connections are not cleaned up.

1.1. What is the TIME_WAIT State?
The TIME_WAIT state is a protection mechanism in TCP. The side that closes a socket connection orderly will keep the connection in state TIME_WAIT for some time, typically between 1 and 4 minutes. This happens after the connection is closed. It does not indicate a cleanup problem. The TIME_WAIT state protects against loss of data and data corruption. It is there to help you. For technical details, have a look at the Unix Socket FAQ, section 2.7.

1.2. Some Connections Go To TIME_WAIT, Others Not
If a connection is closed orderly by your application, it will go to the TIME_WAIT state on your side. If a connection is closed orderly by the server, the server side goes to TIME_WAIT and your client side does not. If a connection is reset or otherwise dropped by your application in a non-orderly fashion, it will not go to TIME_WAIT.

Unfortunately, it will not always be obvious to you whether a connection is closed orderly or not. This is because connections are pooled and kept open for re-use by default. HttpClient 3.x, HttpClient 4, and also the standard Java HttpURLConnection do that for you. Most applications will simply execute requests, then read from the response stream, and finally close that stream.
Closing the response stream is not the same thing as closing the connection! Closing the response stream returns the connection to the pool, but it will be kept open if possible. This saves a lot of time if you send another request to the same host within a few seconds, or even minutes.

Connection pools have a limited number of connections. A pool may have 5 connections, or 100, or maybe only 1. When you send a request to a host, and there is no open connection to that host in the pool, a new connection needs to be opened. But if the pool is already full, an open connection has to be closed before a new one can be opened. In this case, the old connection will be closed orderly and go to the TIME_WAIT state.
When your application exits and the JVM terminates, the open connections in the pools will not be closed orderly. They are reset or cancelled, without going to TIME_WAIT. To avoid this, you should call the shutdown method of the connection pools your application is using before exiting. The standard Java HttpURLConnection has no public method to shut down its connection pool.
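The pooling behavior described above (evicting and orderly closing the oldest connection when the pool is full, and shutting the pool down before the JVM exits) can be sketched with plain sockets. This is a hypothetical minimal pool for illustration, not HttpClient's actual implementation:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal sketch of a bounded connection pool: when the pool is full,
// the oldest connection is closed orderly (close() sends a FIN, so the
// socket on this side will pass through TIME_WAIT) to make room.
public class BoundedPoolSketch {
    private final Deque<Socket> pool = new ArrayDeque<>();
    private final int maxSize;

    public BoundedPoolSketch(int maxSize) { this.maxSize = maxSize; }

    public void add(Socket s) throws IOException {
        if (pool.size() == maxSize) {
            pool.removeFirst().close();   // orderly close -> TIME_WAIT
        }
        pool.addLast(s);
    }

    public int size() { return pool.size(); }

    // Close everything orderly before the JVM exits, so no connection
    // is reset when the process dies (cf. the shutdown advice above).
    public void shutdown() throws IOException {
        for (Socket s : pool) s.close();
        pool.clear();
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0)) {   // throwaway local server
            BoundedPoolSketch pool = new BoundedPoolSketch(2);
            List<Socket> acceptedSides = new ArrayList<>();
            for (int i = 0; i < 3; i++) {
                Socket s = new Socket("127.0.0.1", server.getLocalPort());
                acceptedSides.add(server.accept());
                pool.add(s);   // the third add() evicts and closes the first socket
            }
            System.out.println("pool size: " + pool.size());
            pool.shutdown();                            // orderly close before exit
            for (Socket a : acceptedSides) a.close();   // tidy up the server side
        }
    }
}
```

The eviction is exactly the case the FAQ describes: the evicted socket goes through the normal FIN handshake and therefore lingers in TIME_WAIT.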

1.3. Running Out Of Ports
Some applications open and orderly close a lot of connections within a short time, for example when load-testing a server. A connection in state TIME_WAIT will prevent that port number from being re-used for another connection. That is not an error, it is the purpose of TIME_WAIT.
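To make the port consumption concrete, here is a small local demonstration (a sketch using a throwaway server on localhost): each short-lived connection that the client closes orderly occupies a fresh ephemeral port, which then sits in TIME_WAIT.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.HashSet;
import java.util.Set;

// Opens and orderly closes several short-lived connections to a local
// server and records the client-side (ephemeral) port of each. Because
// the client performs the active close, each of these ports enters
// TIME_WAIT and cannot be handed out again until 2*MSL has elapsed.
public class EphemeralPortDemo {
    public static Set<Integer> runDemo(int connections) throws IOException {
        Set<Integer> localPorts = new HashSet<>();
        try (ServerSocket server = new ServerSocket(0)) {   // 0 = any free port
            for (int i = 0; i < connections; i++) {
                Socket client = new Socket("127.0.0.1", server.getLocalPort());
                Socket accepted = server.accept();
                localPorts.add(client.getLocalPort());
                client.close();     // active close: this port goes to TIME_WAIT
                accepted.close();
            }
        }
        return localPorts;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("distinct ephemeral ports used: " + runDemo(5).size());
    }
}
```

Run a few thousand of these in a loop and you can watch the TIME_WAIT entries accumulate with netstat; once the ephemeral range is exhausted, new connections start failing.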

TCP is configured at the operating system level, not through Java. Your first action should be to increase the number of ephemeral ports on the machine. Windows in particular has a rather low default for the ephemeral ports. The PerformanceWiki has tuning tips for the common operating systems; have a look at the respective Network section.
Only if increasing the number of ephemeral ports does not solve your problem, you should consider decreasing the duration of the TIME_WAIT state. You probably have to reduce the maximum lifetime of IP packets, as the duration of TIME_WAIT is typically twice that timespan to allow for a round-trip delay. Be aware that this will affect all applications running on the machine. Don't ask us how to do it, we're not the experts for network tuning.

There are some ways to deal with the problem at the application level. One way is to send a "Connection: close" header with each request. That will tell the server to close the connection, so it goes to TIME_WAIT on the other side. Of course this also disables the keep-alive feature of connection pooling and thereby degrades performance. If you are running load tests against a server, this atypical behavior of your application may distort the test results.
Another way is to not close connections orderly at all. There is a trick of setting SO_LINGER to a special value, which causes the connection to be reset instead of orderly closed. Note that the HttpClient API does not support this directly; you'll have to extend or modify some classes to implement this hack.
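The SO_LINGER trick looks roughly like this in plain Java (a sketch against a throwaway local server; as stressed elsewhere in this document, resetting instead of closing discards TCP's protection against old duplicate segments):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Demonstrates the SO_LINGER trick: linger enabled with a timeout of
// zero makes close() send an RST instead of a FIN, so the socket skips
// TIME_WAIT entirely. This is a hack, not the recommended behavior.
public class LingerResetSketch {
    public static int configure(Socket s) throws IOException {
        s.setSoLinger(true, 0);      // linger on, zero-second timeout
        return s.getSoLinger();      // returns 0 when enabled with timeout 0
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket accepted = server.accept()) {
            System.out.println("SO_LINGER value: " + configure(client));
        }   // closing the client here now resets the connection instead of FIN-closing it
    }
}
```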
Yet another way is to re-use ports that are still blocked by a connection in TIME_WAIT. You can do that by specifying the SO_REUSEADDR option when opening a socket. Java 1.4 introduced the method Socket.setReuseAddress for this purpose. You will have to extend or modify some classes of HttpClient for this too, but at least it's not a hack.
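In Java the SO_REUSEADDR approach looks roughly as follows (a sketch; the helper name is made up, and note that the option must be enabled before the socket is bound):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch of opening a client socket with SO_REUSEADDR (Java 1.4+).
// The option must be set before bind(); it permits binding a local
// port that is still occupied by a connection lingering in TIME_WAIT.
public class ReuseAddressSketch {
    public static Socket newReusableSocket(int localPort) throws IOException {
        Socket s = new Socket();                     // unconnected socket
        s.setReuseAddress(true);                     // must precede bind()
        s.bind(new InetSocketAddress(localPort));    // 0 = any ephemeral port
        return s;                                    // connect(...) would follow
    }

    public static void main(String[] args) throws IOException {
        try (Socket s = newReusableSocket(0)) {
            System.out.println("SO_REUSEADDR: " + s.getReuseAddress());
            System.out.println("bound to local port " + s.getLocalPort());
        }
    }
}
```

Even with SO_REUSEADDR, the operating system will still refuse to establish a connection whose full 4-tuple (local and remote address and port) collides with one in TIME_WAIT, so this only helps when the new connection differs in at least one element.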

1.4. Further Reading
Unix Socket FAQ

java.net.Socket.setReuseAddress

Discussion on the HttpClient mailing list in December 2007

PerformanceWiki

netstat command line tool


http://www.softlab.ntua.gr/facilities/documentation/unix/unix-socket-faq/unix-socket-faq-2.html#ss2.7


2.7 Please explain the TIME_WAIT state.
Remember that TCP guarantees all data transmitted will be delivered, if at all possible. When you close a socket, the closing side goes into the TIME_WAIT state, just to be really really sure that all the data has gone through. When a socket is closed, both sides agree, by sending messages to each other, that they will send no more data. This, it seemed to me, was good enough, and after the handshaking is done the socket should be closed. The problem is two-fold. First, there is no way to be sure that the last ACK was communicated successfully. Second, there may be "wandering duplicates" left on the net that must be dealt with if they are delivered.

Andrew Gierth ( andrewg@microlise.co.uk) helped to explain the closing sequence in the following usenet posting:

Assume that a connection is in ESTABLISHED state, and the client is about to do an orderly release. The client's sequence no. is Sc, and the server's is Ss. The pipe is empty in both directions.

    Client                                                  Server
    ======                                                  ======
    ESTABLISHED                                             ESTABLISHED
    (client closes)
    ESTABLISHED                                             ESTABLISHED
                 <CTL=FIN+ACK><SEQ=Sc><ACK=Ss> ------->>
    FIN_WAIT_1
                 <<-------- <CTL=ACK><SEQ=Ss><ACK=Sc+1>
    FIN_WAIT_2                                              CLOSE_WAIT
                 <<-------- <CTL=FIN+ACK><SEQ=Ss><ACK=Sc+1> (server closes)
                                                            LAST_ACK
                 <CTL=ACK><SEQ=Sc+1><ACK=Ss+1> ------->>
    TIME_WAIT                                               CLOSED
    (2*msl elapses...)
    CLOSED
Note: the +1 on the sequence numbers is because the FIN counts as one byte of data. (The above diagram is equivalent to fig. 13 from RFC 793).

Now consider what happens if the last of those packets is dropped in the network. The client has done with the connection; it has no more data or control info to send, and never will have. But the server does not know whether the client received all the data correctly; that's what the last ACK segment is for. Now the server may or may not care whether the client got the data, but that is not an issue for TCP; TCP is a reliable protocol, and must distinguish between an orderly connection close where all data is transferred, and a connection abort where data may or may not have been lost.

So, if that last packet is dropped, the server will retransmit it (it is, after all, an unacknowledged segment) and will expect to see a suitable ACK segment in reply. If the client went straight to CLOSED, the only possible response to that retransmit would be a RST, which would indicate to the server that data had been lost, when in fact it had not been.

(Bear in mind that the server's FIN segment may, additionally, contain data.)

DISCLAIMER: This is my interpretation of the RFCs (I have read all the TCP-related ones I could find), but I have not attempted to examine implementation source code or trace actual connections in order to verify it. I am satisfied that the logic is correct, though.

More commentary from Vic:

The second issue was addressed by Richard Stevens ( rstevens@noao.edu, author of "Unix Network Programming", see 1.5 Where can I get source code for the book [book title]?). I have put together quotes from some of his postings and email which explain this. I have brought together paragraphs from different postings, and have made as few changes as possible.

From Richard Stevens ( rstevens@noao.edu):

If the duration of the TIME_WAIT state were just to handle TCP's full-duplex close, then the time would be much smaller, and it would be some function of the current RTO (retransmission timeout), not the MSL (the packet lifetime).

A couple of points about the TIME_WAIT state.

The end that sends the first FIN goes into the TIME_WAIT state, because that is the end that sends the final ACK. If the other end's FIN is lost, or if the final ACK is lost, having the end that sends the first FIN maintain state about the connection guarantees that it has enough information to retransmit the final ACK.
Realize that TCP sequence numbers wrap around after 2**32 bytes have been transferred. Assume a connection between A.1500 (host A, port 1500) and B.2000. During the connection one segment is lost and retransmitted. But the segment is not really lost, it is held by some intermediate router and then re-injected into the network. (This is called a "wandering duplicate".) But in the time between the packet being lost & retransmitted, and then reappearing, the connection is closed (without any problems) and then another connection is established between the same host, same port (that is, A.1500 and B.2000; this is called another "incarnation" of the connection). But the sequence numbers chosen for the new incarnation just happen to overlap with the sequence number of the wandering duplicate that is about to reappear. (This is indeed possible, given the way sequence numbers are chosen for TCP connections.) Bingo, you are about to deliver the data from the wandering duplicate (the previous incarnation of the connection) to the new incarnation of the connection. To avoid this, you do not allow the same incarnation of the connection to be reestablished until the TIME_WAIT state terminates. Even the TIME_WAIT state doesn't completely solve the second problem, given what is called TIME_WAIT assassination. RFC 1337 has more details.
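The wraparound scenario Stevens describes can be made concrete with 32-bit arithmetic. A small illustrative sketch (the helper names are made up; the comparison mirrors the signed-difference test used in Berkeley-derived stacks):

```java
// Illustration of 32-bit sequence number wraparound: TCP sequence
// numbers live in an unsigned 32-bit space, so after 2**32 bytes the
// counter wraps, and a new incarnation of a connection can reuse
// numbers that a wandering duplicate from the old incarnation carries.
public class SeqWrapSketch {
    // "seqA is before seqB" in TCP's modular arithmetic: the signed
    // 32-bit difference is negative (as in the SEQ_LT macro of
    // Berkeley-derived stacks: (int)(a - b) < 0).
    public static boolean before(long seqA, long seqB) {
        return (int) (seqA - seqB) < 0;
    }

    // Advance a sequence number by some byte count, wrapping mod 2**32.
    public static long advance(long seq, long bytes) {
        return (seq + bytes) & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        long seq = 0xFFFFFFF0L;                  // close to the wrap point
        long next = advance(seq, 0x20);          // crosses 2**32
        System.out.println(next);                // 16: wrapped around
        System.out.println(before(seq, next));   // true: next still counts as "newer"
    }
}
```

The point of the sketch is only that a small number like 16 can legitimately be "newer" than a huge one like 0xFFFFFFF0; that is precisely why an old segment carrying an overlapping sequence number would be accepted by a new incarnation, and why TIME_WAIT blocks the new incarnation for 2*MSL.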
The reason that the duration of the TIME_WAIT state is 2*MSL is that the maximum amount of time a packet can wander around a network is assumed to be MSL seconds. The factor of 2 is for the round-trip. The recommended value for MSL is 120 seconds, but Berkeley-derived implementations normally use 30 seconds instead. This means a TIME_WAIT delay between 1 and 4 minutes. Solaris 2.x does indeed use the recommended MSL of 120 seconds.
A wandering duplicate is a packet that appeared to be lost and was retransmitted. But it wasn't really lost ... some router had problems, held on to the packet for a while (order of seconds, could be a minute if the TTL is large enough) and then re-injects the packet back into the network. But by the time it reappears, the application that sent it originally has already retransmitted the data contained in that packet.

Because of these potential problems with TIME_WAIT assassinations, one should not avoid the TIME_WAIT state by setting the SO_LINGER option to send an RST instead of the normal TCP connection termination (FIN/ACK/FIN/ACK). The TIME_WAIT state is there for a reason; it's your friend and it's there to help you :-)

I have a long discussion of just this topic in my just-released "TCP/IP Illustrated, Volume 3". The TIME_WAIT state is indeed one of the most misunderstood features of TCP.

I'm currently rewriting "Unix Network Programming" (see 1.5 Where can I get source code for the book [book title]?), and will include lots more on this topic, as it is often confusing and misunderstood.

An additional note from Andrew:

Closing a socket: if SO_LINGER has not been called on a socket, then close() is not supposed to discard data. This is true on SVR4.2 (and, apparently, on all non-SVR4 systems) but apparently not on SVR4; the use of either shutdown() or SO_LINGER seems to be required to guarantee delivery of all data.

----------------------------------------------------------------------
The "Address already in use: connect" error is caused by client socket
starvation on the machine(s) that SOAPtest is running on. By default
Windows does not allow you to set up client connections on ports above
5000. After a socket has been closed, the connection stays in a TIME_WAIT
state for another 2 minutes, after which the socket is freed and the
address can be reused. If more than 4000 connections (1024-5000) have been
made before those ports are freed (after 2 min. in TIME_WAIT), then
attempts to open a client socket on a port above 5000 will be rejected by
the operating system, which will cause Java to throw "Address already in
use: connect". This can be fixed by modifying the Windows registry entry
that controls this parameter:
1. Start Registry Editor: Start Menu > Run > type in "regedit"
2. Locate the following key:
   HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
3. Right-click on the Parameters folder and select New > DWORD Value
4. Name this new value "MaxUserPort"
5. Double-click on the "MaxUserPort" value, change the value data to 65534,
   and select "Decimal" as the base.
6. Restart the machine.
(For more information see Microsoft Knowledge Base Article 196271)