A Detailed Look at RetryMiddleware

This article walks through how to override the RetryMiddleware downloader middleware in the Scrapy crawling framework to automatically delete unstable proxies, log failures, and retry the affected requests. By tuning a few settings, developers can customise the error-handling flow and improve a crawler's stability and efficiency.

Overriding the Scrapy middleware RetryMiddleware

Crawling inevitably runs into errors such as timeouts and 404s. On top of that, when using an IP proxy pool, not every proxy is stable, so failed proxies need handling of their own, such as being deleted from the pool, and requests that failed because of an unstable proxy need to be reissued. This is where it becomes worthwhile to override RetryMiddleware and implement the behaviour you want.
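
Two pieces of request metadata matter in what follows: meta['proxy'], which is how a proxy from the pool gets attached to an individual request (and how the middleware can find it again to delete it), and meta['dont_retry'], which opts a request out of retrying. A minimal spider sketch, in which the URLs and proxy address are placeholders:

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        # Route this request through a proxy; HttpProxyMiddleware reads
        # meta['proxy'], and a custom retry middleware can read it back
        # later to drop a failing proxy from the pool.
        yield scrapy.Request('https://example.com/a',
                             meta={'proxy': 'http://127.0.0.1:8888'},
                             callback=self.parse)
        # Opt this request out of retrying entirely
        yield scrapy.Request('https://example.com/b',
                             meta={'dont_retry': True},
                             callback=self.parse)

    def parse(self, response):
        self.logger.info('got %s from %s', response.status, response.url)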

  • Understand the RetryMiddleware source
  • Override RetryMiddleware



 

RetryMiddleware source (excerpt)

# Imports as they appear in the Scrapy 1.x source, added here so the
# excerpt is self-contained:
from twisted.internet import defer
from twisted.internet.error import (
    TimeoutError, DNSLookupError, ConnectionRefusedError, ConnectionDone,
    ConnectError, ConnectionLost, TCPTimedOutError)
from twisted.web.client import ResponseFailed

from scrapy.exceptions import NotConfigured
from scrapy.utils.response import response_status_message
from scrapy.core.downloader.handlers.http11 import TunnelError


class RetryMiddleware(object):

    # Retry when any of these exceptions is raised
    EXCEPTIONS_TO_RETRY = (defer.TimeoutError, TimeoutError, DNSLookupError,
                           ConnectionRefusedError, ConnectionDone, ConnectError,
                           ConnectionLost, TCPTimedOutError, ResponseFailed,
                           IOError, TunnelError)

    def __init__(self, settings):
        '''
        Several options from settings.py come into play here:
        RETRY_ENABLED: enables the middleware, defaults to True
        RETRY_TIMES: number of retries, defaults to 2
        RETRY_HTTP_CODES: response status codes that trigger a retry,
            a list defaulting to [500, 502, 503, 504, 400, 408]
        RETRY_PRIORITY_ADJUST: priority of the retry request relative to
            the original request, defaults to -1
        '''
        if not settings.getbool('RETRY_ENABLED'):
            raise NotConfigured
        self.max_retry_times = settings.getint('RETRY_TIMES')
        self.retry_http_codes = set(int(x) for x in settings.getlist('RETRY_HTTP_CODES'))
        self.priority_adjust = settings.getint('RETRY_PRIORITY_ADJUST')

    def process_response(self, request, response, spider):
        # A request built earlier can carry dont_retry in its meta to
        # opt out of retrying
        if request.meta.get('dont_retry', False):
            return response

        # If the status code is in the retry list, call _retry()
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # Hook in your own behaviour here, e.g. drop a dead proxy, log, etc.
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        # Retry if one of the exceptions in EXCEPTIONS_TO_RETRY occurred
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            # Hook in your own behaviour here, e.g. drop a dead proxy, log, etc.
            return self._retry(request, exception, spider)
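
For completeness, the four settings read in __init__ can be tuned in settings.py; a minimal sketch with illustrative values:

# settings.py (values are illustrative)
RETRY_ENABLED = True
RETRY_TIMES = 3                                # retry each failed request up to 3 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]   # statuses that trigger a retry
RETRY_PRIORITY_ADJUST = -1                     # retries run at slightly lower priority
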
The hook points for custom behaviour are marked with comments in the RetryMiddleware source above; dropping your own code in there covers most needs. A concrete example:
import time
import random
import logging

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class MyRetryMiddleware(RetryMiddleware):
    logger = logging.getLogger(__name__)

    def delete_proxy(self, proxy):
        if proxy:
            # Remove the proxy from your proxy pool here; the implementation
            # depends on how the pool is stored
            pass

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # Drop the proxy that produced the bad response
            self.delete_proxy(request.meta.get('proxy', False))
            # Note: time.sleep() blocks the whole crawler, not just this request
            time.sleep(random.randint(3, 5))
            self.logger.warning('Unexpected response status, retrying...')
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            # Drop the proxy that caused the connection error
            self.delete_proxy(request.meta.get('proxy', False))
            time.sleep(random.randint(3, 5))
            self.logger.warning('Connection error, retrying...')
            return self._retry(request, exception, spider)
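
To put the subclass into effect, register it in settings.py in place of the built-in middleware. A sketch, assuming the project module is named myproject:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # disable the built-in retry middleware...
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    # ...and mount the subclass in the slot (550) the built-in occupies by default
    'myproject.middlewares.MyRetryMiddleware': 550,
}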

The _retry method does the following:
1. It increments retry_times in request.meta.
2. It compares the new retry_times against max_retry_times; if retry_times is less than or equal, it uses copy() to clone the original request, updates the clone's retry_times, and sets dont_filter to True so the request is not dropped by the duplicate-URL filter.
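
For reference, a simplified sketch of _retry that matches the description above (the real Scrapy implementation additionally logs the retry reason and records retry statistics):

def _retry(self, request, reason, spider):
    retries = request.meta.get('retry_times', 0) + 1
    if retries <= self.max_retry_times:
        retryreq = request.copy()               # clone the original request
        retryreq.meta['retry_times'] = retries  # bump the retry counter
        retryreq.dont_filter = True             # bypass the duplicate filter
        # reschedule at adjusted priority (RETRY_PRIORITY_ADJUST)
        retryreq.priority = request.priority + self.priority_adjust
        return retryreq
    # give up: returning None lets the response/exception pass through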
