Scrapy Middleware (Part 1)

Latest recommended article published on 2025-05-28 16:49:14

wtftx



Licensed under CC 4.0 BY-SA

Column: Scrapy framework

Article link: https://blog.youkuaiyun.com/wtftx/article/details/90344846

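This article introduces middleware development in Scrapy, including the implementation of proxy, User-Agent (UA), and Cookies middleware. Middleware customizes data processing between requests and responses, for example rotating proxy IPs and UAs and managing Cookies. With middleware configured, a crawler can modify request headers and the proxy before a request is sent and can handle login state, which makes it more flexible and harder to detect.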


The main content draws on the book 《Python爬虫开发从入门到实战》.

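Middleware is a core concept in Scrapy. With middleware, you can make custom modifications to data before the crawler sends a request or after the response comes back, and so build crawlers that adapt to different situations. In essence, "middleware" intercepts data in transit, modifies it, and passes it on. Middleware mainly serves as a development aid.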

Scrapy has two kinds of middleware: downloader middleware (Downloader Middleware) and spider middleware (Spider Middleware).
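The two kinds are registered separately in a project's settings. The fragment below is a minimal sketch of that registration. `DOWNLOADER_MIDDLEWARES` and `SPIDER_MIDDLEWARES` are Scrapy's setting names, while the module path `myproject.middlewares` and both class names are hypothetical placeholders, and the order value 543 is only an example.

```python
# settings.py (fragment)
# Keys are import paths to middleware classes; values are integer orders
# that determine where each class sits in its middleware chain.
# "myproject" and the class names below are hypothetical placeholders.

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.MyDownloaderMiddleware": 543,
}

SPIDER_MIDDLEWARES = {
    "myproject.middlewares.MySpiderMiddleware": 543,
}
```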

Downloader Middleware
Scrapy's official documentation describes downloader middleware as follows.

The downloader middleware is a framework of hooks into Scrapy's request/response processing. It is a light, low-level system for globally altering Scrapy's requests and responses.

Put simply, it is used to switch proxy IPs, switch Cookies, switch the User-Agent, and retry automatically.
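As a rough illustration of the first and third items, the sketch below picks a random proxy and a random User-Agent for every outgoing request. The class name, the `PROXIES` list and the `USER_AGENTS` list are all hypothetical placeholders. The sketch relies on two facts about Scrapy: a downloader middleware may define `process_request(request, spider)`, and the proxy for a request is read from `request.meta["proxy"]`.

```python
import random

# Placeholder values (assumptions): replace them with proxies and
# User-Agent strings that you are actually allowed to use.
PROXIES = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]


class RandomProxyUserAgentMiddleware:
    """Hypothetical downloader middleware: new proxy and UA per request."""

    def process_request(self, request, spider):
        # Scrapy takes the proxy for this request from request.meta["proxy"].
        request.meta["proxy"] = random.choice(PROXIES)
        # Overwrite whatever User-Agent header was set earlier.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None lets Scrapy continue handling the request normally.
        return None
```

Choosing with `random.choice` on every call keeps the middleware stateless, so consecutive requests may reuse the same proxy; a rotation scheme would need extra bookkeeping.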

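Without any middleware, the crawler's flow is as shown in the figure below.
[Figure: crawler flow without middleware]
With middleware in place, the crawler's flow is as shown in the figure below.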
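To make the position of a downloader middleware in that flow concrete, here is a minimal sketch, not the article's own code. Each request passes through `process_request` on its way out, and each response passes through `process_response` on its way back. The class name and the `sent_at` meta key are hypothetical; the two method names and their signatures are the hooks Scrapy calls on a downloader middleware.

```python
import time


class TimingDownloaderMiddleware:
    """Hypothetical middleware that logs how long each download took."""

    def process_request(self, request, spider):
        # Outbound: runs before the request reaches the downloader.
        request.meta["sent_at"] = time.monotonic()
        # None means "carry on with this request".
        return None

    def process_response(self, request, response, spider):
        # Inbound: runs before the response reaches the spider.
        sent_at = request.meta.get("sent_at")
        if sent_at is not None:
            elapsed = time.monotonic() - sent_at
            spider.logger.info("%s took %.3fs", request.url, elapsed)
        # Hand the response on unchanged.
        return response
```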

Developing a Proxy Middleware

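In crawler development, each request sometimes needs to go out through a randomly chosen proxy IP. A middleware is simply a Python class: as long as every request passes through this class before the crawler visits the site, the class can assign the request a new proxy IP, which makes the proxy change dynamically. After you create a Scrapy project, the project folder contains a middlewares.py file: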


from scrapy import signals

class Douban250SpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of th