Scrapy Pipelines: Using a Dedicated Twisted Client (with a Redis Cache as the Example)

bluespacezero

License: CC 4.0 BY-SA

Categories: scrapy, web crawlers

Article link: https://blog.youkuaiyun.com/Q_AN1314/article/details/51210283

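This article shows how to talk to Redis from Scrapy through a Twisted client. A Redis-backed pipeline serves as a distributed cache so that addresses that have already been geocoded are not requested again. It walks through connecting to Redis, looking up and storing key-value pairs, and enabling and configuring the pipeline in Scrapy.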

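Scrapy can work with other services through dedicated Twisted clients. For example, if we want to use it with MongoDB, a search for "MongoDB Python" turns up PyMongo, a blocking/synchronous library that should not be used with Twisted unless we also use threads to handle the blocking operations. A search for "MongoDB Twisted Python" turns up txmongo, which works well with Twisted and Scrapy. The communities around Twisted clients are usually small, but using one of them is still better than writing our own. Next, we will use one such Twisted client to work with Redis.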

A Pipeline That Reads From and Writes To Redis

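The Google Geocoding API enforces its limits per IP address. You may have several IPs (for example, many servers) and want to avoid geocoding an address that another machine has already looked up, or that was already seen in an earlier run.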

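We can use Redis key-value pairs as a distributed cache dict. Googling "Redis Twisted" turns up the txredisapi library. Unlike other libraries, it is not a thin wrapper around a synchronous Python library. It is a proper Twisted library that connects to Redis with reactor.connectTCP(), implements Twisted protocols, and so on. We use it like any other library, but inside a Twisted application it should be considerably more efficient. Install it through pip (sudo pip install txredisapi dj_redis_url), together with dj_redis_url, a utility library that parses a Redis configuration URL.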

Here is the code for the Redis pipeline (the RedisCache class):

import json

import dj_redis_url
import txredisapi

from scrapy.exceptions import NotConfigured
from twisted.internet import defer
from scrapy import signals


class RedisCache(object):
    """A pipeline that uses a Redis server to cache values"""

    @classmethod
    def from_crawler(cls, crawler):
        """Create a new instance and pass it Redis' url and namespace"""

        # Get redis URL
        redis_url = crawler.settings.get('REDIS_PIPELINE_URL', None)

        # If doesn't exist, disable
        if not redis_url:
            raise NotConfigured

        redis_nm = crawler.settings.get('REDIS_PIPELINE_NS', 'ADDRESS_CACHE')

        return cls(crawler, redis_url, redis_nm)

    def __init__(self, crawler, redis_url, redis_nm):
        """Store configuration, open connection and register callback"""

        # Store the url and the namespace for future reference
        self.redis_url = redis_url
        self.redis_nm = redis_nm

        # Report connection error only once
        self.report_connection_error = True

        # Parse redis URL and try to initialize a connection
        args = RedisCache.parse_redis_url(redis_url)
        self.connection = txredisapi.lazyConnectionPool(connectTimeout=5,
                                                        replyTimeout=5,
                                                        **args)

        # Connect the item_scraped signal
        crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        """Looks up address in redis"""

        logger = spider.logger

        if "location" in item:
            # Set by previous step (spider or pipeline). Don't do anything
            defer.returnValue(item)
            return

        # The item has to have the address field set
        assert ("address" in item) and (len(item["address"]) > 0)

        # Extract the address from the item.
        address = item["address"][0]

        try:
            # Check Redis
            key = self.redis_nm + ":" + address

            value = yield self.connection.get(key)

            if value:
                # Set the value for this item
                item["location"] = json.loads(value)

        except txredisapi.ConnectionError:
            if self.report_connection_error:
                logger.error("Can't connect to Redis: %s" % self.redis_url)
                self.report_connection_error = False

        defer.returnValue(item)

    def item_scraped(self, item, spider):
        """
        This function inspects the item after it has gone through every
        pipeline stage and if there is some cache value to add it does so.
        """
        # Capture and encode the location and the address
        try:
            location = item["location"]
            value = json.dumps(location, ensure_ascii=False)
        except KeyError:
            return

        # Extract the address from the item.
        address = item["address"][0]

        key = self.redis_nm + ":" + address

        quiet = lambda failure: failure.trap(txredisapi.ConnectionError)

        # Store it in Redis asynchronously
        return self.connection.set(key, value).addErrback(quiet)

    @staticmethod
    def parse_redis_url(redis_url):
        """
        Parses redis url and prepares arguments for
        txredisapi.lazyConnectionPool()
        """

        params = dj_redis_url.parse(redis_url)

        conn_kwargs = {}
        conn_kwargs['host'] = params['HOST']
        conn_kwargs['password'] = params['PASSWORD']
        conn_kwargs['dbid'] = params['DB']
        conn_kwargs['port'] = params['PORT']

        # Remove items with empty values
        conn_kwargs = dict((k, v) for k, v in conn_kwargs.items() if v)

        return conn_kwargs

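To connect to the Redis server we need the host name, the port and so on. These are all stored in a URL, which the parse_redis_url() method parses. redis_nm is the prefix for the keys. We then use txredisapi's lazyConnectionPool() to connect to the database.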
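To make the URL parsing step concrete, here is a minimal sketch that uses only the standard library instead of dj_redis_url. The helper name redis_url_to_kwargs is hypothetical, and the output keys simply mirror the ones that parse_redis_url() builds above. It is an illustration of the idea, not what dj_redis_url does internally.

```python
from urllib.parse import urlparse


def redis_url_to_kwargs(redis_url):
    """Turn a redis:// URL into connection keyword arguments.

    Hypothetical stand-in for parse_redis_url(); standard library only.
    """
    parts = urlparse(redis_url)
    db = parts.path.lstrip("/")
    kwargs = {
        "host": parts.hostname,
        "port": parts.port,
        "password": parts.password,
        "dbid": int(db) if db else None,
    }
    # Drop empty values, as parse_redis_url() does, so that the
    # connection pool falls back to its own defaults for them
    return {k: v for k, v in kwargs.items() if v}


print(redis_url_to_kwargs("redis://redis:6379"))
print(redis_url_to_kwargs("redis://:secret@localhost:6380/2"))
```

Because empty values are dropped, a URL with no password or database number yields only the host and the port.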

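The last line of __init__() is interesting. What we want is to wrap the geo-pipeline with this pipeline. If Redis does not have the value we want, the geo-pipeline uses the API to geocode the address, as before. Afterwards we need a way to cache these key-value pairs in Redis, and we do that by connecting to the signals.item_scraped signal. Because the item_scraped signal fires only after an Item has passed through every pipeline without being dropped, it is a safe place to hook in.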

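This cache simply looks up and records the address and location of every Item. That works for Redis because it usually runs on the same server as the crawler, which makes it fast. If it does not run on the same machine, you may also want to add a dict-based cache like the one we used earlier.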
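The idea of putting a dict-based cache in front of a remote Redis can be sketched as follows. This is a synchronous illustration only: TwoLevelCache and slow_get are hypothetical names, and a plain dict stands in for the remote server. In the real pipeline the remote lookup would return a Deferred.

```python
class TwoLevelCache:
    """A local dict in front of a slower remote lookup (hypothetical)."""

    def __init__(self, remote_get):
        self.local = {}
        self.remote_get = remote_get  # callable: key -> value or None

    def get(self, key):
        if key in self.local:
            return self.local[key]    # served locally, no round trip
        value = self.remote_get(key)
        if value is not None:
            self.local[key] = value   # remember hits only
        return value


# A plain dict stands in for the remote Redis server
remote = {"ADDRESS_CACHE:London": '{"lat": 51.5, "lon": -0.12}'}
calls = []


def slow_get(key):
    calls.append(key)                 # record every remote round trip
    return remote.get(key)


cache = TwoLevelCache(slow_get)
cache.get("ADDRESS_CACHE:London")
cache.get("ADDRESS_CACHE:London")
print(len(calls))
```

Only the first lookup of a key reaches the remote side. Misses are not remembered, so a value that another machine writes later can still be found.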

In process_item() we take the address, add the prefix, and look it up in Redis with the get() method of the txredisapi connection. We store JSON objects in Redis, so if the value is already set there, we decode it with JSON.
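The lookup logic can be tried out without a Redis server. In the sketch below a plain dict stands in for the txredisapi connection and the function name lookup_location is hypothetical. The key format and the JSON decoding follow process_item() above, but everything runs synchronously.

```python
import json


def lookup_location(store, namespace, item):
    """Fill item["location"] from the cache if the address is known."""
    if "location" in item:
        return item                   # already set by an earlier step
    address = item["address"][0]
    value = store.get(namespace + ":" + address)
    if value:
        item["location"] = json.loads(value)
    return item


# A dict stands in for Redis; values are JSON strings, as in the pipeline
store = {"ADDRESS_CACHE:London": '{"lat": 51.5, "lon": -0.12}'}

hit = lookup_location(store, "ADDRESS_CACHE", {"address": ["London"]})
miss = lookup_location(store, "ADDRESS_CACHE", {"address": ["Paris"]})
print(hit["location"], "location" in miss)
```

An Item whose address is not cached passes through unchanged, which leaves the geocoding to the next pipeline stage.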

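When an Item reaches the end of all pipelines without being dropped, we catch it again through the item_scraped signal and store its value in Redis. If we have a location, we find the corresponding address, add the prefix, and pass the two as key and value to the set() method of the txredisapi connection. You may notice that @defer.inlineCallbacks is not used here, because it is not supported when handling signals.item_scraped. That means we cannot use the familiar and convenient yield on connection.set(), but we can still return a Deferred, which Scrapy can use to chain subsequent signal listeners.

In any case, if connection.set() cannot establish a connection to Redis, it fails with an exception. We can catch that exception with a custom error handler by adding an errback to the Deferred that connection.set() returns, as the last line of item_scraped() shows. In this handler we take the failure object and tell it to trap any ConnectionError. This is a nice feature of the Twisted API: by calling trap() on the exceptions we expect, we can ignore them in a compact way.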
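The write-back step can be sketched in the same way, again with a plain dict standing in for Redis. The function name store_location is hypothetical, and this synchronous version leaves out the Deferred and the errback. It only shows which key and value end up in the cache.

```python
import json


def store_location(store, namespace, item):
    """Cache item["location"] under the namespaced address key."""
    try:
        value = json.dumps(item["location"], ensure_ascii=False)
    except KeyError:
        return False                  # no location, nothing to cache
    store[namespace + ":" + item["address"][0]] = value
    return True


store = {}
store_location(store, "ADDRESS_CACHE",
               {"address": ["London"], "location": {"lat": 51.5, "lon": -0.12}})
store_location(store, "ADDRESS_CACHE", {"address": ["Paris"]})
print(store)
```

Items that never received a location are skipped, so only the first call above writes to the cache.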

To enable this pipeline, add it to ITEM_PIPELINES and set REDIS_PIPELINE_URL. Note that it must be given a priority value that places it before the geo-pipeline (a lower number), otherwise it is useless:

ITEM_PIPELINES = { ...
    'properties.pipelines.redis.RedisCache': 300,
    'properties.pipelines.geo.GeoPipeline': 400,
    ...
}
REDIS_PIPELINE_URL = 'redis://redis:6379'
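Scrapy runs item pipelines in ascending order of their values, which is why 300 puts RedisCache ahead of GeoPipeline at 400. The short sketch below reproduces that ordering rule with a plain sort. It uses only the two entries shown above and does not call Scrapy itself.

```python
# Only the two entries shown in the settings above; dict order is irrelevant
ITEM_PIPELINES = {
    'properties.pipelines.geo.GeoPipeline': 400,
    'properties.pipelines.redis.RedisCache': 300,
}

# Lower values run first, so sort by value and keep the class names
order = [path.rsplit('.', 1)[-1]
         for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(order)
```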

The output looks like this:

$ scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=100
...
INFO: Enabled item pipelines: TidyUp, RedisCache, GeoPipeline, MysqlWriter, EsWriter
...
Scraped... 0.0 items/s, avg latency: 0.00 s, time in pipelines: 0.00 s
Scraped... 21.2 items/s, avg latency: 0.78 s, time in pipelines: 0.15 s
Scraped... 24.2 items/s, avg latency: 0.82 s, time in pipelines: 0.16 s
...
INFO: Dumping Scrapy stats: {...
    'geo_pipeline/already_set': 106,
    'item_scraped_count': 106,

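We can see that both GeoPipeline and RedisCache are enabled and that RedisCache comes first. Note also the geo_pipeline/already_set: 106 statistic. It is the number of Items for which GeoPipeline found the location already set from the Redis cache, and in those cases no Google API request is made. If the Redis cache is empty, you will see that some keys are still processed through the Google API.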