First, install scrapy-redis. Writing a distributed spider is not much different from writing a plain Scrapy one; the differences are as follows:
- In `settings.py`, configure:
```python
# Use the scrapy-redis dedupe filter instead of Scrapy's default one
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Use the scrapy-redis scheduler instead of Scrapy's default one
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Allow pausing: the request records kept in Redis are not lost on restart
SCHEDULER_PERSIST = True

# Enable exactly one of the following three; the first is recommended
# Default scrapy-redis request queue (ordered by priority)
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
# Queue: requests are processed first in, first out
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
# Stack: requests are processed first in, last out
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
```
```python
# scrapy_redis.pipelines.RedisPipeline stores scraped items in Redis;
# it must be enabled
ITEM_PIPELINES = {
    'dongguan.pipelines.DongguanPipeline': 500,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# Host and port of the Redis server
REDIS_HOST = '192.168.99.1'
REDIS_PORT = 6379
```
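The `DongguanPipeline` referenced in `ITEM_PIPELINES` is not shown in this post. A minimal sketch of what such a project pipeline might look like; writing items to a local JSON-lines file is an assumption, not the original implementation:

```python
# pipelines.py -- hypothetical sketch of DongguanPipeline
import json

class DongguanPipeline(object):
    def open_spider(self, spider):
        # One JSON object per line; binary mode so the same code
        # runs under both Python 2 and 3
        self.file = open('dongguan.json', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line.encode('utf-8'))
        # Return the item so RedisPipeline (priority 400, which runs
        # earlier) and any later pipelines also see it
        return item

    def close_spider(self, spider):
        self.file.close()
```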
- In the spider file, change:
```python
# from scrapy.spiders import CrawlSpider, Rule
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisCrawlSpider
```
```python
# class DongguanquestionSpider(CrawlSpider):
class DongguanquestionSpider(RedisCrawlSpider):
    name = 'dongguanquestion'
    # Idle workers block on this Redis key, waiting for start URLs
    redis_key = 'DongguanquestionSpider:start_urls'
    # allowed_domains = ['wz.sun0769.com']
    # start_urls = ['http://wz.sun0769.com/index.php/question/report?page=0']

    pagelinks = LinkExtractor(allow=r'page=\d+')
    questionlinks = LinkExtractor(allow=r'/question/\d+/\d+\.shtml')

    rules = (
        Rule(pagelinks),
        Rule(questionlinks, callback='parse_item'),
    )

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list from the
        # -a domain=... command-line argument
        domain = kwargs.pop('domain', '')
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(DongguanquestionSpider, self).__init__(*args, **kwargs)
```
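The rules above reference a `parse_item` callback that the original post does not show. A minimal sketch of a method you could add to the class; the field names and XPath are assumptions, not the original extraction logic:

```python
    def parse_item(self, response):
        # Collect the page URL and <title> as a minimal item (Scrapy
        # accepts plain dicts as items); real extraction would target
        # the complaint content on wz.sun0769.com
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
        }
```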
- Run the spider:

```bash
python2 -m scrapy runspider dongguanquestion.py
```

On Ubuntu, run it with sudo:

```bash
sudo python2 -m scrapy runspider dongguanquestion.py
```
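After launch, a `RedisCrawlSpider` sits idle until a start URL appears under its `redis_key`. A minimal sketch for seeding the queue, assuming the `redis` (redis-py) package is installed and the host/port match `settings.py`:

```python
# seed_start_urls.py -- hypothetical helper, not part of the original post:
# push the first URL onto the spider's redis_key so idle workers start crawling
import redis

r = redis.Redis(host='192.168.99.1', port=6379)
r.lpush('DongguanquestionSpider:start_urls',
        'http://wz.sun0769.com/index.php/question/report?page=0')
```

Equivalently, from `redis-cli`: `lpush DongguanquestionSpider:start_urls http://wz.sun0769.com/index.php/question/report?page=0`. Since the spider builds `allowed_domains` from a `domain` argument, you can also pass it at launch, e.g. `python2 -m scrapy runspider dongguanquestion.py -a domain=wz.sun0769.com`.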