First, install scrapy-redis. Writing a distributed spider is not much different from writing a plain Scrapy one; the differences are as follows:
- In `settings.py`, configure:
```python
# Use the scrapy-redis dedupe filter instead of Scrapy's default one
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Use the scrapy-redis scheduler instead of Scrapy's default one
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Allow pausing: the request records kept in Redis are not lost on restart
SCHEDULER_PERSIST = True

# Enable exactly one of the following three; the first is recommended
# Default scrapy-redis request queue (ordered by priority)
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
# Queue: requests are processed first in, first out
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
# Stack: requests are processed first in, last out
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
```
```python
# scrapy_redis.pipelines.RedisPipeline stores scraped items in Redis;
# it must be enabled
ITEM_PIPELINES = {
    'dongguan.pipelines.DongguanPipeline': 500,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# Host and port of the Redis server
REDIS_HOST = '192.168.99.1'
REDIS_PORT = 6379
```
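The `DongguanPipeline` referenced in `ITEM_PIPELINES` is not shown in this post. A minimal sketch of what such a project pipeline might look like; writing items to a local JSON-lines file is an assumption, not the original implementation:

```python
# pipelines.py -- hypothetical sketch of DongguanPipeline
import json

class DongguanPipeline(object):
    def open_spider(self, spider):
        # One JSON object per line; binary mode so the same code
        # runs under both Python 2 and 3
        self.file = open('dongguan.json', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line.encode('utf-8'))
        # Return the item so RedisPipeline (priority 400, which runs
        # earlier) and any later pipelines also see it
        return item

    def close_spider(self, spider):
        self.file.close()
```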
- In the spider file, change:
```python
# from scrapy.spiders import CrawlSpider, Rule
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisCrawlSpider
```
```python
# class DongguanquestionSpider(CrawlSpider):
class DongguanquestionSpider(RedisCrawlSpider):
    name = 'dongguanquestion'
    # Idle workers block on this Redis key, waiting for start URLs
    redis_key = 'DongguanquestionSpider:start_urls'
    # allowed_domains = ['wz.sun0769.com']
    # start_urls = ['http://wz.sun0769.com/index.php/question/report?page=0']

    pagelinks = LinkExtractor(allow=r'page=\d+')
    questionlinks = LinkExtractor(allow=r'/question/\d+/\d+\.shtml')

    rules = (
        Rule(pagelinks),
        Rule(questionlinks, callback='parse_item'),
    )

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list from the
        # -a domain=... command-line argument
        domain = kwargs.pop('domain', '')
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(DongguanquestionSpider, self).__init__(*args, **kwargs)
```
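The rules above reference a `parse_item` callback that the original post does not show. A minimal sketch of a method you could add to the class; the field names and XPath are assumptions, not the original extraction logic:

```python
    def parse_item(self, response):
        # Collect the page URL and <title> as a minimal item (Scrapy
        # accepts plain dicts as items); real extraction would target
        # the complaint content on wz.sun0769.com
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
        }
```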
- Run the spider:

```bash
python2 -m scrapy runspider dongguanquestion.py
```

On Ubuntu, run it with sudo:

```bash
sudo python2 -m scrapy runspider dongguanquestion.py
```
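After launch, a `RedisCrawlSpider` sits idle until a start URL appears under its `redis_key`. A minimal sketch for seeding the queue, assuming the `redis` (redis-py) package is installed and the host/port match `settings.py`:

```python
# seed_start_urls.py -- hypothetical helper, not part of the original post:
# push the first URL onto the spider's redis_key so idle workers start crawling
import redis

r = redis.Redis(host='192.168.99.1', port=6379)
r.lpush('DongguanquestionSpider:start_urls',
        'http://wz.sun0769.com/index.php/question/report?page=0')
```

Equivalently, from `redis-cli`: `lpush DongguanquestionSpider:start_urls http://wz.sun0769.com/index.php/question/report?page=0`. Since the spider builds `allowed_domains` from a `domain` argument, you can also pass it at launch, e.g. `python2 -m scrapy runspider dongguanquestion.py -a domain=wz.sun0769.com`.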