Official repository: https://github.com/rmax/scrapy-redis
Official example: the example-project folder
1. Walkthrough of the example-project folder [top-level files]
(1) docker-compose.yml
redis:
  image: redis
  ports:
    - "6379:6379"   # Redis port
crawler:
  build: .
  links:
    - redis:localhost
(2) Dockerfile
#@IgnoreInspection BashAddShebang
FROM python:2.7-onbuild   # Python version used
ENTRYPOINT ["scrapy"]
CMD ["crawl", "dmoz"]     # the command that is run
(3) process_items.py
Note: the official process_items.py is only a sample. It pops items out of Redis and writes them to a local store (MongoDB or MySQL). It cannot be used as-is, because your database and item structure will differ [the script will be adapted before it is actually used]. It can be run on its own: once data has already been stored in Redis, running process_items.py by itself pulls that data down to the local machine.
from __future__ import print_function, unicode_literals

import argparse
import json
import logging
import pprint
import sys
import time

from scrapy_redis import get_redis

logger = logging.getLogger('process_items')


def process_items(r, keys, timeout, limit=0, log_every=1000, wait=.1):
    """Process items from a redis queue.

    Parameters:
        r : Redis ==> Redis connection instance.
        keys : list ==> List of keys to read the items from.
        timeout : int ==> Read timeout.
    """
    limit = limit or float('inf')
    processed = 0
    while processed < limit:
        ret = r.blpop(keys, timeout)
        if ret is None:
            time.sleep(wait)
            continue

        source, data = ret
        try:
            item = json.loads(data)
        except Exception:
            logger.exception("Failed to load item:\n%r", pprint.pformat(data))
            continue

        try:
            name = item.get('name') or item.get('title')
            url = item.get('url') or item.get('link')
            logger.debug("[%s] Processing item: %s <%s>", source, name, url)
        except KeyError:
            logger.exception("[%s] Failed to process item:\n%r",
                             source, pprint.pformat(item))
            continue

        processed += 1
        if processed % log_every == 0:
            logger.info("Processed %s items", processed)


def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('key', help="Redis key where items are stored")
    parser.add_argument('--host')
    parser.add_argument('--port')
    parser.add_argument('--timeout', type=int, default=5)
    parser.add_argument('--limit', type=int, default=0)
    parser.add_argument('--progress-every', type=int, default=100)
    parser.add_argument('-v', '--verbose', action='store_true')
    args = parser.parse_args()

    params = {}
    if args.host:
        params['host'] = args.host
    if args.port:
        params['port'] = args.port

    logging.basicConfig(level=logging.DEBUG if args.verbose else logging.INFO)

    r = get_redis(**params)
    host = r.connection_pool.get_connection('info').host
    logger.info("Waiting for items in '%s' (server: %s)", args.key, host)

    kwargs = {
        'keys': [args.key],
        'timeout': args.timeout,
        'limit': args.limit,
        'log_every': args.progress_every,
    }
    try:
        process_items(r, **kwargs)
        retcode = 0  # ok
    except KeyboardInterrupt:
        retcode = 0  # ok
    except Exception:
        logger.exception("Unhandled exception")
        retcode = 2

    return retcode


if __name__ == '__main__':
    sys.exit(main())
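As noted above, this loop has to be adapted before real use. A minimal sketch of one possible adaptation that stores the items in MongoDB instead of just logging them, assuming pymongo is installed and a MongoDB instance is running locally (the database and collection names, and the 'dmoz:items' key, are assumptions based on common defaults, not taken from the project):

import json
import pymongo
from scrapy_redis import get_redis

client = pymongo.MongoClient('localhost', 27017)     # assumed local MongoDB
collection = client['example_db']['dmoz_items']      # made-up database/collection names

r = get_redis(host='192.168.21.64', port=6378)       # master address from settings.py below
while True:
    ret = r.blpop(['dmoz:items'], timeout=5)         # expected RedisPipeline key for the dmoz spider (assumption)
    if ret is None:
        break                                        # queue drained, stop
    _, data = ret
    collection.insert_one(json.loads(data))          # decode the JSON item and store it locally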
(4) README.rst ===> project description
(5) requirements.txt
scrapy
scrapy-redis
(6) scrapy.cfg ===> basic project settings
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# http://doc.scrapy.org/topics/scrapyd.html
[settings]
default = example.settings
[deploy]
#url = http://localhost:6800/
project = example
2. The key part: the example folder
[Equivalent to creating a new project with scrapy startproject example]
2-1. Directly inside the example folder (outside spiders/): items.py, pipelines.py, settings.py
(1) items.py
from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Join


class ExampleItem(Item):   # same as defining an Item before; this is the most common way!
    name = Field()
    description = Field()
    link = Field()
    crawled = Field()
    spider = Field()
    url = Field()


class ExampleLoader(ItemLoader):   # ItemLoader is not widely used, but it can also store a Scrapy project's data
    default_item_class = ExampleItem
    default_input_processor = MapCompose(lambda s: s.strip())
    default_output_processor = TakeFirst()
    description_out = Join()
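The comment above says ItemLoader is not widely used; a minimal sketch of how ExampleLoader could populate an ExampleItem inside a spider (this spider is hypothetical, for illustration only, and is not part of the official project):

from scrapy.spiders import Spider
from example.items import ExampleLoader   # the loader defined above

class LoaderDemoSpider(Spider):
    """Hypothetical spider, only to show how ExampleLoader might be used."""
    name = 'loader_demo'                   # made-up name
    start_urls = ['http://www.dmoz.org/']

    def parse(self, response):
        for div in response.css('.title-and-desc'):
            loader = ExampleLoader(selector=div)                 # builds an ExampleItem by default
            loader.add_css('name', '.site-title::text')          # stripped by the input processor, TakeFirst() on output
            loader.add_css('description', '.site-descr::text')   # joined by Join()
            loader.add_css('link', 'a::attr(href)')
            yield loader.load_item()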
(2) pipelines.py
from datetime import datetime


class ExamplePipeline(object):
    def process_item(self, item, spider):   # the process_item() method must be implemented
        item["crawled"] = datetime.utcnow()
        item["spider"] = spider.name
        return item
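A throwaway check of what this pipeline adds to an item (not part of the project; the fake spider class below is made up, and the example package is assumed to be importable):

from example.items import ExampleItem
from example.pipelines import ExamplePipeline

class FakeSpider(object):
    name = 'dmoz'    # only the .name attribute is needed here

item = ExamplePipeline().process_item(ExampleItem(name='test'), FakeSpider())
print(item['crawled'], item['spider'])   # a UTC timestamp and 'dmoz' have been added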
(3) settings.py
Note: REDIS_HOST and REDIS_PORT must point to the Redis server on the master machine (its IP address and port).
SPIDER_MODULES = ['example.spiders']
NEWSPIDER_MODULE = 'example.spiders'
USER_AGENT = 'scrapy-redis (+https://github.com/rolando/scrapy-redis)'
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
ITEM_PIPELINES = {
'example.pipelines.ExamplePipeline': 300,
'scrapy_redis.pipelines.RedisPipeline': 400,
}
LOG_LEVEL = 'DEBUG'
DOWNLOAD_DELAY = 1
REDIS_HOST = '192.168.21.64'   # the Redis host must be the master machine's IP address
REDIS_PORT = 6378
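As an aside, scrapy-redis can also take the connection as a single URL instead of separate host/port settings; a sketch, assuming the same master address as above (the password form is shown only as a hypothetical example):

# Alternative to REDIS_HOST/REDIS_PORT: one connection URL (it takes precedence when set).
REDIS_URL = 'redis://192.168.21.64:6378'
# With a password it would look like: REDIS_URL = 'redis://:yourpassword@192.168.21.64:6378'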
2-2. example/spiders contains: dmoz.py, mycrawler_redis.py, myspider_redis.py
(1) dmoz.py   # a very simple CrawlSpider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DmozSpider(CrawlSpider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/']

    rules = [
        Rule(LinkExtractor(
            restrict_css=('.top-cat', '.sub-cat', '.cat-item')
        ), callback='parse_directory', follow=True),
    ]

    def parse_directory(self, response):
        for div in response.css('.title-and-desc'):
            yield {
                'name': div.css('.site-title::text').extract_first(),
                'description': div.css('.site-descr::text').extract_first().strip(),
                'link': div.css('a::attr(href)').extract_first(),
            }
Run: C:\Users\Administrator\Desktop\test-project\example>scrapy crawl dmoz   [slave machine (junior classmate's computer)]
The crawl runs straight away: dmoz.py does not subclass any Redis-based spider class and has its own start_urls, so it can simply be run on its own. The result looks like this:
[Crawl succeeded!]
Note: if you stop the program and run it again, it picks up where the previous requests left off! [Because the scrapy-redis components are enabled, stopping the project does not lose the queued request data, so the crawl can be resumed at any time; the sketch after these notes shows how to inspect the persisted keys.]
At this point the Redis database on the master machine (my computer) already holds the crawled data;
[which shows that the slave machine stores its crawled data in the master machine's database]
[because settings.py in the slave's project sets REDIS_HOST to the master's IP.]
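To see what scrapy-redis has persisted on the master, you can list the keys it creates for the dmoz spider. A sketch with redis-py, assuming the host/port from settings.py; the key names '<spider>:requests', '<spider>:dupefilter' and '<spider>:items' are the library's usual defaults (stated here as an assumption):

import redis

r = redis.StrictRedis(host='192.168.21.64', port=6378)   # master address from settings.py
for key in sorted(r.keys('dmoz:*')):
    # expected keys: dmoz:dupefilter (set), dmoz:items (list), dmoz:requests (zset)
    print(key, r.type(key))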
(2) myspider_redis.py
1) Because myspider_redis.py inherits from RedisSpider, it is a distributed spider: when the slave machine (junior classmate's computer) runs scrapy crawl myspider_redis, it just sits and waits!
2) You then have to push a start command into Redis from the master machine (my computer): lpush myspider:start_urls http://www.dmoz.org/   [Note: the URL can be replaced with whatever page you want to crawl. A Python equivalent of this command is sketched after the spider code below.]
from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):
    """Spider that reads urls from redis queue (myspider:start_urls)."""
    name = 'myspider_redis'
    redis_key = 'myspider:start_urls'

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        self.allowed_domains = filter(None, domain.split(','))
        super(MySpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        return {
            'name': response.css('title::text').extract_first(),
            'url': response.url,
        }
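The lpush command shown above can also be issued programmatically on the master; a minimal sketch with redis-py, assuming the host/port from settings.py:

import redis

# Push one start URL into the list that MySpider reads from (redis_key = 'myspider:start_urls').
r = redis.StrictRedis(host='192.168.21.64', port=6378)   # master address from settings.py
r.lpush('myspider:start_urls', 'http://www.dmoz.org/')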
Note: distributed scrapy-redis spiders are started with scrapy runspider <spider .py file>.
Run: C:\Users\Administrator\Desktop\test-project\example>scrapy runspider myspider_redis.py   [slave machine (junior classmate's computer)]
1) The spider on the slave machine now sits in a waiting state:
[The Redis request queue is still empty, so it idles until the master machine (my computer) sends it a command.]
2) Push the start command from the master's Redis (my computer): lpush myspider:start_urls http://www.dmoz.org/
3) The spider on the slave machine (junior classmate's computer) then shows:
[Note: the crawl starts]
(3) mycrawler_redis.py
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisCrawlSpider


class MyCrawler(RedisCrawlSpider):
    name = 'mycrawler_redis'
    redis_key = 'mycrawler:start_urls'

    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        self.allowed_domains = filter(None, domain.split(','))
        super(MyCrawler, self).__init__(*args, **kwargs)

    def parse_page(self, response):
        return {
            'name': response.css('title::text').extract_first(),
            'url': response.url,
        }
[Note: identical to myspider_redis.py in 2-2 above, except that it adds CrawlSpider-style rules; a variant of those rules is sketched after the steps below.]
1) Run: C:\Users\Administrator\Desktop\test-project\example>scrapy runspider mycrawler_redis.py   [slave machine (junior classmate's computer)]
2) The spider on the slave machine now sits in a waiting state:
[The Redis request queue is still empty, so it idles until the master machine (my computer) sends it a command.]
3) Push the start command from the master's Redis (my computer): lpush mycrawler:start_urls http://www.dmoz.org/
4) The spider on the slave machine (junior classmate's computer) then shows:
[Note: the crawl starts]
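The Rule(LinkExtractor(), ...) above follows every link on every page it visits. If only part of the site should be crawled, the rules can be narrowed; a hypothetical variation (the allow pattern is made up for illustration):

from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

# Narrower rules for MyCrawler: only follow links whose URL matches the pattern.
rules = (
    Rule(LinkExtractor(allow=(r'/Computers/',)),   # made-up path filter
         callback='parse_page', follow=True),
)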
-------------------------------------------Done-------------------------------------------------------------
Summary 1: how to run
- 1. Start the spider's .py file with scrapy runspider (several can be started one after another); the spider(s) will sit in a waiting state:
- scrapy runspider myspider_redis.py   [slave machine]
- 2. From the master's redis-cli, issue an lpush command in this format:
- redis> lpush myspider:start_urls http://www.dmoz.org/   [master machine]
- 3. The slave spiders receive the request and start crawling.   [slave machine]
Summary 2: Scrapy-Redis distribution
There are two parent classes, RedisSpider and RedisCrawlSpider. They are used just like Scrapy's Spider and CrawlSpider, with Redis added on top.
(1) They add redis_key and drop start_urls. [The start URLs are pushed into the redis_key list instead; see the sketch after this list.]
(2) The allowed-domains range is obtained dynamically (via the domain argument).
(3) Callback handling is the same as in plain Scrapy.
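A minimal sketch of point (1), converting an ordinary Spider into a RedisSpider (the spider name and key below are made up for illustration):

from scrapy_redis.spiders import RedisSpider

class DemoSpider(RedisSpider):        # was: class DemoSpider(scrapy.Spider)
    name = 'demo'                     # made-up spider name
    # start_urls is removed; start URLs are pushed into this Redis list instead:
    redis_key = 'demo:start_urls'

    def parse(self, response):        # callbacks work the same as in plain Scrapy
        yield {'url': response.url}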