Python Crawler (2) Items and Pipelines


We can define items.py as follows:
import scrapy


class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()


class AuthorItem(scrapy.Item):
    name = scrapy.Field()
    desc = scrapy.Field()
    birth = scrapy.Field()
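
For context, here is a minimal sketch of a spider that fills these items; the spider name, the quotes.toscrape.com start URL and the CSS selectors are assumptions for illustration, not part of the original project:

import scrapy
from tutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # one QuoteItem per quote block on the page
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item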

We can define pipelines.py as follows:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.names = set()

    def process_item(self, item, spider):
        name = item['name'] + ' - Unique'
        if name in self.names:
            raise DropItem("Duplicate item found: %s" % item['name'])
        else:
            self.names.add(name)
            item['name'] = name
            return item

We can also add multiple pipelines. For example, a pipeline that stores items in MongoDB:
import pymongo


class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the MongoDB connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # use the item class name as the collection name
        collection_name = item.__class__.__name__
        self.db[collection_name].insert_one(dict(item))
        return item
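
The MongoPipeline above pulls its connection settings from the crawler settings via from_crawler. A minimal sketch of the corresponding entries in settings.py (the URI and database name are placeholders for your own MongoDB):

MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'tutorial'

The pipeline itself still has to be enabled in ITEM_PIPELINES, as shown below.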

To activate pipelines, register them in ITEM_PIPELINES in settings.py, for example:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

For this tutorial project, settings.py enables the duplicates filter:

ITEM_PIPELINES = {
    'tutorial.pipelines.DuplicatesPipeline': 300,
}

Pipelines run in ascending order of these numbers: lower-numbered pipelines process each item first.
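
As an illustration of a second pipeline that runs after the duplicates filter, here is a minimal JsonWriterPipeline sketch in the spirit of the Scrapy docs; the output file name items.jl is an assumption:

import json


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # write each item as one JSON line and pass it on unchanged
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item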

A big sample project
https://github.com/gnemoug/distribute_crawler

Deployment
https://scrapyd.readthedocs.io/en/latest/install.html
https://github.com/istresearch/scrapy-cluster

Install scrapyd
>pip install scrapyd

Talk to the server side directly through its JSON API
https://scrapyd.readthedocs.io/en/latest/api.html
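
For example, once the server is running, the documented endpoints can be called directly with curl:

Check the daemon status
>curl http://localhost:6800/daemonstatus.json

List the projects known to the server
>curl http://localhost:6800/listprojects.json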

Clients
https://github.com/scrapy/scrapyd-client

Deploy
https://github.com/scrapy/scrapyd-client#scrapyd-deploy

Install the client
>pip install scrapyd-client
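
scrapyd-deploy picks up its deploy target from the [deploy] section of the project's scrapy.cfg. A minimal sketch for this tutorial project, assuming the local scrapyd started below:

[settings]
default = tutorial.settings

[deploy]
url = http://localhost:6800/
project = tutorial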

Start the server
>scrapyd

Visit the console
http://localhost:6800/

Deploy my simple project
>scrapyd-deploy
Packing version 1504042554
Deploying to project "tutorial" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "tutorial", "version": "1504042554", "spiders": 2, "node_name": "ip-10-10-21-215.ec2.internal"}

List the deploy targets
>scrapyd-deploy -l
default http://localhost:6800/
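
With the project deployed, a run of a spider can be scheduled and monitored through the API; the spider name quotes is an assumption here:

Schedule a run of the quotes spider
>curl http://localhost:6800/schedule.json -d project=tutorial -d spider=quotes

List pending, running and finished jobs for the project
>curl http://localhost:6800/listjobs.json?project=tutorial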

A possible cluster solution for the future
https://github.com/istresearch/scrapy-cluster

Try with Python 3.6 later

References:
https://github.com/gnemoug/distribute_crawler
https://www.douban.com/group/topic/38361104/
http://wiki.jikexueyuan.com/project/scrapy/item-pipeline.html
https://segmentfault.com/a/1190000009229896