Python Crawler (2) Items and Pipelines


We can define items.py as follows:
import scrapy


class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()


class AuthorItem(scrapy.Item):
    name = scrapy.Field()
    desc = scrapy.Field()
    birth = scrapy.Field()
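
For context, here is a minimal sketch of a spider that fills these items; the spider name, the quotes.toscrape.com start URL and the CSS selectors are assumptions for illustration, not part of the original project:

import scrapy
from tutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # one QuoteItem per quote block on the page
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item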

We can define pipelines.py as follows:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.names = set()

    def process_item(self, item, spider):
        name = item['name'] + ' - Unique'
        if name in self.names:
            raise DropItem("Duplicate item found: %s" % item['name'])
        else:
            self.names.add(name)
            item['name'] = name
            return item

We can also add multiple pipelines. For example, a pipeline that stores items in MongoDB:
import pymongo


class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the MongoDB connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # use the item class name as the collection name
        collection_name = item.__class__.__name__
        self.db[collection_name].insert_one(dict(item))
        return item
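
The MongoPipeline above pulls its connection settings from the crawler settings via from_crawler. A minimal sketch of the corresponding entries in settings.py (the URI and database name are placeholders for your own MongoDB):

MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'tutorial'

The pipeline itself still has to be enabled in ITEM_PIPELINES, as shown below.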

To activate pipelines, register them in ITEM_PIPELINES in settings.py, for example:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

For this tutorial project, settings.py enables the duplicates filter:

ITEM_PIPELINES = {
    'tutorial.pipelines.DuplicatesPipeline': 300,
}

Pipelines run in ascending order of these numbers: lower-numbered pipelines process each item first.
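
As an illustration of a second pipeline that runs after the duplicates filter, here is a minimal JsonWriterPipeline sketch in the spirit of the Scrapy docs; the output file name items.jl is an assumption:

import json


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # write each item as one JSON line and pass it on unchanged
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item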

A big sample project
https://github.com/gnemoug/distribute_crawler

Deployment
https://scrapyd.readthedocs.io/en/latest/install.html
https://github.com/istresearch/scrapy-cluster

Install scrapyd
>pip install scrapyd

Talk to the server side directly through its JSON API
https://scrapyd.readthedocs.io/en/latest/api.html
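
For example, once the server is running, the documented endpoints can be called directly with curl:

Check the daemon status
>curl http://localhost:6800/daemonstatus.json

List the projects known to the server
>curl http://localhost:6800/listprojects.json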

Clients
https://github.com/scrapy/scrapyd-client

Deploy
https://github.com/scrapy/scrapyd-client#scrapyd-deploy

Install the client
>pip install scrapyd-client
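
scrapyd-deploy picks up its deploy target from the [deploy] section of the project's scrapy.cfg. A minimal sketch for this tutorial project, assuming the local scrapyd started below:

[settings]
default = tutorial.settings

[deploy]
url = http://localhost:6800/
project = tutorial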

Start the server
>scrapyd

Visit the console
http://localhost:6800/

Deploy my simple project
>scrapyd-deploy
Packing version 1504042554
Deploying to project "tutorial" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "tutorial", "version": "1504042554", "spiders": 2, "node_name": "ip-10-10-21-215.ec2.internal"}

List the deploy targets
>scrapyd-deploy -l
default http://localhost:6800/
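
With the project deployed, a run of a spider can be scheduled and monitored through the API; the spider name quotes is an assumption here:

Schedule a run of the quotes spider
>curl http://localhost:6800/schedule.json -d project=tutorial -d spider=quotes

List pending, running and finished jobs for the project
>curl http://localhost:6800/listjobs.json?project=tutorial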

A possible cluster solution for the future
https://github.com/istresearch/scrapy-cluster

Try with Python 3.6 later

References:
https://github.com/gnemoug/distribute_crawler
https://www.douban.com/group/topic/38361104/
http://wiki.jikexueyuan.com/project/scrapy/item-pipeline.html
https://segmentfault.com/a/1190000009229896