Using pipelines in Scrapy

RUN IT

License: CC 4.0 BY-SA

Category: Python web crawlers

Article link: https://blog.youkuaiyun.com/RUN__IT/article/details/100114528

This article covers how to write Scrapy item pipelines and how to enable them, which is done mainly in settings.py.


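Commonly used pipeline methods:

process_item(self, item, spider): processes each item yielded by the spider
open_spider(self, spider): runs exactly once, when the spider is opened
close_spider(self, spider): runs exactly once, when the spider is closed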

import json
from pymongo import MongoClient

class xxxFilePipeline(object):
    def open_spider(self, spider):  # runs once, when the spider is opened
        if spider.name == 'xxx':
            self.f = open('json.txt', 'a', encoding='utf-8')

    def close_spider(self, spider):  # runs once, when the spider is closed
        if spider.name == 'xxx':
            self.f.close()

    def process_item(self, item, spider):
        if spider.name == 'xxx':
            self.f.write(json.dumps(dict(item), ensure_ascii=False, indent=2) + ',\n')
        # Without this return, the next pipeline (the one with a larger order value) receives None instead of the item
        return item




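class xxxMongoPipeline(object):
    def open_spider(self, spider):  # runs once, when the spider is opened
        if spider.name == 'xxx':
            self.client = MongoClient(host='127.0.0.1', port=27017)  # connect to the local MongoDB server
            self.collection = self.client.xxx.teachers  # handle to the "teachers" collection in the "xxx" database

    def close_spider(self, spider):  # runs once, when the spider is closed
        if spider.name == 'xxx':
            self.client.close()  # release the connection

    def process_item(self, item, spider):
        if spider.name == 'xxx':
            # The item must be converted to a dict before it is inserted.
            # insert_one replaces Collection.insert, which was removed in PyMongo 4.
            self.collection.insert_one(dict(item))
        # Without this return, the next pipeline (the one with a larger order value) receives None instead of the item
        return item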
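
To see the three hooks fire in order without running a full crawl, the sketch below drives a pipeline by hand. It only imitates the calling order described above and is not how Scrapy invokes pipelines internally. FakeSpider, JsonLinesPipeline and run_pipeline are hypothetical names made up for this illustration, and only the standard library is used.

```python
import json
import os
import tempfile


class FakeSpider:
    """Hypothetical stand-in for a Scrapy spider; only .name is used."""

    def __init__(self, name):
        self.name = name


class JsonLinesPipeline:
    """Illustrative pipeline that writes one JSON object per line."""

    def __init__(self, path):
        self.path = path
        self.f = None

    def open_spider(self, spider):
        # Called once, before any item is processed.
        self.f = open(self.path, 'a', encoding='utf-8')

    def close_spider(self, spider):
        # Called once, after the last item.
        self.f.close()

    def process_item(self, item, spider):
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # hand the item on to the next pipeline


def run_pipeline(pipeline, spider, items):
    """Call the hooks in the order open -> process (per item) -> close."""
    pipeline.open_spider(spider)
    try:
        return [pipeline.process_item(item, spider) for item in items]
    finally:
        pipeline.close_spider(spider)


path = os.path.join(tempfile.mkdtemp(), 'items.jl')
results = run_pipeline(JsonLinesPipeline(path), FakeSpider('xxx'),
                       [{'name': 'a'}, {'name': 'b'}])

# Read the file back: every line is an independent JSON document.
with open(path, encoding='utf-8') as fh:
    lines = [json.loads(line) for line in fh]
```

Writing one object per line (the JSON Lines layout) is a design choice of this sketch, not of the original code: unlike output written with indent=2 and a trailing comma, each line can be parsed on its own.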

Enabling pipelines

Pipelines are enabled in settings.py:

......
ITEM_PIPELINES = {
    'myspider.pipelines.xxxFilePipeline': 400,  # key: import path of the pipeline class; value: order, lower values run first
    'myspider.pipelines.xxxMongoPipeline': 500,
}
......
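
As a rough illustration of how the values decide the order, the sketch below sorts a settings dictionary by value. It only mimics the ordering rule (lower values run first) and is not Scrapy's own loading code. The class paths are hypothetical placeholders.

```python
# Hypothetical pipeline paths, listed deliberately out of order.
ITEM_PIPELINES = {
    'myproject.pipelines.SavePipeline': 800,
    'myproject.pipelines.CleanPipeline': 300,
    'myproject.pipelines.DedupPipeline': 500,
}

# Items pass through the pipelines in ascending order of their value.
execution_order = [
    path
    for path, order in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]

for position, path in enumerate(execution_order, start=1):
    print(position, path)
```

Leaving gaps between the values (300, 500, 800) makes it easy to slot another pipeline in between later without renumbering the rest.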

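1. Different pipelines can handle data from different spiders; use the spider.name attribute to tell them apart.
2. Different pipelines can apply different processing to the data of one or more spiders, for example one pipeline for data cleaning and another for saving the data.
3. A single pipeline class can also handle data from several spiders, again distinguished by spider.name.
4. In settings, the key of each pipeline entry is its location (the import path, so a pipeline can live anywhere in the project), and the value is its distance from the engine: the smaller the value, the closer the pipeline is and the earlier the data passes through it.
5. When there are several pipelines, process_item must return item; otherwise the next pipeline receives None.
6. Every pipeline must define process_item; without it the pipeline cannot receive or process items.
7. process_item takes item and spider, where spider is the spider that yielded the item.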
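
The points about spider.name and about returning the item can be tried out with a small simulation. This is a minimal sketch, assuming a simplified chain in which each pipeline's return value is passed to the next one. StripPipeline, ForgetfulPipeline, CollectPipeline and run_chain are hypothetical names, and SimpleNamespace stands in for a real spider object.

```python
from types import SimpleNamespace


class StripPipeline:
    """Cleans items from the spider named 'xxx'; passes others through."""

    def process_item(self, item, spider):
        if spider.name == 'xxx':
            item = {key: value.strip() for key, value in item.items()}
        return item


class ForgetfulPipeline:
    """Deliberately omits `return item` to show the failure mode."""

    def process_item(self, item, spider):
        item['seen'] = True  # the function ends here, so it returns None


class CollectPipeline:
    """Records whatever it is handed, standing in for a saving step."""

    def __init__(self):
        self.received = []

    def process_item(self, item, spider):
        self.received.append(item)
        return item


def run_chain(pipelines, item, spider):
    """Feed each pipeline's return value to the next pipeline."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


xxx_spider = SimpleNamespace(name='xxx')
other_spider = SimpleNamespace(name='other')

# Cleaning applies only to the spider named 'xxx'.
good = CollectPipeline()
run_chain([StripPipeline(), good], {'title': '  hello '}, xxx_spider)
run_chain([StripPipeline(), good], {'title': '  hello '}, other_spider)

# A missing return means the next pipeline is handed None.
bad = CollectPipeline()
run_chain([ForgetfulPipeline(), bad], {'title': 'hello'}, xxx_spider)
```

After this runs, good.received holds the stripped item followed by the untouched one, while bad.received holds a single None.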