# Using pipelines

This article covers how pipelines are written and enabled in a Scrapy project. Enabling them is done mainly through settings.py.

Commonly used pipeline methods:

process_item(self, item, spider): processes each item
open_spider(self, spider): runs exactly once, when the spider is opened
close_spider(self, spider): runs exactly once, when the spider is closed
import json
from pymongo import MongoClient

class xxxFilePipeline:
    def open_spider(self, spider):  # runs exactly once, when the spider is opened
        if spider.name == 'xxx':
            self.f = open('json.txt', 'a', encoding='utf-8')

    def close_spider(self, spider):  # runs exactly once, when the spider is closed
        if spider.name == 'xxx':
            self.f.close()

    def process_item(self, item, spider):
        if spider.name == 'xxx':
            # Convert the item to a dict, serialize it, and append it to the file
            self.f.write(json.dumps(dict(item), ensure_ascii=False, indent=2) + ',\n')
        return item  # without this return, the next pipeline (larger order value) does not receive the item




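class xxxMongoPipeline:
    def open_spider(self, spider):  # runs exactly once, when the spider is opened
        if spider.name == 'xxx':
            self.client = MongoClient(host='127.0.0.1', port=27017)  # instantiate the MongoClient
            # Handle to the teachers collection in the xxx database
            # (MongoDB creates both on the first insert)
            self.collection = self.client.xxx.teachers

    def close_spider(self, spider):  # runs exactly once, when the spider is closed
        if spider.name == 'xxx':
            self.client.close()

    def process_item(self, item, spider):
        if spider.name == 'xxx':
            # The item must be converted to a dict before it is inserted.
            # insert_one replaces Collection.insert, which was removed in PyMongo 4.
            self.collection.insert_one(dict(item))
        # without this return, the next pipeline (larger order value) does not receive the item
        return item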
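The file pipeline above appends pretty-printed objects separated by commas, so the resulting file is not a single valid JSON document. One possible alternative is the JSON Lines layout, with one object per line. The sketch below is only an illustration: the class name `JsonLinesFilePipeline`, the default file name `items.jl`, and the hand-written driver at the bottom are assumptions, not part of the original project. A `SimpleNamespace` stands in for the spider so the snippet runs without Scrapy.

```python
import json
import os
import tempfile
from types import SimpleNamespace


class JsonLinesFilePipeline:
    """Hypothetical variant of the file pipeline: one JSON object per line."""

    def __init__(self, path='items.jl'):
        # The default lets the class be instantiated with no arguments.
        self.path = path
        self.f = None

    def open_spider(self, spider):
        self.f = open(self.path, 'a', encoding='utf-8')

    def close_spider(self, spider):
        if self.f is not None:
            self.f.close()
            self.f = None

    def process_item(self, item, spider):
        # No indent: each item stays on a single line.
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # hand the item on to the next pipeline


# Drive the pipeline by hand, in the order the hooks are called.
spider = SimpleNamespace(name='xxx')  # stand-in for a real spider object
path = os.path.join(tempfile.mkdtemp(), 'items.jl')

pipeline = JsonLinesFilePipeline(path)
pipeline.open_spider(spider)
pipeline.process_item({'name': '张三', 'title': 'lecturer'}, spider)
pipeline.process_item({'name': 'Li', 'title': 'professor'}, spider)
pipeline.close_spider(spider)

# Every line can be parsed back independently.
with open(path, encoding='utf-8') as fh:
    rows = [json.loads(line) for line in fh]
print(rows)
```

Because each line is a complete JSON value, the file can be read back line by line without loading it whole.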

Enabling pipelines

Enable the pipelines in settings.py:

......
ITEM_PIPELINES = {
    # The path must match the class name defined in pipelines.py
    'myspider.pipelines.xxxFilePipeline': 400,  # 400 is the weight (order value)
    'myspider.pipelines.xxxMongoPipeline': 500,
}
......
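To make the effect of the weights concrete, the sketch below sorts an `ITEM_PIPELINES`-style dict by value, which gives the order in which items visit the pipelines (smaller values first). The pipeline paths and the helper `execution_order` are hypothetical and exist only for this illustration. The helper is not a Scrapy function.

```python
# Hypothetical pipeline paths, used only for illustration.
ITEM_PIPELINES = {
    'myspider.pipelines.SavePipeline': 500,
    'myspider.pipelines.CleanPipeline': 300,
}


def execution_order(pipelines):
    """Return the pipeline paths sorted by ascending order value."""
    return [path for path, value in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = execution_order(ITEM_PIPELINES)
print(order)
```

Here the cleaning pipeline (300) comes before the saving pipeline (500), regardless of the order in which the keys are written in the dict.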
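1. Different pipelines can handle data from different spiders. Use the spider.name attribute to tell the spiders apart.
2. Different pipelines can apply different processing steps to the data of one or more spiders, for example one pipeline cleans the data and another saves it.
3. A single pipeline class can also handle data from different spiders, again distinguished by spider.name.

4. In ITEM_PIPELINES in settings.py, each key is the location of a pipeline class (so the class can live anywhere in the project), and each value indicates its distance from the engine: the smaller the value, the closer the pipeline is and the earlier items pass through it.
5. When there are several pipelines, process_item must return item. Otherwise the next pipeline receives None.
6. Every pipeline must define process_item. Without it the pipeline cannot receive or process items.
7. process_item receives item and spider, where spider is the spider that yielded the current item.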
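The notes about `return item` and `spider.name` can be checked with a small simulation. This is a minimal sketch under stated assumptions: `run_chain` is a simplified stand-in for the way Scrapy passes each return value to the next pipeline, not Scrapy's actual implementation, and the three pipeline classes are hypothetical. A `SimpleNamespace` stands in for the spider.

```python
from types import SimpleNamespace


class StripNamePipeline:
    """Cleans items, but only for the spider named 'xxx'."""

    def process_item(self, item, spider):
        if spider.name == 'xxx':
            item['name'] = item['name'].strip()
        return item


class ForgetfulPipeline:
    """Deliberately buggy: modifies the item but forgets to return it."""

    def process_item(self, item, spider):
        item['seen'] = True


class CollectPipeline:
    """Records whatever it receives from the previous pipeline."""

    def __init__(self):
        self.received = []

    def process_item(self, item, spider):
        self.received.append(item)
        return item


def run_chain(pipelines, item, spider):
    """Simplified stand-in: feed each return value into the next pipeline."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


spider = SimpleNamespace(name='xxx')
other_spider = SimpleNamespace(name='other')

good = CollectPipeline()
run_chain([StripNamePipeline(), good], {'name': '  Li  '}, spider)

bad = CollectPipeline()
run_chain([ForgetfulPipeline(), bad], {'name': 'Li'}, spider)

# The same cleaning pipeline leaves items from another spider untouched.
untouched = run_chain([StripNamePipeline()], {'name': '  Li  '}, other_spider)

print(good.received)  # the cleaned item reached the collector
print(bad.received)   # the collector received None instead of the item
print(untouched)
```

The second chain shows why the missing return matters: the collector still runs, but what it receives is `None` instead of the item.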