Scrapy Source Code Analysis 6: Scrapy's Scheduler

This article introduces Scrapy's scheduler component: its basic concepts, the functionality and implementation details of the core classes BaseSchedulerMeta and BaseScheduler, and how ScrapyPriorityQueue works. It also takes a closer look at Scrapy's default Scheduler class and explains how it manages and schedules requests through in-memory and on-disk queues.

Introduction

Scrapy's Scheduler is the component responsible for storing and scheduling Request objects; its responsibilities include request deduplication and priority handling.


1. BaseSchedulerMeta

class BaseSchedulerMeta(type):
    """
    Metaclass to check scheduler classes against the necessary interface
    """
    def __instancecheck__(cls, instance):
        return cls.__subclasscheck__(type(instance))

    def __subclasscheck__(cls, subclass):
        return (
            hasattr(subclass, "has_pending_requests") and callable(subclass.has_pending_requests)
            and hasattr(subclass, "enqueue_request") and callable(subclass.enqueue_request)
            and hasattr(subclass, "next_request") and callable(subclass.next_request)
        )

This defines a metaclass that customizes the behavior of isinstance and issubclass: instead of checking inheritance, it checks whether the class in question implements the three required methods (has_pending_requests, enqueue_request and next_request). In effect, any class providing this interface counts as a scheduler.
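
For instance, a class that merely provides the three methods passes both checks, even without inheriting from BaseScheduler. Below is a minimal sketch; the DuckScheduler name is made up, and the import path assumes Scrapy 2.x, where BaseScheduler is defined in scrapy.core.scheduler:

from scrapy.core.scheduler import BaseScheduler

class DuckScheduler:
    """Implements the scheduler interface without subclassing BaseScheduler."""

    def has_pending_requests(self):
        return False

    def enqueue_request(self, request):
        return True

    def next_request(self):
        return None

# Both checks pass: BaseSchedulerMeta only looks for the three callables.
assert issubclass(DuckScheduler, BaseScheduler)
assert isinstance(DuckScheduler(), BaseScheduler)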


2. BaseScheduler

from abc import abstractmethod
from typing import Optional

from twisted.internet.defer import Deferred

from scrapy import Request, Spider
from scrapy.crawler import Crawler


class BaseScheduler(metaclass=BaseSchedulerMeta):

    @classmethod
    def from_crawler(cls, crawler: Crawler):
        """
        Factory method which receives the current :class:`~scrapy.crawler.Crawler` object as argument.
        """
        return cls()

    def open(self, spider: Spider) -> Optional[Deferred]:
        """
        Called when the spider is opened by the engine. It receives the spider
        instance as argument and it's useful to execute initialization code.

        :param spider: the spider object for the current crawl
        :type spider: :class:`~scrapy.spiders.Spider`
        """
        pass

    def close(self, reason: str) -> Optional[Deferred]:
        """
        Called when the spider is closed by the engine. It receives the reason why the crawl
        finished as argument and it's useful to execute cleaning code.

        :param reason: a string which describes the reason why the spider was closed
        :type reason: :class:`str`
        """
        pass

    @abstractmethod
    def has_pending_requests(self) -> bool:
        """
        ``True`` if the scheduler has enqueued requests, ``False`` otherwise
        """
        raise NotImplementedError()

    @abstractmethod
    def enqueue_request(self, request: Request) -> bool:
        """
        Process a request received by the engine.

        Return ``True`` if the request is stored correctly, ``False`` otherwise.

        If ``False``, the engine will fire a ``request_dropped`` signal, and
        will not make further attempts to schedule the request at a later time.
        For reference, the default Scrapy scheduler returns ``False`` when the
        request is rejected by the dupefilter.
        """
        raise NotImplementedError()

    @abstractmethod
    def next_request(self) -> Optional[Request]:
        """
        Return the next :class:`~scrapy.http.Request` to be processed, or ``None``
        to indicate that there are no requests to be considered ready at the moment.

        Returning ``None`` implies that no request from the scheduler will be sent
        to the downloader in the current reactor cycle. The engine will continue
        calling ``next_request`` until ``has_pending_requests`` is ``False``.
        """
        raise NotImplementedError()

This is the base class for schedulers. It defines the following methods:
from_crawler: factory method used to create a scheduler instance
open: optional to override; called when the scheduler is opened
close: optional to override; called when the scheduler is closed
has_pending_requests: must be overridden; returns whether the scheduler still has pending requests
enqueue_request: must be overridden; adds a request to the scheduler and returns whether it was stored successfully
next_request: must be overridden; pops the next request from the scheduler, or returns None if there is none
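
To make the interface concrete, here is a minimal in-memory scheduler built on BaseScheduler. This is only a sketch: the SimpleScheduler name is made up, and it deliberately skips deduplication, priorities and disk persistence:

from collections import deque
from typing import Optional

from scrapy import Request
from scrapy.core.scheduler import BaseScheduler


class SimpleScheduler(BaseScheduler):
    """A bare-bones FIFO scheduler: no dupefilter, no priorities, no disk queue."""

    def __init__(self):
        self._queue = deque()

    def has_pending_requests(self) -> bool:
        return len(self._queue) > 0

    def enqueue_request(self, request: Request) -> bool:
        self._queue.append(request)
        return True  # never rejects a request, so request_dropped is never fired

    def next_request(self) -> Optional[Request]:
        return self._queue.popleft() if self._queue else None

Such a class is wired in through the SCHEDULER setting (for example SCHEDULER = "myproject.schedulers.SimpleScheduler", a hypothetical module path); the engine only requires the interface checked by BaseSchedulerMeta.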


3. ScrapyPriorityQueue

This is the request priority queue. Its main job is to maintain a separate internal queue for each priority value and to route push and pop operations to the appropriate one.

class ScrapyPriorityQueue:
    """A priority queue implemented using multiple internal queues (typically,
    FIFO queues). It uses one internal queue for each priority value. The internal
    queue must implement the following methods:

        * push(obj)
        * pop()
        * close()
        * __len__()

    Optionally, the queue could provide a ``peek`` method, that should return the
    next object to be returned by ``pop``, but without removing it from the queue.

    ``__init__`` method of ScrapyPriorityQueue receives a downstream_queue_cls
    argument, which is a class used to instantiate a new (internal) queue when
    a new priority is allocated.

    Only integer priorities should be used. Lower numbers are higher
    priorities.
    """
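
The bucket-per-priority idea is easier to see in a stripped-down, standalone sketch. The following TinyPriorityQueue is an illustration only, not the actual ScrapyPriorityQueue (which additionally supports pluggable downstream queue classes and persistence):

from collections import deque


class TinyPriorityQueue:
    """One FIFO deque per integer priority; lower numbers are higher priorities."""

    def __init__(self):
        self.queues = {}     # priority -> FIFO deque
        self.curprio = None  # smallest (i.e. highest) priority currently present

    def push(self, obj, priority=0):
        if priority not in self.queues:
            self.queues[priority] = deque()
        self.queues[priority].append(obj)
        if self.curprio is None or priority < self.curprio:
            self.curprio = priority

    def pop(self):
        if self.curprio is None:
            return None
        queue = self.queues[self.curprio]
        obj = queue.popleft()
        if not queue:  # bucket drained: remove it and recompute the current priority
            del self.queues[self.curprio]
            self.curprio = min(self.queues) if self.queues else None
        return obj

    def __len__(self):
        return sum(len(queue) for queue in self.queues.values())


q = TinyPriorityQueue()
q.push("low", 10)
q.push("high", -1)
assert q.pop() == "high"  # -1 wins: lower number means higher priority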