Introduction
Scrapy's Scheduler is the component responsible for storing and scheduling Requests; its duties include request deduplication and priority handling.
1. BaseSchedulerMeta
class BaseSchedulerMeta(type):
    """
    Metaclass to check scheduler classes against the necessary interface
    """

    def __instancecheck__(cls, instance):
        return cls.__subclasscheck__(type(instance))

    def __subclasscheck__(cls, subclass):
        return (
            hasattr(subclass, "has_pending_requests") and callable(subclass.has_pending_requests)
            and hasattr(subclass, "enqueue_request") and callable(subclass.enqueue_request)
            and hasattr(subclass, "next_request") and callable(subclass.next_request)
        )
This metaclass customizes how isinstance and issubclass behave: instead of walking the inheritance chain, it treats any class that defines callable has_pending_requests, enqueue_request, and next_request methods as a scheduler.
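Because the check is purely structural, a class that merely provides the three methods passes the isinstance check without inheriting from BaseScheduler. A quick sketch (the DuckScheduler class is hypothetical, for illustration only):

from scrapy.core.scheduler import BaseScheduler

class DuckScheduler:  # note: does not inherit from BaseScheduler
    def has_pending_requests(self):
        return False

    def enqueue_request(self, request):
        return True

    def next_request(self):
        return None

print(isinstance(DuckScheduler(), BaseScheduler))  # True: only the three methods are checked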
2. BaseScheduler
from abc import abstractmethod
from typing import Optional

from twisted.internet.defer import Deferred

from scrapy import Request, Spider
from scrapy.crawler import Crawler

class BaseScheduler(metaclass=BaseSchedulerMeta):
    @classmethod
    def from_crawler(cls, crawler: Crawler):
        """
        Factory method which receives the current :class:`~scrapy.crawler.Crawler` object as argument.
        """
        return cls()

    def open(self, spider: Spider) -> Optional[Deferred]:
        """
        Called when the spider is opened by the engine. It receives the spider
        instance as argument and it's useful to execute initialization code.

        :param spider: the spider object for the current crawl
        :type spider: :class:`~scrapy.spiders.Spider`
        """
        pass

    def close(self, reason: str) -> Optional[Deferred]:
        """
        Called when the spider is closed by the engine. It receives the reason why the crawl
        finished as argument and it's useful to execute cleaning code.

        :param reason: a string which describes the reason why the spider was closed
        :type reason: :class:`str`
        """
        pass

    @abstractmethod
    def has_pending_requests(self) -> bool:
        """
        ``True`` if the scheduler has enqueued requests, ``False`` otherwise
        """
        raise NotImplementedError()

    @abstractmethod
    def enqueue_request(self, request: Request) -> bool:
        """
        Process a request received by the engine.

        Return ``True`` if the request is stored correctly, ``False`` otherwise.

        If ``False``, the engine will fire a ``request_dropped`` signal, and
        will not make further attempts to schedule the request at a later time.
        For reference, the default Scrapy scheduler returns ``False`` when the
        request is rejected by the dupefilter.
        """
        raise NotImplementedError()

    @abstractmethod
    def next_request(self) -> Optional[Request]:
        """
        Return the next :class:`~scrapy.http.Request` to be processed, or ``None``
        to indicate that there are no requests to be considered ready at the moment.

        Returning ``None`` implies that no request from the scheduler will be sent
        to the downloader in the current reactor cycle. The engine will continue
        calling ``next_request`` until ``has_pending_requests`` is ``False``.
        """
        raise NotImplementedError()
BaseScheduler is the scheduler base class; it defines the following methods (a minimal implementation sketch follows the list):
from_crawler: factory method used to create the scheduler instance
open: does not have to be overridden; called when the spider is opened
close: does not have to be overridden; called when the spider is closed
has_pending_requests: must be overridden; reports whether the scheduler still has pending requests
enqueue_request: must be overridden; adds a request to the scheduler and returns whether it was stored successfully
next_request: must be overridden; returns the next request from the scheduler, or None when nothing is ready
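As a rough illustration of this interface, here is a minimal in-memory scheduler sketch (a simplified example with no dupefilter, persistence, or priority handling, unlike Scrapy's default scheduler):

from collections import deque
from typing import Optional

from scrapy import Request
from scrapy.core.scheduler import BaseScheduler

class SimpleScheduler(BaseScheduler):
    def __init__(self):
        self.queue = deque()  # plain FIFO queue of pending requests

    def has_pending_requests(self) -> bool:
        return len(self.queue) > 0

    def enqueue_request(self, request: Request) -> bool:
        self.queue.append(request)
        return True  # always stored; the default scheduler returns False for duplicates

    def next_request(self) -> Optional[Request]:
        return self.queue.popleft() if self.queue else None

Such a class could be enabled through the SCHEDULER setting, e.g. SCHEDULER = "myproject.schedulers.SimpleScheduler" (module path hypothetical).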
3. ScrapyPriorityQueue
This is the request priority queue. Its main job is to maintain a separate internal queue for each priority value and to route push and pop operations to the appropriate one.
class ScrapyPriorityQueue:
    """A priority queue implemented using multiple internal queues (typically,
    FIFO queues). It uses one internal queue for each priority value. The internal
    queue must implement the following methods:

        * push(obj)
        * pop()
        * close()
        * __len__()

    Optionally, the queue could provide a ``peek`` method, that should return the
    next object to be returned by ``pop``, but without removing it from the queue.

    ``__init__`` method of ScrapyPriorityQueue receives a downstream_queue_cls
    argument, which is a class used to instantiate a new (internal) queue when
    a new priority is allocated.

    Only integer priorities should be used. Lower numbers are higher
    priorities.
    """
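A rough sketch of the bucketing idea described in the docstring, assuming plain FIFO deques as the internal queues (a simplified illustration, not Scrapy's actual implementation):

from collections import deque

class TinyPriorityQueue:  # hypothetical, simplified illustration
    def __init__(self, downstream_queue_cls=deque):
        self.downstream_queue_cls = downstream_queue_cls
        self.queues = {}  # maps each priority value to its own internal queue

    def push(self, obj, priority=0):
        if priority not in self.queues:
            # allocate a new internal queue the first time a priority appears
            self.queues[priority] = self.downstream_queue_cls()
        self.queues[priority].append(obj)

    def pop(self):
        if not self.queues:
            return None
        prio = min(self.queues)  # lower numbers are higher priorities
        obj = self.queues[prio].popleft()
        if not self.queues[prio]:
            del self.queues[prio]  # drop internal queues once they empty out
        return obj

    def __len__(self):
        return sum(len(q) for q in self.queues.values())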