Introduction
Scrapy's Scheduler is the component responsible for storing and scheduling Requests; its duties include request deduplication and priority handling.
1. BaseSchedulerMeta
class BaseSchedulerMeta(type):
    """
    Metaclass to check scheduler classes against the necessary interface
    """

    def __instancecheck__(cls, instance):
        return cls.__subclasscheck__(type(instance))

    def __subclasscheck__(cls, subclass):
        return (
            hasattr(subclass, "has_pending_requests") and callable(subclass.has_pending_requests)
            and hasattr(subclass, "enqueue_request") and callable(subclass.enqueue_request)
            and hasattr(subclass, "next_request") and callable(subclass.next_request)
        )
This metaclass customizes how isinstance and issubclass behave: instead of walking the inheritance chain, it treats any class that defines callable has_pending_requests, enqueue_request, and next_request methods as a scheduler.
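Because the check is purely structural, a class that merely provides the three methods passes the isinstance check without inheriting from BaseScheduler. A quick sketch (the DuckScheduler class is hypothetical, for illustration only):

from scrapy.core.scheduler import BaseScheduler

class DuckScheduler:  # note: does not inherit from BaseScheduler
    def has_pending_requests(self):
        return False

    def enqueue_request(self, request):
        return True

    def next_request(self):
        return None

print(isinstance(DuckScheduler(), BaseScheduler))  # True: only the three methods are checked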
2. BaseScheduler
from abc import abstractmethod
from typing import Optional

from twisted.internet.defer import Deferred

from scrapy import Request, Spider
from scrapy.crawler import Crawler

class BaseScheduler(metaclass=BaseSchedulerMeta):
    @classmethod
    def from_crawler(cls, crawler: Crawler):
        """
        Factory method which receives the current :class:`~scrapy.crawler.Crawler` object as argument.
        """
        return cls()

    def open(self, spider: Spider) -> Optional[Deferred]:
        """
        Called when the spider is opened by the engine. It receives the spider
        instance as argument and it's useful to execute initialization code.

        :param spider: the spider object for the current crawl
        :type spider: :class:`~scrapy.spiders.Spider`
        """
        pass

    def close(self, reason: str) -> Optional[Deferred]:
        """
        Called when the spider is closed by the engine. It receives the reason why the crawl
        finished as argument and it's useful to execute cleaning code.

        :param reason: a string which describes the reason why the spider was closed
        :type reason: :class:`str`
        """
        pass

    @abstractmethod
    def has_pending_requests(self) -> bool:
        """
        ``True`` if the scheduler has enqueued requests, ``False`` otherwise
        """
        raise NotImplementedError()

    @abstractmethod
    def enqueue_request(self, request: Request) -> bool:
        """
        Process a request received by the engine.

        Return ``True`` if the request is stored correctly, ``False`` otherwise.

        If ``False``, the engine will fire a ``request_dropped`` signal, and
        will not make further attempts to schedule the request at a later time.
        For reference, the default Scrapy scheduler returns ``False`` when the
        request is rejected by the dupefilter.
        """
        raise NotImplementedError()

    @abstractmethod
    def next_request(self) -> Optional[Request]:
        """
        Return the next :class:`~scrapy.http.Request` to be processed, or ``None``
        to indicate that there are no requests to be considered ready at the moment.

        Returning ``None`` implies that no request from the scheduler will be sent
        to the downloader in the current reactor cycle. The engine will continue
        calling ``next_request`` until ``has_pending_requests`` is ``False``.
        """
        raise NotImplementedError()
BaseScheduler is the scheduler base class; it defines the following methods (a minimal implementation sketch follows the list):
from_crawler: factory method used to create the scheduler instance
open: does not have to be overridden; called when the spider is opened
close: does not have to be overridden; called when the spider is closed
has_pending_requests: must be overridden; reports whether the scheduler still has pending requests
enqueue_request: must be overridden; adds a request to the scheduler and returns whether it was stored successfully
next_request: must be overridden; returns the next request from the scheduler, or None when nothing is ready
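As a rough illustration of this interface, here is a minimal in-memory scheduler sketch (a simplified example with no dupefilter, persistence, or priority handling, unlike Scrapy's default scheduler):

from collections import deque
from typing import Optional

from scrapy import Request
from scrapy.core.scheduler import BaseScheduler

class SimpleScheduler(BaseScheduler):
    def __init__(self):
        self.queue = deque()  # plain FIFO queue of pending requests

    def has_pending_requests(self) -> bool:
        return len(self.queue) > 0

    def enqueue_request(self, request: Request) -> bool:
        self.queue.append(request)
        return True  # always stored; the default scheduler returns False for duplicates

    def next_request(self) -> Optional[Request]:
        return self.queue.popleft() if self.queue else None

Such a class could be enabled through the SCHEDULER setting, e.g. SCHEDULER = "myproject.schedulers.SimpleScheduler" (module path hypothetical).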
3. ScrapyPriorityQueue
This is the request priority queue. Its main job is to maintain a separate internal queue for each priority value and to route push and pop operations to the appropriate one.
class ScrapyPriorityQueue:
    """A priority queue implemented using multiple internal queues (typically,
    FIFO queues). It uses one internal queue for each priority value. The internal
    queue must implement the following methods:

        * push(obj)
        * pop()
        * close()
        * __len__()

    Optionally, the queue could provide a ``peek`` method, that should return the
    next object to be returned by ``pop``, but without removing it from the queue.

    ``__init__`` method of ScrapyPriorityQueue receives a downstream_queue_cls
    argument, which is a class used to instantiate a new (internal) queue when
    a new priority is allocated.

    Only integer priorities should be used. Lower numbers are higher
    priorities.
    """
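A rough sketch of the bucketing idea described in the docstring, assuming plain FIFO deques as the internal queues (a simplified illustration, not Scrapy's actual implementation):

from collections import deque

class TinyPriorityQueue:  # hypothetical, simplified illustration
    def __init__(self, downstream_queue_cls=deque):
        self.downstream_queue_cls = downstream_queue_cls
        self.queues = {}  # maps each priority value to its own internal queue

    def push(self, obj, priority=0):
        if priority not in self.queues:
            # allocate a new internal queue the first time a priority appears
            self.queues[priority] = self.downstream_queue_cls()
        self.queues[priority].append(obj)

    def pop(self):
        if not self.queues:
            return None
        prio = min(self.queues)  # lower numbers are higher priorities
        obj = self.queues[prio].popleft()
        if not self.queues[prio]:
            del self.queues[prio]  # drop internal queues once they empty out
        return obj

    def __len__(self):
        return sum(len(q) for q in self.queues.values())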