scrapy.core.scheduler.Scheduler

class documentation

class Scheduler(BaseScheduler): (source)

Default Scrapy scheduler. This implementation also handles duplication filtering via the :setting:`dupefilter <DUPEFILTER_CLASS>`. This scheduler stores requests into several priority queues (defined by the :setting:`SCHEDULER_PRIORITY_QUEUE` setting). In turn, said priority queues are backed by either memory or disk based queues (respectively defined by the :setting:`SCHEDULER_MEMORY_QUEUE` and :setting:`SCHEDULER_DISK_QUEUE` settings). Request prioritization is almost entirely delegated to the priority queue. The only prioritization performed by this scheduler is using the disk-based queue if present (i.e. if the :setting:`JOBDIR` setting is defined) and falling back to the memory-based queue if a serialization error occurs. If the disk queue is not present, the memory one is used directly. :param dupefilter: An object responsible for checking and filtering duplicate requests. The value for the :setting:`DUPEFILTER_CLASS` setting is used by default. :type dupefilter: :class:`scrapy.dupefilters.BaseDupeFilter` instance or similar: any class that implements the `BaseDupeFilter` interface :param jobdir: The path of a directory to be used for persisting the crawl's state. The value for the :setting:`JOBDIR` setting is used by default. See :ref:`topics-jobs`. :type jobdir: :class:`str` or ``None`` :param dqclass: A class to be used as persistent request queue. The value for the :setting:`SCHEDULER_DISK_QUEUE` setting is used by default. :type dqclass: class :param mqclass: A class to be used as non-persistent request queue. The value for the :setting:`SCHEDULER_MEMORY_QUEUE` setting is used by default. :type mqclass: class :param logunser: A boolean that indicates whether or not unserializable requests should be logged. The value for the :setting:`SCHEDULER_DEBUG` setting is used by default. :type logunser: bool :param stats: A stats collector object to record stats about the request scheduling process. The value for the :setting:`STATS_CLASS` setting is used by default. :type stats: :class:`scrapy.statscollectors.StatsCollector` instance or similar: any class that implements the `StatsCollector` interface :param pqclass: A class to be used as priority queue for requests. The value for the :setting:`SCHEDULER_PRIORITY_QUEUE` setting is used by default. :type pqclass: class :param crawler: The crawler object corresponding to the current crawl. :type crawler: :class:`scrapy.crawler.Crawler`

Class Method	`from_crawler`	Factory method, initializes the scheduler with arguments taken from the crawl settings
Method	`__init__`	Undocumented
Method	`__len__`	Return the total amount of enqueued requests
Method	`close`	(1) dump pending requests to disk if there is a disk queue (2) return the result of the dupefilter's ``close`` method
Method	`enqueue_request`	Unless the received request is filtered out by the Dupefilter, attempt to push it into the disk queue, falling back to pushing it into the memory queue.
Method	`has_pending_requests`	``True`` if the scheduler has enqueued requests, ``False`` otherwise
Method	`next_request`	Return a :class:`~scrapy.http.Request` object from the memory queue, falling back to the disk queue if the memory queue is empty. Return ``None`` if there are no more enqueued requests.
Method	`open`	(1) initialize the memory queue (2) initialize the disk queue if the ``jobdir`` attribute is a valid directory (3) return the result of the dupefilter's ``open`` method
Instance Variable	`crawler`	Undocumented
Instance Variable	`df`	Undocumented
Instance Variable	`dqclass`	Undocumented
Instance Variable	`dqdir`	Undocumented
Instance Variable	`dqs`	Undocumented
Instance Variable	`logunser`	Undocumented
Instance Variable	`mqclass`	Undocumented
Instance Variable	`mqs`	Undocumented
Instance Variable	`pqclass`	Undocumented
Instance Variable	`spider`	Undocumented
Instance Variable	`stats`	Undocumented
Method	`_dq`	Create a new priority queue instance, with disk storage
Method	`_dqdir`	Return a folder name to keep disk queue state at
Method	`_dqpop`	Undocumented
Method	`_dqpush`	Undocumented
Method	`_mq`	Create a new priority queue instance, with in-memory storage
Method	`_mqpush`	Undocumented
Method	`_read_dqs_state`	Undocumented
Method	`_write_dqs_state`	Undocumented

@classmethod
def from_crawler(cls: Type[SchedulerTV], crawler) -> SchedulerTV: (source) ¶

overrides scrapy.core.scheduler.BaseScheduler.from_crawler

Factory method, initializes the scheduler with arguments taken from the crawl settings

def __init__(self, dupefilter, jobdir: Optional[str] = None, dqclass=None, mqclass=None, logunser: bool = False, stats=None, pqclass=None, crawler: Optional[Crawler] = None): (source) ¶

Undocumented

def __len__(self) -> int: (source) ¶

Return the total amount of enqueued requests

def close(self, reason: str) -> Optional[Deferred]: (source) ¶

overrides scrapy.core.scheduler.BaseScheduler.close

(1) dump pending requests to disk if there is a disk queue (2) return the result of the dupefilter's ``close`` method

def enqueue_request(self, request: Request) -> bool: (source) ¶

overrides scrapy.core.scheduler.BaseScheduler.enqueue_request

Unless the received request is filtered out by the Dupefilter, attempt to push it into the disk queue, falling back to pushing it into the memory queue. Increment the appropriate stats, such as: ``scheduler/enqueued``, ``scheduler/enqueued/disk``, ``scheduler/enqueued/memory``. Return ``True`` if the request was stored successfully, ``False`` otherwise.

def has_pending_requests(self) -> bool: (source) ¶

overrides scrapy.core.scheduler.BaseScheduler.has_pending_requests

``True`` if the scheduler has enqueued requests, ``False`` otherwise

def next_request(self) -> Optional[Request]: (source) ¶

overrides scrapy.core.scheduler.BaseScheduler.next_request

Return a :class:`~scrapy.http.Request` object from the memory queue, falling back to the disk queue if the memory queue is empty. Return ``None`` if there are no more enqueued requests. Increment the appropriate stats, such as: ``scheduler/dequeued``, ``scheduler/dequeued/disk``, ``scheduler/dequeued/memory``.

def open(self, spider: Spider) -> Optional[Deferred]: (source) ¶

overrides scrapy.core.scheduler.BaseScheduler.open

(1) initialize the memory queue (2) initialize the disk queue if the ``jobdir`` attribute is a valid directory (3) return the result of the dupefilter's ``open`` method

crawler = (source) ¶

Undocumented

df = (source) ¶

Undocumented

dqclass = (source) ¶

Undocumented

dqdir = (source) ¶

Undocumented

dqs = (source) ¶

Undocumented

logunser = (source) ¶

Undocumented