class documentation

Default Scrapy scheduler. This implementation also handles duplication filtering via the :setting:`dupefilter <DUPEFILTER_CLASS>`.

This scheduler stores requests into several priority queues (defined by the :setting:`SCHEDULER_PRIORITY_QUEUE` setting). In turn, said priority queues are backed by either memory or disk based queues (respectively defined by the :setting:`SCHEDULER_MEMORY_QUEUE` and :setting:`SCHEDULER_DISK_QUEUE` settings).

Request prioritization is almost entirely delegated to the priority queue. The only prioritization performed by this scheduler is using the disk-based queue if present (i.e. if the :setting:`JOBDIR` setting is defined) and falling back to the memory-based queue if a serialization error occurs. If the disk queue is not present, the memory one is used directly.

:param dupefilter: An object responsible for checking and filtering duplicate requests. The value for the :setting:`DUPEFILTER_CLASS` setting is used by default.
:type dupefilter: :class:`scrapy.dupefilters.BaseDupeFilter` instance or similar: any class that implements the `BaseDupeFilter` interface

:param jobdir: The path of a directory to be used for persisting the crawl's state. The value for the :setting:`JOBDIR` setting is used by default. See :ref:`topics-jobs`.
:type jobdir: :class:`str` or ``None``

:param dqclass: A class to be used as persistent request queue. The value for the :setting:`SCHEDULER_DISK_QUEUE` setting is used by default.
:type dqclass: class

:param mqclass: A class to be used as non-persistent request queue. The value for the :setting:`SCHEDULER_MEMORY_QUEUE` setting is used by default.
:type mqclass: class

:param logunser: A boolean that indicates whether or not unserializable requests should be logged. The value for the :setting:`SCHEDULER_DEBUG` setting is used by default.
:type logunser: bool

:param stats: A stats collector object to record stats about the request scheduling process. The value for the :setting:`STATS_CLASS` setting is used by default.
:type stats: :class:`scrapy.statscollectors.StatsCollector` instance or similar: any class that implements the `StatsCollector` interface

:param pqclass: A class to be used as priority queue for requests. The value for the :setting:`SCHEDULER_PRIORITY_QUEUE` setting is used by default.
:type pqclass: class

:param crawler: The crawler object corresponding to the current crawl.
:type crawler: :class:`scrapy.crawler.Crawler`
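The settings named above are usually set in a project's ``settings.py``. The fragment below is a sketch of how they might look; the dotted paths are the values believed to be Scrapy's defaults, so check them against the settings reference for your version. The ``JOBDIR`` path is a made-up example.

```python
# settings.py fragment (sketch; class paths are assumed defaults, verify for your version)

# Priority queue that the scheduler stores requests in.
SCHEDULER_PRIORITY_QUEUE = "scrapy.pqueues.ScrapyPriorityQueue"

# Queues that back the priority queue.
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.LifoMemoryQueue"
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleLifoDiskQueue"

# Duplicate filter consulted before a request is enqueued.
DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"

# Log requests that cannot be serialized to the disk queue (maps to ``logunser``).
SCHEDULER_DEBUG = True

# Defining JOBDIR is what enables the disk queue; this path is hypothetical.
JOBDIR = "crawls/example-run"
```

Leaving ``JOBDIR`` undefined keeps every request in the memory queue.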

Class Method from_crawler Factory method, initializes the scheduler with arguments taken from the crawl settings
Method __init__ Undocumented
Method __len__ Return the total amount of enqueued requests
Method close (1) dump pending requests to disk if there is a disk queue (2) return the result of the dupefilter's ``close`` method
Method enqueue_request Unless the received request is filtered out by the Dupefilter, attempt to push it into the disk queue, falling back to pushing it into the memory queue.
Method has_pending_requests ``True`` if the scheduler has enqueued requests, ``False`` otherwise
Method next_request Return a :class:`~scrapy.http.Request` object from the memory queue, falling back to the disk queue if the memory queue is empty. Return ``None`` if there are no more enqueued requests.
Method open (1) initialize the memory queue (2) initialize the disk queue if the ``jobdir`` attribute is a valid directory (3) return the result of the dupefilter's ``open`` method
Instance Variable crawler Undocumented
Instance Variable df Undocumented
Instance Variable dqclass Undocumented
Instance Variable dqdir Undocumented
Instance Variable dqs Undocumented
Instance Variable logunser Undocumented
Instance Variable mqclass Undocumented
Instance Variable mqs Undocumented
Instance Variable pqclass Undocumented
Instance Variable spider Undocumented
Instance Variable stats Undocumented
Method _dq Create a new priority queue instance, with disk storage
Method _dqdir Return a folder name to keep disk queue state at
Method _dqpop Undocumented
Method _dqpush Undocumented
Method _mq Create a new priority queue instance, with in-memory storage
Method _mqpush Undocumented
Method _read_dqs_state Undocumented
Method _write_dqs_state Undocumented
@classmethod
def from_crawler(cls: Type[SchedulerTV], crawler) -> SchedulerTV:

Factory method, initializes the scheduler with arguments taken from the crawl settings
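The factory pattern described here can be sketched without Scrapy installed. Everything below is a stand-in: ``ToyScheduler``, the ``load_object`` helper and the fake crawler are hypothetical names, and a standard library class takes the place of a real queue class. Only the shape of the pattern, reading dotted paths and flags from the crawl settings and passing them to the constructor, is the point.

```python
from importlib import import_module
from types import SimpleNamespace


def load_object(path):
    """Import and return the object named by a dotted path (hypothetical helper)."""
    module_path, _, name = path.rpartition(".")
    return getattr(import_module(module_path), name)


class ToyScheduler:
    """Toy stand-in; accepts only a subset of the documented parameters."""

    def __init__(self, jobdir=None, mqclass=None, logunser=False, stats=None, crawler=None):
        self.jobdir = jobdir
        self.mqclass = mqclass
        self.logunser = logunser
        self.stats = stats
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        # Pull every constructor argument from the crawl settings.
        settings = crawler.settings
        return cls(
            jobdir=settings.get("JOBDIR"),
            mqclass=load_object(settings["SCHEDULER_MEMORY_QUEUE"]),
            logunser=bool(settings.get("SCHEDULER_DEBUG", False)),
            stats=crawler.stats,
            crawler=crawler,
        )


# A fake crawler whose settings are a plain dict (assumption for illustration).
fake_crawler = SimpleNamespace(
    settings={"SCHEDULER_MEMORY_QUEUE": "collections.deque", "SCHEDULER_DEBUG": True},
    stats={},
)
scheduler = ToyScheduler.from_crawler(fake_crawler)
```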

def __init__(self, dupefilter, jobdir: Optional[str] = None, dqclass=None, mqclass=None, logunser: bool = False, stats=None, pqclass=None, crawler: Optional[Crawler] = None):

Undocumented

def __len__(self) -> int:

Return the total amount of enqueued requests

def close(self, reason: str) -> Optional[Deferred]:

(1) dump pending requests to disk if there is a disk queue
(2) return the result of the dupefilter's ``close`` method

def enqueue_request(self, request: Request) -> bool:

Unless the received request is filtered out by the Dupefilter, attempt to push it into the disk queue, falling back to pushing it into the memory queue. Increment the appropriate stats, such as: ``scheduler/enqueued``, ``scheduler/enqueued/disk``, ``scheduler/enqueued/memory``. Return ``True`` if the request was stored successfully, ``False`` otherwise.
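The enqueue path can be illustrated with a toy model. This is a minimal sketch, not Scrapy's code: ``ToyScheduler`` is hypothetical, requests are plain dicts, a set of URLs stands in for the dupefilter, and ``pickle`` stands in for whatever serialization the disk queue uses. The stat keys are the ones named in the docstring.

```python
import pickle
from collections import Counter


class ToyScheduler:
    """Toy model of the enqueue path; not Scrapy's implementation."""

    def __init__(self, use_disk):
        self.seen = set()                      # stand-in for the dupefilter
        self.disk = [] if use_disk else None   # holds pickled requests
        self.memory = []                       # holds live objects
        self.stats = Counter()

    def _dqpush(self, request):
        """Try the disk queue; report failure instead of raising."""
        if self.disk is None:
            return False
        try:
            self.disk.append(pickle.dumps(request))
        except (pickle.PicklingError, AttributeError, TypeError):
            return False
        return True

    def enqueue_request(self, request):
        if request["url"] in self.seen:
            return False                       # filtered out as a duplicate
        self.seen.add(request["url"])
        if self._dqpush(request):
            self.stats["scheduler/enqueued/disk"] += 1
        else:
            # No disk queue, or the request could not be serialized.
            self.memory.append(request)
            self.stats["scheduler/enqueued/memory"] += 1
        self.stats["scheduler/enqueued"] += 1
        return True


scheduler = ToyScheduler(use_disk=True)
results = [
    scheduler.enqueue_request({"url": "https://example.com/a"}),
    scheduler.enqueue_request({"url": "https://example.com/a"}),  # duplicate
    # A lambda cannot be pickled, so this request falls back to memory.
    scheduler.enqueue_request({"url": "https://example.com/b", "callback": lambda r: r}),
]
```

In this model a rejected duplicate leaves the counters untouched, while an unserializable request still counts as stored.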

def has_pending_requests(self) -> bool:

``True`` if the scheduler has enqueued requests, ``False`` otherwise
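How ``__len__`` and ``has_pending_requests`` relate can be shown with a small sketch. It assumes that the total counts both queues and that a missing disk queue counts as zero; ``ToyScheduler`` and its list-based queues are hypothetical stand-ins.

```python
class ToyScheduler:
    """Toy model; ``mqs`` and ``dqs`` are plain lists here."""

    def __init__(self, memory, disk=None):
        self.mqs = list(memory)
        self.dqs = None if disk is None else list(disk)

    def __len__(self):
        # Count the disk queue only when one exists.
        disk_count = len(self.dqs) if self.dqs is not None else 0
        return len(self.mqs) + disk_count

    def has_pending_requests(self):
        return len(self) > 0


with_disk = ToyScheduler(memory=["a"], disk=["b", "c"])
memory_only = ToyScheduler(memory=[])
```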

def next_request(self) -> Optional[Request]:

Return a :class:`~scrapy.http.Request` object from the memory queue, falling back to the disk queue if the memory queue is empty. Return ``None`` if there are no more enqueued requests. Increment the appropriate stats, such as: ``scheduler/dequeued``, ``scheduler/dequeued/disk``, ``scheduler/dequeued/memory``.
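The dequeue order, memory first and then disk, can be sketched as follows. This is a toy model under stated assumptions: ``ToyScheduler`` is hypothetical, both queues are plain lists popped from the end, and requests are strings. The stat keys are the ones named in the docstring.

```python
from collections import Counter


class ToyScheduler:
    """Toy model of the dequeue path; not Scrapy's implementation."""

    def __init__(self, memory, disk=None):
        self.memory = list(memory)
        self.disk = None if disk is None else list(disk)
        self.stats = Counter()

    def next_request(self):
        request = self.memory.pop() if self.memory else None
        if request is not None:
            self.stats["scheduler/dequeued/memory"] += 1
        elif self.disk:
            # Memory queue is empty: fall back to the disk queue.
            request = self.disk.pop()
            self.stats["scheduler/dequeued/disk"] += 1
        if request is not None:
            self.stats["scheduler/dequeued"] += 1
        return request


scheduler = ToyScheduler(memory=["m1"], disk=["d1"])
order = [scheduler.next_request() for _ in range(3)]
```

The third call finds both queues empty and returns ``None`` without touching the counters.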

def open(self, spider: Spider) -> Optional[Deferred]:

(1) initialize the memory queue
(2) initialize the disk queue if the ``jobdir`` attribute is a valid directory
(3) return the result of the dupefilter's ``open`` method
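The open and close lifecycle can be sketched with a toy scheduler that persists its disk queue between runs. This is an illustration under assumptions, not Scrapy's code: ``ToyScheduler`` is hypothetical, the ``pending.json`` file name is invented for the example, and the dupefilter's ``open`` and ``close`` calls are left out.

```python
import json
import tempfile
from pathlib import Path


class ToyScheduler:
    """Toy model of the open/close lifecycle."""

    def __init__(self, jobdir=None):
        self.jobdir = jobdir
        self.mqs = None
        self.dqs = None

    def open(self):
        self.mqs = []  # the memory queue always exists
        if self.jobdir and Path(self.jobdir).is_dir():
            # "pending.json" is a hypothetical file name for this sketch.
            state_file = Path(self.jobdir, "pending.json")
            self.dqs = json.loads(state_file.read_text()) if state_file.exists() else []

    def close(self):
        # Dump pending requests only if there is a disk queue.
        if self.dqs is not None:
            Path(self.jobdir, "pending.json").write_text(json.dumps(self.dqs))


with tempfile.TemporaryDirectory() as jobdir:
    first = ToyScheduler(jobdir)
    first.open()
    first.dqs.append("https://example.com/a")
    first.close()

    # A second run pointed at the same directory picks up the pending request.
    second = ToyScheduler(jobdir)
    second.open()
    resumed = list(second.dqs)

no_disk = ToyScheduler()
no_disk.open()
```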

crawler

Undocumented

df

Undocumented

dqclass

Undocumented

dqdir

Undocumented

dqs

Undocumented

logunser

Undocumented

mqclass

Undocumented

mqs

Undocumented

pqclass

Undocumented

spider

Undocumented

stats

Undocumented

def _dq(self):

Create a new priority queue instance, with disk storage

def _dqdir(self, jobdir: Optional[str]) -> Optional[str]:

Return a folder name to keep disk queue state at
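A helper with this signature might look like the sketch below. The function name ``dqdir_for`` is hypothetical, and both the ``requests.queue`` subfolder name and the choice to create the folder on demand are assumptions, not confirmed by this page.

```python
import tempfile
from pathlib import Path
from typing import Optional


def dqdir_for(jobdir: Optional[str]) -> Optional[str]:
    """Return the disk queue folder for ``jobdir``, or None when persistence is off."""
    if not jobdir:
        return None
    dqdir = Path(jobdir, "requests.queue")  # assumed subfolder name
    dqdir.mkdir(parents=True, exist_ok=True)
    return str(dqdir)


with tempfile.TemporaryDirectory() as jobdir:
    created = dqdir_for(jobdir)
    created_is_dir = Path(created).is_dir()
    created_name = Path(created).name

disabled = dqdir_for(None)
```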

def _dqpop(self) -> Optional[Request]:

Undocumented

def _dqpush(self, request: Request) -> bool:

Undocumented

def _mq(self):

Create a new priority queue instance, with in-memory storage

def _mqpush(self, request: Request):

Undocumented

def _read_dqs_state(self, dqdir: str) -> list:

Undocumented

def _write_dqs_state(self, dqdir: str, state: list):

Undocumented