class Scheduler(BaseScheduler):
Default Scrapy scheduler. This implementation also handles duplication filtering via the :setting:`dupefilter <DUPEFILTER_CLASS>`.

This scheduler stores requests into several priority queues (defined by the :setting:`SCHEDULER_PRIORITY_QUEUE` setting). In turn, said priority queues are backed by either memory or disk based queues (respectively defined by the :setting:`SCHEDULER_MEMORY_QUEUE` and :setting:`SCHEDULER_DISK_QUEUE` settings).

Request prioritization is almost entirely delegated to the priority queue. The only prioritization performed by this scheduler is using the disk-based queue if present (i.e. if the :setting:`JOBDIR` setting is defined) and falling back to the memory-based queue if a serialization error occurs. If the disk queue is not present, the memory one is used directly.

:param dupefilter: An object responsible for checking and filtering duplicate requests. The value for the :setting:`DUPEFILTER_CLASS` setting is used by default.
:type dupefilter: :class:`scrapy.dupefilters.BaseDupeFilter` instance or similar: any class that implements the `BaseDupeFilter` interface

:param jobdir: The path of a directory to be used for persisting the crawl's state. The value for the :setting:`JOBDIR` setting is used by default. See :ref:`topics-jobs`.
:type jobdir: :class:`str` or ``None``

:param dqclass: A class to be used as persistent request queue. The value for the :setting:`SCHEDULER_DISK_QUEUE` setting is used by default.
:type dqclass: class

:param mqclass: A class to be used as non-persistent request queue. The value for the :setting:`SCHEDULER_MEMORY_QUEUE` setting is used by default.
:type mqclass: class

:param logunser: A boolean that indicates whether or not unserializable requests should be logged. The value for the :setting:`SCHEDULER_DEBUG` setting is used by default.
:type logunser: bool

:param stats: A stats collector object to record stats about the request scheduling process. The value for the :setting:`STATS_CLASS` setting is used by default.
:type stats: :class:`scrapy.statscollectors.StatsCollector` instance or similar: any class that implements the `StatsCollector` interface

:param pqclass: A class to be used as priority queue for requests. The value for the :setting:`SCHEDULER_PRIORITY_QUEUE` setting is used by default.
:type pqclass: class

:param crawler: The crawler object corresponding to the current crawl.
:type crawler: :class:`scrapy.crawler.Crawler`
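As a rough illustration of how the settings named above relate to each other, a project's settings module might look something like the sketch below. The class paths are the values Scrapy documents as defaults; the ``JOBDIR`` path is a hypothetical example, and enabling ``SCHEDULER_DEBUG`` is only one possible choice.

```python
# settings.py (sketch): settings consulted when the default scheduler is built.

# Priority queue that wraps the per-priority memory or disk queues.
SCHEDULER_PRIORITY_QUEUE = "scrapy.pqueues.ScrapyPriorityQueue"

# Non-persistent queue (passed to the scheduler as ``mqclass``).
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.LifoMemoryQueue"

# Persistent queue (passed as ``dqclass``); only relevant when JOBDIR is set.
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleLifoDiskQueue"

# Duplicate filter (passed as ``dupefilter``).
DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"

# Log requests that cannot be serialized to disk (passed as ``logunser``).
SCHEDULER_DEBUG = True

# Hypothetical directory for persisting crawl state; enables the disk queue.
JOBDIR = "crawls/example-run-1"
```

Leaving ``JOBDIR`` unset would mean no disk queue is created, so every request would go to the memory queue directly.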
Class Method | from | Factory method, initializes the scheduler with arguments taken from the crawl settings |
Method | __init__ | Undocumented |
Method | __len__ | Return the total amount of enqueued requests |
Method | close | (1) dump pending requests to disk if there is a disk queue (2) return the result of the dupefilter's ``close`` method |
Method | enqueue | Unless the received request is filtered out by the Dupefilter, attempt to push it into the disk queue, falling back to pushing it into the memory queue. |
Method | has | ``True`` if the scheduler has enqueued requests, ``False`` otherwise |
Method | next | Return a :class:`~scrapy.http.Request` object from the memory queue, falling back to the disk queue if the memory queue is empty. Return ``None`` if there are no more enqueued requests. |
Method | open | (1) initialize the memory queue (2) initialize the disk queue if the ``jobdir`` attribute is a valid directory (3) return the result of the dupefilter's ``open`` method |
Instance Variable | crawler | Undocumented |
Instance Variable | df | Undocumented |
Instance Variable | dqclass | Undocumented |
Instance Variable | dqdir | Undocumented |
Instance Variable | dqs | Undocumented |
Instance Variable | logunser | Undocumented |
Instance Variable | mqclass | Undocumented |
Instance Variable | mqs | Undocumented |
Instance Variable | pqclass | Undocumented |
Instance Variable | spider | Undocumented |
Instance Variable | stats | Undocumented |
Method | _dq | Create a new priority queue instance, with disk storage |
Method | _dqdir | Return a folder name to keep disk queue state at |
Method | _dqpop | Undocumented |
Method | _dqpush | Undocumented |
Method | _mq | Create a new priority queue instance, with in-memory storage |
Method | _mqpush | Undocumented |
Method | _read | Undocumented |
Method | _write | Undocumented |
def __init__(self, dupefilter, jobdir: Optional[str] = None, dqclass=None, mqclass=None, logunser: bool = False, stats=None, pqclass=None, crawler: Optional[Crawler] = None):
Undocumented
Unless the received request is filtered out by the Dupefilter, attempt to push it into the disk queue, falling back to pushing it into the memory queue. Increment the appropriate stats, such as: ``scheduler/enqueued``, ``scheduler/enqueued/disk``, ``scheduler/enqueued/memory``. Return ``True`` if the request was stored successfully, ``False`` otherwise.
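The enqueue behaviour described above can be sketched with a small standalone model. This is not Scrapy's implementation: ``ToyScheduler``, its attributes, and the use of plain dicts as requests are all hypothetical stand-ins, and ``pickle`` is used here only to simulate a serialization failure.

```python
import pickle
from collections import defaultdict


class ToyScheduler:
    """Hypothetical stand-in that mimics the documented enqueue logic."""

    def __init__(self, use_disk=True):
        self.seen = set()                      # stands in for the dupefilter
        self.disk = [] if use_disk else None   # stands in for the disk queue
        self.memory = []                       # stands in for the memory queue
        self.stats = defaultdict(int)          # stands in for the stats collector

    def _disk_push(self, request):
        """Return True if the request could be serialized and stored on 'disk'."""
        if self.disk is None:
            return False
        try:
            self.disk.append(pickle.dumps(request))
        except (pickle.PicklingError, AttributeError, TypeError):
            # Serialization failed: the caller falls back to the memory queue.
            return False
        return True

    def enqueue(self, request):
        """Return True if the request was stored, False if it was filtered out."""
        key = request["url"]
        if key in self.seen:
            return False
        self.seen.add(key)
        if self._disk_push(request):
            self.stats["scheduler/enqueued/disk"] += 1
        else:
            self.memory.append(request)
            self.stats["scheduler/enqueued/memory"] += 1
        self.stats["scheduler/enqueued"] += 1
        return True


if __name__ == "__main__":
    scheduler = ToyScheduler()
    scheduler.enqueue({"url": "https://example.com/a"})
    # A lambda cannot be pickled, so this request lands in the memory queue.
    scheduler.enqueue({"url": "https://example.com/b", "callback": lambda r: r})
    print(dict(scheduler.stats))
```

In this model a duplicate URL is rejected before either queue is touched, which mirrors the order of the steps in the description: filter first, then disk, then memory.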
Return a :class:`~scrapy.http.Request` object from the memory queue, falling back to the disk queue if the memory queue is empty. Return ``None`` if there are no more enqueued requests. Increment the appropriate stats, such as: ``scheduler/dequeued``, ``scheduler/dequeued/disk``, ``scheduler/dequeued/memory``.
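The dequeue order described above (memory first, then disk, then ``None``) can be illustrated with a short standalone sketch. The function name ``pop_next`` and the use of plain lists as queues are assumptions made for the example; the actual scheduler delegates to its priority queue classes.

```python
from collections import defaultdict


def pop_next(memory_queue, disk_queue, stats):
    """Hypothetical helper: pop from memory first, then disk, else return None.

    ``disk_queue`` may be None to model a crawl without a disk queue.
    """
    if memory_queue:
        request = memory_queue.pop()
        stats["scheduler/dequeued/memory"] += 1
    elif disk_queue:
        request = disk_queue.pop()
        stats["scheduler/dequeued/disk"] += 1
    else:
        # Nothing left in either queue.
        return None
    stats["scheduler/dequeued"] += 1
    return request


if __name__ == "__main__":
    stats = defaultdict(int)
    memory, disk = ["from-memory"], ["from-disk"]
    while (request := pop_next(memory, disk, stats)) is not None:
        print(request)
    print(dict(stats))
```

Note that the total counter is only incremented when a request is actually returned, so an empty scheduler leaves the stats unchanged.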
``open`` overrides :meth:`scrapy.core.scheduler.BaseScheduler.open`.