Core API

New in version 0.15.

This section documents the Scrapy core API, and it’s intended for developers of extensions and middlewares.

Crawler API

The main entry point to Scrapy API is the Crawler object, passed to extensions through the from_crawler class method. This object provides access to all Scrapy core components, and it’s the only way for extensions to access them and hook their functionality into Scrapy.

The Extension Manager is responsible for loading and keeping track of installed extensions, and it’s configured through the EXTENSIONS setting, which contains a dictionary of all available extensions and their order, similar to how you configure the downloader middlewares.
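
A sketch of what such an EXTENSIONS entry looks like in a project's settings module; the path myproject.extensions.MyExtension is hypothetical:

    # settings.py of a Scrapy project (sketch; the custom extension path is hypothetical)
    EXTENSIONS = {
        'myproject.extensions.MyExtension': 500,          # integer order, similar to middleware orderings
        'scrapy.extensions.telnet.TelnetConsole': None,   # None disables a built-in extension
    }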

  • settings
  • The settings manager of this crawler.

This is used by extensions & middlewares to access the Scrapy settings of this crawler.

For an introduction on Scrapy settings see Settings.

For the API see Settings class.

  • signals
  • The signals manager of this crawler.

This is used by extensions & middlewares to hook themselves into Scrapy functionality, as sketched below.

For an introduction on signals see Signals.

For the API see SignalManager class.
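
Putting the settings and signals attributes together, here is a minimal extension sketch; the class name and the SPIDER_OPENED_MESSAGE setting are hypothetical:

    from scrapy import signals
    from scrapy.exceptions import NotConfigured


    class SpiderOpenedLogger:
        """Hypothetical extension that logs a configurable message when a spider opens."""

        def __init__(self, message):
            self.message = message

        @classmethod
        def from_crawler(cls, crawler):
            # The crawler object gives access to settings, signals, stats, ...
            message = crawler.settings.get('SPIDER_OPENED_MESSAGE')  # hypothetical setting
            if not message:
                raise NotConfigured
            ext = cls(message)
            crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
            return ext

        def spider_opened(self, spider):
            spider.logger.info(self.message)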

  • stats
  • The stats collector of this crawler.

This is used by extensions & middlewares to record stats of their behaviour, or to access stats collected by other extensions.

For an introduction on stats collection see Stats Collection.

For the API see StatsCollector class.

  • extensions
  • The extension manager that keeps track of enabled extensions.

Most extensions won’t need to access this attribute.

For an introduction on extensions and a list of available extensions on Scrapy see Extensions.

  • engine
  • The execution engine, which coordinates the core crawling logic between the scheduler, downloader and spiders.

Some extensions may want to access the Scrapy engine, to inspect or modify the downloader and scheduler behaviour, although this is an advanced use and this API is not yet stable.

  • spider
  • Spider currently being crawled. This is an instance of the spider class provided while constructing the crawler, and it is created after the arguments given in the crawl() method.

  • crawl(*args, **kwargs)[source]

  • Starts the crawler by instantiating its spider class with the given args and kwargs arguments, while setting the execution engine in motion.

Returns a deferred that is fired when the crawl is finished.

  • stop()[source]
  • Starts a graceful stop of the crawler and returns a deferred that is fired when the crawler is stopped.
  • class scrapy.crawler.CrawlerRunner(settings=None)[source]
  • This is a convenient helper class that keeps track of, manages and runs crawlers inside an already set up reactor.

The CrawlerRunner object must be instantiated with a Settings object.

This class shouldn’t be needed (since Scrapy is responsible for using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.
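
For instance, a minimal sketch of such a script, assuming a hypothetical MySpider class defined in your project:

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    from myproject.spiders import MySpider  # hypothetical spider class

    configure_logging()  # CrawlerRunner does not configure logging on its own
    runner = CrawlerRunner(get_project_settings())

    d = runner.crawl(MySpider)           # returns a deferred
    d.addBoth(lambda _: reactor.stop())  # stop the reactor when the crawl finishes
    reactor.run()                        # blocks until reactor.stop() is called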

  • crawl(crawler_or_spidercls, *args, **kwargs)[source]
  • Run a crawler with the provided arguments.

It will call the given Crawler’s crawl() method, while keeping track of it so it can be stopped later.

If crawler_or_spidercls isn’t a Crawler instance, this method will try to create one using this parameter as the spider class given to it.

Returns a deferred that is fired when the crawling is finished.

Parameters:

  • crawler_or_spidercls (Crawler instance, Spider subclass or string) – already created crawler, or a spider class or spider’s name inside the project to create it
  • args (list) – arguments to initialize the spider
  • kwargs (dict) – keyword arguments to initialize the spider
  • property crawlers
  • Set of crawlers started by crawl() and managed by this class.

  • create_crawler(crawler_or_spidercls)[source]

  • Return a Crawler object.

    • If crawler_or_spidercls is a Crawler, it is returned as-is.
    • If crawler_or_spidercls is a Spider subclass, a new Crawler is constructed for it.
    • If crawler_or_spidercls is a string, this function finds a spider with this name in a Scrapy project (using spider loader), then creates a Crawler instance for it.
  • join()[source]
  • Returns a deferred that is fired when all managed crawlers have completed their executions.

  • stop()[source]

  • Simultaneously stops all the crawling jobs taking place.

Returns a deferred that is fired when they all have ended.

  • class scrapy.crawler.CrawlerProcess(settings=None, install_root_handler=True)[source]
  • A class to run multiple Scrapy crawlers in a process simultaneously.

This class extends CrawlerRunner by adding support for starting a reactor and handling shutdown signals, like the keyboard interrupt command Ctrl-C. It also configures top-level logging.

This utility should be a better fit than CrawlerRunner if you aren’t running another reactor within your application.

The CrawlerProcess object must be instantiated with a Settings object.

Parameters:install_root_handler – whether to install the root logging handler (default: True)

This class shouldn’t be needed (since Scrapy is responsible for using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.
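
A comparable sketch with CrawlerProcess, which starts and stops the reactor for you (MySpider is again a hypothetical spider class):

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from myproject.spiders import MySpider  # hypothetical spider class

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)  # crawl() can be called several times before start()
    process.start()          # starts the reactor; blocks until all crawls finish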

  • crawl(crawler_or_spidercls, *args, **kwargs)
  • Run a crawler with the provided arguments.

It will call the given Crawler’s crawl() method, while keeping track of it so it can be stopped later.

If crawler_or_spidercls isn’t a Crawler instance, this method will try to create one using this parameter as the spider class given to it.

Returns a deferred that is fired when the crawling is finished.

Parameters:

  • crawler_or_spidercls (Crawler instance, Spider subclass or string) – already created crawler, or a spider class or spider’s name inside the project to create it
  • args (list) – arguments to initialize the spider
  • kwargs (dict) – keyword arguments to initialize the spider
  • property crawlers
  • Set of crawlers started by crawl() and managed by this class.

  • create_crawler(crawler_or_spidercls)

  • Return a Crawler object.

    • If crawler_or_spidercls is a Crawler, it is returned as-is.
    • If crawler_or_spidercls is a Spider subclass, a new Crawler is constructed for it.
    • If crawler_or_spidercls is a string, this function finds a spider with this name in a Scrapy project (using spider loader), then creates a Crawler instance for it.
  • join()
  • Returns a deferred that is fired when all managed crawlers have completed their executions.

  • start(stop_after_crawl=True)[source]

  • This method starts a reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE.

If stop_after_crawl is True, the reactor will be stopped after all crawlers have finished, using join().

Parameters:stop_after_crawl (boolean) – whether to stop the reactor when all crawlers have finished

  • stop()
  • Simultaneously stops all the crawling jobs taking place.

Returns a deferred that is fired when they all have ended.

Settings API

  • scrapy.settings.SETTINGS_PRIORITIES
  • Dictionary that sets the key name and priority level of the default settings priorities used in Scrapy.

Each item defines a settings entry point, giving it a code name for identification and an integer priority. Greater priorities take precedence over lesser ones when setting and retrieving values in the Settings class.

    SETTINGS_PRIORITIES = {
        'default': 0,
        'command': 10,
        'project': 20,
        'spider': 30,
        'cmdline': 40,
    }

For a detailed explanation of each settings source, see: Settings.

  • scrapy.settings.get_settings_priority(priority)[source]
  • Small helper function that looks up a given string priority in the SETTINGS_PRIORITIES dictionary and returns its numerical value, or directly returns a given numerical priority.
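
For example, a quick sketch of both lookup modes:

    from scrapy.settings import get_settings_priority

    get_settings_priority('cmdline')  # -> 40, looked up in SETTINGS_PRIORITIES
    get_settings_priority(15)         # -> 15, integers are passed through unchanged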

  • class scrapy.settings.Settings(values=None, priority='project')[source]
  • This object stores Scrapy settings for the configuration of internal components, and can be used for any further customization.

It is a direct subclass and supports all methods of BaseSettings. Additionally, after instantiation of this class, the new object will have the global default settings described on Built-in settings reference already populated.

  • class scrapy.settings.BaseSettings(values=None, priority='project')[source]
  • Instances of this class behave like dictionaries, but store priorities along with their (key, value) pairs, and can be frozen (i.e. marked immutable).

Key-value entries can be passed on initialization with the values argument, and they would take the priority level (unless values is already an instance of BaseSettings, in which case the existing priority levels will be kept). If the priority argument is a string, the priority name will be looked up in SETTINGS_PRIORITIES. Otherwise, a specific integer should be provided.

Once the object is created, new settings can be loaded or updated with the set() method, and can be accessed with the square bracket notation of dictionaries, or with the get() method of the instance and its value conversion variants. When requesting a stored key, the value with the highest priority will be retrieved.
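
A short sketch of that priority behaviour (the setting names are ordinary built-in settings used here only for illustration):

    from scrapy.settings import BaseSettings

    settings = BaseSettings({'CONCURRENT_REQUESTS': 32}, priority='project')

    settings.set('CONCURRENT_REQUESTS', 8, priority='default')   # ignored: lower priority
    settings.set('CONCURRENT_REQUESTS', 64, priority='cmdline')  # stored: higher priority

    settings['CONCURRENT_REQUESTS']              # -> 64
    settings.getpriority('CONCURRENT_REQUESTS')  # -> 40 ('cmdline')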

  • copy()[source]
  • Make a deep copy of current settings.

This method returns a new instance of the Settings class, populated with the same values and their priorities.

Modifications to the new object won’t be reflected on the original settings.

  • copy_to_dict()[source]
  • Make a copy of current settings and convert to a dict.

This method returns a new dict populated with the same values and their priorities as the current settings.

Modifications to the returned dict won’t be reflected on the original settings.

This method can be useful, for example, for printing settings in Scrapy shell.

  • freeze()[source]
  • Disable further changes to the current settings.

After calling this method, the present state of the settings will become immutable. Trying to change values through the set() method and its variants won’t be possible and will be alerted.

  • frozencopy()[source]
  • Return an immutable copy of the current settings.

Alias for a freeze() call in the object returned by copy().
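
A small sketch of freezing; in current Scrapy versions modifying a frozen settings object raises a TypeError, which is assumed here:

    from scrapy.settings import BaseSettings

    settings = BaseSettings({'LOG_LEVEL': 'INFO'})
    frozen = settings.frozencopy()      # immutable copy; the original stays mutable

    settings.set('LOG_LEVEL', 'DEBUG')  # fine
    frozen.set('LOG_LEVEL', 'DEBUG')    # raises TypeError: the copy is frozen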

  • get(name, default=None)[source]
  • Get a setting value without affecting its original type.

Parameters:

  • name (string) – the setting name
  • default (any) – the value to return if no setting is found
  • getbool(name, default=False)[source]
  • Get a setting value as a boolean.

1, '1', True and 'True' return True, while 0, '0', False, 'False' and None return False.

For example, settings populated through environment variables set to '0' will return False when using this method.

Parameters:

  • name (string) – the setting name
  • default (any) – the value to return if no setting is found
  • getdict(name, default=None)[source]
  • Get a setting value as a dictionary. If the setting original type is a dictionary, a copy of it will be returned. If it is a string it will be evaluated as a JSON dictionary. In the case that it is a BaseSettings instance itself, it will be converted to a dictionary, containing all its current settings values as they would be returned by get(), and losing all information about priority and mutability.

Parameters:

  • name (string) – the setting name
  • default (any) – the value to return if no setting is found
  • getfloat(name, default=0.0)[source]
  • Get a setting value as a float.

Parameters:

  • name (string) – the setting name
  • default (any) – the value to return if no setting is found
  • getint(name, default=0)[source]
  • Get a setting value as an int.

Parameters:

  • name (string) – the setting name
  • default (any) – the value to return if no setting is found
  • getlist(name, default=None)[source]
  • Get a setting value as a list. If the setting original type is a list, a copy of it will be returned. If it’s a string it will be split by ",".

For example, settings populated through environment variables set to 'one,two' will return a list ['one', 'two'] when using this method.

Parameters:

  • name (string) – the setting name
  • default (any) – the value to return if no setting is found
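
A sketch of the conversion helpers above (getbool(), getint(), getlist(), getdict()) applied to values stored as strings, as they typically arrive from environment variables or the command line:

    from scrapy.settings import BaseSettings

    settings = BaseSettings({
        'HTTPCACHE_ENABLED': '0',
        'RETRY_TIMES': '5',
        'SPIDER_MODULES': 'myproject.spiders,myproject.extra_spiders',
        'DEFAULT_REQUEST_HEADERS': '{"Accept-Language": "en"}',
    })

    settings.getbool('HTTPCACHE_ENABLED')        # -> False
    settings.getint('RETRY_TIMES')               # -> 5
    settings.getlist('SPIDER_MODULES')           # -> ['myproject.spiders', 'myproject.extra_spiders']
    settings.getdict('DEFAULT_REQUEST_HEADERS')  # -> {'Accept-Language': 'en'}
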
  • getpriority(name)[source]
  • Return the current numerical priority value of a setting, or None if the given name does not exist.

Parameters:name (string) – the setting name

  • getwithbase(name)[source]
  • Get a composition of a dictionary-like setting and its _BASE counterpart.

Parameters:name (string) – name of the dictionary-like setting

  • maxpriority()[source]
  • Return the numerical value of the highest priority present throughout all settings, or the numerical value for default from SETTINGS_PRIORITIES if there are no settings stored.

  • set(name, value, priority='project')[source]

  • Store a key/value attribute with a given priority.

Settings should be populated before configuring the Crawler object (through the configure() method), otherwise they won’t have any effect.

Parameters:

  • name (string) – the setting name
  • value (any) – the value to associate with the setting
  • priority (string or int) – the priority of the setting. Should be a key of SETTINGS_PRIORITIES or an integer
  • setmodule(module, priority='project')[source]
  • Store settings from a module with a given priority.

This is a helper function that calls set() for every globally declared uppercase variable of module with the provided priority.

Parameters:

  • module (module object or string) – the module or the path of the module
  • priority (string or int) – the priority of the settings. Should be a key of SETTINGS_PRIORITIES or an integer
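
For instance, a sketch with a hypothetical myproject.custom_settings module:

    from scrapy.settings import BaseSettings

    # Hypothetical module myproject/custom_settings.py containing:
    #     DOWNLOAD_DELAY = 2.0
    #     USER_AGENT = 'my-crawler (+https://example.com)'
    #     _helper = 'ignored'   # lowercase names are skipped

    settings = BaseSettings()
    settings.setmodule('myproject.custom_settings', priority='project')
    settings.getfloat('DOWNLOAD_DELAY')  # -> 2.0
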
  • update(values, priority='project')[source]
  • Store key/value pairs with a given priority.

This is a helper function that calls set() for every item of values with the provided priority.

If values is a string, it is assumed to be JSON-encoded and parsed into a dict with json.loads() first. If it is a BaseSettings instance, the per-key priorities will be used and the priority parameter ignored. This allows inserting/updating settings with different priorities with a single command.

Parameters:

  • values (dict or string or BaseSettings) – the settings names and values
  • priority (string or int) – the priority of the settings. Should be a key of SETTINGS_PRIORITIES or an integer
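
A sketch of both forms: a plain dict stored with a single priority, and a BaseSettings instance whose per-key priorities are preserved:

    from scrapy.settings import BaseSettings

    settings = BaseSettings()
    settings.update({'DOWNLOAD_DELAY': 1.0, 'ROBOTSTXT_OBEY': True}, priority='project')

    overrides = BaseSettings()
    overrides.set('DOWNLOAD_DELAY', 0.25, priority='cmdline')
    overrides.set('ROBOTSTXT_OBEY', False, priority='default')

    settings.update(overrides)            # per-key priorities of `overrides` are used
    settings.getfloat('DOWNLOAD_DELAY')   # -> 0.25 ('cmdline' beats 'project')
    settings.getbool('ROBOTSTXT_OBEY')    # -> True  ('default' does not beat 'project')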

SpiderLoader API

  • class scrapy.spiderloader.SpiderLoader[source]
  • This class is in charge of retrieving and handling the spider classes defined across the project.

Custom spider loaders can be employed by specifying their path in the SPIDER_LOADER_CLASS project setting. They must fully implement the scrapy.interfaces.ISpiderLoader interface to guarantee an errorless execution.

  • from_settings(settings)[source]
  • This class method is used by Scrapy to create an instance of the class. It’s called with the current project settings, and it loads the spiders found recursively in the modules of the SPIDER_MODULES setting.

Parameters:settings (Settings instance) – project settings

  • load(spider_name)[source]
  • Get the Spider class with the given name. It’ll look into the previously loaded spiders for a spider class with name spider_name and will raise a KeyError if not found.

Parameters:spider_name (str) – spider class name

  • list()[source]
  • Get the names of the available spiders in the project.

  • find_by_request(request)[source]

  • List the spiders’ names that can handle the given request. Will try to match the request’s url against the domains of the spiders.

Parameters:request (Request instance) – queried request
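
A sketch of typical usage from a script; the spider name 'example' is hypothetical:

    from scrapy.spiderloader import SpiderLoader
    from scrapy.utils.project import get_project_settings

    loader = SpiderLoader.from_settings(get_project_settings())

    loader.list()                        # names of all spiders found in SPIDER_MODULES
    spider_cls = loader.load('example')  # hypothetical name; raises KeyError if unknown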

Signals API

  • class scrapy.signalmanager.SignalManager(sender=_Anonymous)[source]
    • connect(receiver, signal, **kwargs)[source]
    • Connect a receiver function to a signal.

The signal can be any object, although Scrapy comes with some predefined signals that are documented in the Signals section.

Parameters:

  • receiver (callable) – the function to be connected
  • signal (object) – the signal to connect to
  • disconnect(receiver, signal, **kwargs)[source]
  • Disconnect a receiver function from a signal. This has the opposite effect of the connect() method, and the arguments are the same.

  • disconnect_all(signal, **kwargs)[source]

  • Disconnect all receivers from the given signal.

Parameters:signal (object) – the signal to disconnect from

  • send_catch_log(signal, **kwargs)[source]
  • Send a signal, catch exceptions and log them.

The keyword arguments are passed to the signal handlers (connected through the connect() method).

  • send_catch_log_deferred(signal, **kwargs)[source]
  • Like send_catch_log() but supports returning Deferred objects from the signal handlers.

Returns a Deferred that gets fired once all signal handler deferreds were fired. Send a signal, catch exceptions and log them.

The keyword arguments are passed to the signal handlers (connected through the connect() method).
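
A small sketch of connecting a receiver and sending a custom signal through a crawler's signal manager; the custom signal object and helper function are hypothetical:

    from scrapy import signals

    items_exported = object()  # hypothetical custom signal: any object can act as a signal


    def on_spider_closed(spider, reason):
        spider.logger.info('spider closed (%s)', reason)


    def wire_up(crawler, spider):
        # crawler.signals is the SignalManager of that crawler
        crawler.signals.connect(on_spider_closed, signal=signals.spider_closed)
        # keyword arguments are forwarded to every connected handler
        crawler.signals.send_catch_log(items_exported, spider=spider, count=42)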

Stats Collector API

There are several Stats Collectors available under the scrapy.statscollectors module and they all implement the Stats Collector API defined by the StatsCollector class (which they all inherit from).

  • class scrapy.statscollectors.StatsCollector[source]
    • get_value(key, default=None)[source]
    • Return the value for the given stats key or default if it doesn’t exist.

    • get_stats()[source]

    • Get all stats from the currently running spider as a dict.

    • set_value(key, value)[source]

    • Set the given value for the given stats key.

    • set_stats(stats)[source]

    • Override the current stats with the dict passed in stats argument.

    • inc_value(key, count=1, start=0)[source]

    • Increment the value of the given stats key by the given count, assuming the start value given (when it’s not set).

    • max_value(key, value)[source]

    • Set the given value for the given key only if the current value for the same key is lower than value. If there is no current value for the given key, the value is always set.

    • min_value(key, value)[source]

    • Set the given value for the given key only if the current value for the same key is greater than value. If there is no current value for the given key, the value is always set.

    • clear_stats()[source]

    • Clear all stats.
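
As an illustration, typical stat updates as an extension or middleware would perform them through crawler.stats; the stat keys are hypothetical:

    def record_response(stats, response):
        # `stats` would typically be `crawler.stats` inside an extension or middleware
        stats.inc_value('myext/response_count')                        # hypothetical stat key
        stats.max_value('myext/max_response_bytes', len(response.body))
        stats.set_value('myext/last_status', response.status)

    # Later, for example when the spider closes:
    #     stats.get_value('myext/response_count')  -> number of responses seen
    #     stats.get_stats()                        -> dict with all collected stats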

The following methods are not part of the stats collection API but are instead used when implementing custom stats collectors:

  • open_spider(spider)[source]
  • Open the given spider for stats collection.

  • close_spider(spider)[source]

  • Close the given spider. After this is called, no more specific stats can be accessed or collected.