Core API
New in version 0.15.
This section documents the Scrapy core API, and it is intended for developers of extensions and middlewares.
Crawler API
The main entry point to the Scrapy API is the Crawler object, passed to extensions through the from_crawler class method. This object provides access to all Scrapy core components, and it is the only way for extensions to access them and hook their functionality into Scrapy.
The Extension Manager is responsible for loading and keeping track of installed extensions, and it is configured through the EXTENSIONS setting, which contains a dictionary of all available extensions and their order, similar to how you configure the downloader middlewares.
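To illustrate the from_crawler hook described above, here is a minimal sketch of the pattern an extension follows. The extension class, its setting name DUMPSTATS_INTERVAL, and the SimpleNamespace stand-in for a real Crawler are all hypothetical, so the sketch runs without Scrapy; a real crawler exposes the same attributes (settings, signals, stats) documented below.

```python
from types import SimpleNamespace

class DumpStatsExtension:
    """Hypothetical extension showing the shape of the from_crawler hook."""

    def __init__(self, interval):
        self.interval = interval

    @classmethod
    def from_crawler(cls, crawler):
        # Read configuration from the crawler's settings; the setting
        # name DUMPSTATS_INTERVAL is made up for this example.
        interval = crawler.settings.get('DUMPSTATS_INTERVAL', 60)
        return cls(interval)

# A plain namespace stands in for a real scrapy.crawler.Crawler here.
stub_crawler = SimpleNamespace(settings={'DUMPSTATS_INTERVAL': 30})
ext = DumpStatsExtension.from_crawler(stub_crawler)
print(ext.interval)  # → 30
```

In a real project, Scrapy calls from_crawler itself for every class listed in EXTENSIONS, so the extension never instantiates a Crawler directly.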
- class scrapy.crawler.Crawler(spidercls, settings)[source]
The Crawler object must be instantiated with a scrapy.spiders.Spider subclass and a scrapy.settings.Settings object.
settings - This is used by extensions and middlewares to access the Scrapy settings of this crawler.
For an introduction on Scrapy settings see Settings.
For the API see Settings class.
signals - This is used by extensions and middlewares to hook themselves into Scrapy functionality.
For an introduction on signals see Signals.
For the API see SignalManager class.
stats - This is used from extensions and middlewares to record stats of their behaviour, or to access stats collected by other extensions.
For an introduction on stats collection see Stats Collection.
For the API see StatsCollector class.
extensions - Most extensions won't need to access this attribute.
For an introduction on extensions and a list of available extensions on Scrapy see Extensions.
engine - The execution engine, which coordinates the core crawling logic between the scheduler, downloader and spiders.
Some extensions may want to access the Scrapy engine, to inspect or modify the downloader and scheduler behaviour, although this is an advanced use and this API is not yet stable.
spider - Spider currently being crawled. This is an instance of the spider class provided while constructing the crawler, and it is created after the arguments given in the crawl() method.
crawl(*args, **kwargs)[source]- Starts the crawler by instantiating its spider class with the given args and kwargs arguments, while setting the execution engine in motion.
Returns a deferred that is fired when the crawl is finished.
stop()[source]- Starts a graceful stop of the crawler and returns a deferred that is fired when the crawler is stopped.
- class scrapy.crawler.CrawlerRunner(settings=None)[source]
This is a convenient helper class that keeps track of, manages and runs crawlers inside an already set up reactor.
The CrawlerRunner object must be instantiated with a Settings object.
This class shouldn't be needed (since Scrapy is responsible for using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.
crawl(crawler_or_spidercls, *args, **kwargs)[source]- Run a crawler with the provided arguments.
It will call the given Crawler's crawl() method, while keeping track of it so it can be stopped later.
If crawler_or_spidercls isn't a Crawler instance, this method will try to create one using this parameter as the spider class given to it.
Returns a deferred that is fired when the crawling is finished.
Parameters:
- **crawler_or_spidercls** (*Crawler instance, Spider subclass or string*) – already created crawler, or a spider class or spider's name inside the project to create it
- **args** (*list*) – arguments to initialize the spider
- **kwargs** (*dict*) – keyword arguments to initialize the spider
- property crawlers
Set of crawlers started by crawl() and managed by this class.
create_crawler(crawler_or_spidercls)[source]- Return a Crawler object.
- If crawler_or_spidercls is a Crawler, it is returned as-is.
- If crawler_or_spidercls is a Spider subclass, a new Crawler is constructed for it.
- If crawler_or_spidercls is a string, this function finds a spider with this name in a Scrapy project (using spider loader), then creates a Crawler instance for it.
join()[source]- Returns a deferred that is fired when all managed crawlers have completed their executions.
stop()[source]- Stops simultaneously all the crawling jobs taking place.
Returns a deferred that is fired when they all have ended.
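The three dispatch rules of create_crawler() listed above can be sketched in plain Python. FakeCrawler, FakeSpider, and the SPIDER_LOADER dict are stand-ins for the real Scrapy classes and spider loader, so this runs without Scrapy installed; only the dispatch logic mirrors the documented behaviour.

```python
class FakeCrawler:
    """Stand-in for scrapy.crawler.Crawler."""
    def __init__(self, spidercls):
        self.spidercls = spidercls

class FakeSpider:
    """Stand-in for a scrapy.spiders.Spider subclass."""
    name = 'example'

# Stands in for the project's spider loader (name -> spider class).
SPIDER_LOADER = {'example': FakeSpider}

def create_crawler_like(crawler_or_spidercls):
    if isinstance(crawler_or_spidercls, FakeCrawler):
        return crawler_or_spidercls           # already a Crawler: returned as-is
    if isinstance(crawler_or_spidercls, str):
        # a string: resolve the spider class by name first
        crawler_or_spidercls = SPIDER_LOADER[crawler_or_spidercls]
    return FakeCrawler(crawler_or_spidercls)  # a Spider subclass: wrap it

print(create_crawler_like('example').spidercls is FakeSpider)  # → True
```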
- class scrapy.crawler.CrawlerProcess(settings=None, install_root_handler=True)[source]
Bases: scrapy.crawler.CrawlerRunner
A class to run multiple Scrapy crawlers in a process simultaneously.
This class extends CrawlerRunner by adding support for starting a reactor and handling shutdown signals, like the keyboard interrupt command Ctrl-C. It also configures top-level logging.
This utility should be a better fit than CrawlerRunner if you aren't running another reactor within your application.
The CrawlerProcess object must be instantiated with a Settings object.
Parameters: install_root_handler – whether to install the root logging handler (default: True)
This class shouldn't be needed (since Scrapy is responsible for using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.
crawl(crawler_or_spidercls, *args, **kwargs)- Run a crawler with the provided arguments.
It will call the given Crawler's crawl() method, while keeping track of it so it can be stopped later.
If crawler_or_spidercls isn't a Crawler instance, this method will try to create one using this parameter as the spider class given to it.
Returns a deferred that is fired when the crawling is finished.
Parameters:
- **crawler_or_spidercls** (*Crawler instance, Spider subclass or string*) – already created crawler, or a spider class or spider's name inside the project to create it
- **args** (*list*) – arguments to initialize the spider
- **kwargs** (*dict*) – keyword arguments to initialize the spider
- property crawlers
Set of crawlers started by crawl() and managed by this class.
create_crawler(crawler_or_spidercls)- Return a Crawler object.
- If crawler_or_spidercls is a Crawler, it is returned as-is.
- If crawler_or_spidercls is a Spider subclass, a new Crawler is constructed for it.
- If crawler_or_spidercls is a string, this function finds a spider with this name in a Scrapy project (using spider loader), then creates a Crawler instance for it.
join()- Returns a deferred that is fired when all managed crawlers have completed their executions.
start(stop_after_crawl=True)[source]- This method starts a reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE.
If stop_after_crawl is True, the reactor will be stopped after allcrawlers have finished, using join().
Parameters: stop_after_crawl (boolean) – whether to stop the reactor when all crawlers have finished
stop()- Stops simultaneously all the crawling jobs taking place.
Returns a deferred that is fired when they all have ended.
Settings API
scrapy.settings.SETTINGS_PRIORITIES- Dictionary that sets the key name and priority level of the default settings priorities used in Scrapy.
Each item defines a settings entry point, giving it a code name for identification and an integer priority. Greater priorities take more precedence over lesser ones when setting and retrieving values in the Settings class.
- SETTINGS_PRIORITIES = {
- 'default': 0,
- 'command': 10,
- 'project': 20,
- 'spider': 30,
- 'cmdline': 40,
- }
For a detailed explanation of each settings source, see: Settings.
scrapy.settings.get_settings_priority(priority)[source]- Small helper function that looks up a given string priority in the SETTINGS_PRIORITIES dictionary and returns its numerical value, or directly returns a given numerical priority.
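The helper's behaviour follows directly from the SETTINGS_PRIORITIES dictionary above; a minimal re-implementation (named get_settings_priority_like here to mark it as a sketch rather than the real function) looks like this:

```python
SETTINGS_PRIORITIES = {
    'default': 0,
    'command': 10,
    'project': 20,
    'spider': 30,
    'cmdline': 40,
}

def get_settings_priority_like(priority):
    # String priorities are looked up by name; integers pass through unchanged.
    if isinstance(priority, str):
        return SETTINGS_PRIORITIES[priority]
    return priority

print(get_settings_priority_like('spider'))  # → 30
print(get_settings_priority_like(15))        # → 15
```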
- class scrapy.settings.Settings(values=None, priority='project')[source]
Bases: scrapy.settings.BaseSettings
This object stores Scrapy settings for the configuration of internal components, and can be used for any further customization.
It is a direct subclass and supports all methods of BaseSettings. Additionally, after instantiation of this class, the new object will have the global default settings described in Built-in settings reference already populated.
- class scrapy.settings.BaseSettings(values=None, priority='project')[source]
Instances of this class behave like dictionaries, but store priorities along with their (key, value) pairs, and can be frozen (i.e. marked immutable).
Key-value entries can be passed on initialization with the values argument, and they would take the priority level (unless values is already an instance of BaseSettings, in which case the existing priority levels will be kept). If the priority argument is a string, the priority name will be looked up in SETTINGS_PRIORITIES. Otherwise, a specific integer should be provided.
Once the object is created, new settings can be loaded or updated with the set() method, and can be accessed with the square bracket notation of dictionaries, or with the get() method of the instance and its value conversion variants. When requesting a stored key, the value with the highest priority will be retrieved.
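The priority rules described above can be sketched with a tiny stand-in class (TinyPrioritySettings is hypothetical, not the real BaseSettings): set() only overwrites a stored value when the new priority is greater than or equal to the stored one, and get() returns the winning value.

```python
class TinyPrioritySettings:
    """Toy priority-aware store mimicking the documented set()/get() rules."""

    def __init__(self):
        self._store = {}  # name -> (value, priority)

    def set(self, name, value, priority):
        # Overwrite only when the new priority wins (>= the stored one).
        if name not in self._store or priority >= self._store[name][1]:
            self._store[name] = (value, priority)

    def get(self, name, default=None):
        return self._store[name][0] if name in self._store else default

s = TinyPrioritySettings()
s.set('CONCURRENT_REQUESTS', 16, 0)   # 'default' priority
s.set('CONCURRENT_REQUESTS', 32, 20)  # 'project' priority wins
s.set('CONCURRENT_REQUESTS', 8, 10)   # 'command' priority is lower: ignored
print(s.get('CONCURRENT_REQUESTS'))   # → 32
```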
copy()[source]- Make a deep copy of current settings.
This method returns a new instance of the Settings class, populated with the same values and their priorities.
Modifications to the new object won't be reflected on the original settings.
copy_to_dict()[source]- Make a copy of current settings and convert to a dict.
This method returns a new dict populated with the same values and their priorities as the current settings.
Modifications to the returned dict won't be reflected on the original settings.
This method can be useful, for example, for printing settings in Scrapy shell.
freeze()[source]- Disable further changes to the current settings.
After calling this method, the present state of the settings will become immutable. Trying to change values through the set() method and its variants won't be possible, and will raise an error.
frozencopy()[source]- Return an immutable copy of the current settings.
Alias for a freeze() call in the object returned by copy().
get(name, default=None)[source]- Get a setting value without affecting its original type.
Parameters:
- **name** (*string*) – the setting name
- **default** (*any*) – the value to return if no setting is found
getbool(name, default=False)[source]- Get a setting value as a boolean.
1, '1', True and 'True' return True, while 0, '0', False, 'False' and None return False.
For example, settings populated through environment variables set to '0' will return False when using this method.
Parameters:
- **name** (*string*) – the setting name
- **default** (*any*) – the value to return if no setting is found
getdict(name, default=None)[source]- Get a setting value as a dictionary. If the setting original type is a dictionary, a copy of it will be returned. If it is a string it will be evaluated as a JSON dictionary. In the case that it is a BaseSettings instance itself, it will be converted to a dictionary, containing all its current settings values as they would be returned by get(), and losing all information about priority and mutability.
Parameters:
- **name** (*string*) – the setting name
- **default** (*any*) – the value to return if no setting is found
getfloat(name, default=0.0)[source]- Get a setting value as a float.
Parameters:
- **name** (*string*) – the setting name
- **default** (*any*) – the value to return if no setting is found
getint(name, default=0)[source]- Get a setting value as an int.
Parameters:
- **name** (*string*) – the setting name
- **default** (*any*) – the value to return if no setting is found
getlist(name, default=None)[source]- Get a setting value as a list. If the setting original type is a list, a copy of it will be returned. If it's a string it will be split by ",".
For example, settings populated through environment variables set to 'one,two' will return a list ['one', 'two'] when using this method.
Parameters:
- **name** (*string*) – the setting name
- **default** (*any*) – the value to return if no setting is found
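The conversion rules of getbool(), getlist() and getdict() above can be sketched against raw values. The *_like functions here are illustrative stand-ins, not the real BaseSettings methods, which additionally look the value up by setting name first.

```python
import json

def getbool_like(value):
    # Mirrors the documented truth table.
    if value in (1, '1', True, 'True'):
        return True
    if value in (0, '0', False, 'False', None):
        return False
    raise ValueError("unsupported boolean value: %r" % (value,))

def getlist_like(value):
    # Strings are split by ","; lists are copied.
    return value.split(',') if isinstance(value, str) else list(value)

def getdict_like(value):
    # Strings are evaluated as JSON dictionaries; dicts are copied.
    return json.loads(value) if isinstance(value, str) else dict(value)

print(getbool_like('0'))         # → False
print(getlist_like('one,two'))   # → ['one', 'two']
print(getdict_like('{"a": 1}'))  # → {'a': 1}
```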
getpriority(name)[source]- Return the current numerical priority value of a setting, or None if the given name does not exist.
Parameters: name (string) – the setting name
getwithbase(name)[source]- Get a composition of a dictionary-like setting and its _BASE counterpart.
Parameters: name (string) – name of the dictionary-like setting
maxpriority()[source]- Return the numerical value of the highest priority present throughout all settings, or the numerical value for default from SETTINGS_PRIORITIES if there are no settings stored.
set(name, value, priority='project')[source]- Store a key/value attribute with a given priority.
Settings should be populated before configuring the Crawler object (through the configure() method), otherwise they won't have any effect.
Parameters:
- **name** (*string*) – the setting name
- **value** (*any*) – the value to associate with the setting
- **priority** (*string or int*) – the priority of the setting. Should be a key of SETTINGS_PRIORITIES or an integer
setmodule(module, priority='project')[source]- Store settings from a module with a given priority.
This is a helper function that calls set() for every globally declared uppercase variable of module with the provided priority.
Parameters:
- **module** (*module object or string*) – the module or the path of the module
- **priority** (*string or int*) – the priority of the settings. Should be a key of SETTINGS_PRIORITIES or an integer
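What setmodule() iterates over can be sketched in a few lines: every uppercase name declared in the module is stored with one priority. The setmodule_like function and the dict-backed store are illustrative stand-ins, not the real API.

```python
import types

def setmodule_like(store, module, priority='project'):
    # Store every globally declared uppercase variable of the module.
    for name in dir(module):
        if name.isupper():
            store[name] = (getattr(module, name), priority)

mod = types.ModuleType('mysettings')  # stands in for a settings module
mod.DOWNLOAD_DELAY = 2
mod.BOT_NAME = 'demo'
mod.not_a_setting = True              # lowercase: ignored

settings = {}
setmodule_like(settings, mod)
print(sorted(settings))  # → ['BOT_NAME', 'DOWNLOAD_DELAY']
```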
update(values, priority='project')[source]- Store key/value pairs with a given priority.
This is a helper function that calls set() for every item of values with the provided priority.
If values is a string, it is assumed to be JSON-encoded and parsed into a dict with json.loads() first. If it is a BaseSettings instance, the per-key priorities will be used and the priority parameter ignored. This allows inserting/updating settings with different priorities with a single command.
Parameters:
- **values** (*dict or string or BaseSettings*) – the settings names and values
- **priority** (*string or int*) – the priority of the settings. Should be a key of SETTINGS_PRIORITIES or an integer
SpiderLoader API
- class scrapy.spiderloader.SpiderLoader[source]
This class is in charge of retrieving and handling the spider classes defined across the project.
Custom spider loaders can be employed by specifying their path in the SPIDER_LOADER_CLASS project setting. They must fully implement the scrapy.interfaces.ISpiderLoader interface to guarantee an errorless execution.
from_settings(settings)[source]- This class method is used by Scrapy to create an instance of the class. It's called with the current project settings, and it loads the spiders found recursively in the modules of the SPIDER_MODULES setting.
Parameters: settings (Settings instance) – project settings
load(spider_name)[source]- Get the Spider class with the given name. It'll look into the previously loaded spiders for a spider class with name spider_name and will raise a KeyError if not found.
Parameters: spider_name (str) – spider class name
list()[source]- Get the names of the available spiders in the project.
find_by_request(request)[source]- List the spiders' names that can handle the given request. Will try to match the request's url against the domains of the spiders.
Parameters: request (Request instance) – queried request
Signals API
- class scrapy.signalmanager.SignalManager(sender=_Anonymous)[source]
connect(receiver, signal, **kwargs)[source]- Connect a receiver function to a signal.
The signal can be any object, although Scrapy comes with some predefined signals that are documented in the Signals section.
Parameters:
- **receiver** (*callable*) – the function to be connected
- **signal** (*object*) – the signal to connect to
disconnect(receiver, signal, **kwargs)[source]- Disconnect a receiver function from a signal. This has the opposite effect of the connect() method, and the arguments are the same.
disconnect_all(signal, **kwargs)[source]- Disconnect all receivers from the given signal.
Parameters: signal (object) – the signal to disconnect from
send_catch_log(signal, **kwargs)[source]- Send a signal, catch exceptions and log them.
The keyword arguments are passed to the signal handlers (connected through the connect() method).
send_catch_log_deferred(signal, **kwargs)[source]- Like send_catch_log() but supports returning Deferred objects from signal handlers. Send a signal, catch exceptions and log them.
Returns a Deferred that gets fired once all signal handler deferreds were fired.
The keyword arguments are passed to the signal handlers (connected through the connect() method).
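The connect()/send_catch_log() contract can be illustrated with a toy dispatcher (TinySignalManager is a hypothetical stand-in, not the real SignalManager, which is built on pydispatcher and supports Deferreds): every receiver runs, exceptions are logged instead of propagating, and (receiver, result) pairs are returned.

```python
import logging

class TinySignalManager:
    """Toy dispatcher mimicking connect()/send_catch_log() semantics."""

    def __init__(self):
        self._receivers = {}  # signal -> list of receivers

    def connect(self, receiver, signal):
        self._receivers.setdefault(signal, []).append(receiver)

    def send_catch_log(self, signal, **kwargs):
        responses = []
        for receiver in self._receivers.get(signal, []):
            try:
                # Keyword arguments are passed through to each handler.
                responses.append((receiver, receiver(**kwargs)))
            except Exception as exc:
                # Exceptions are caught and logged, never propagated.
                logging.getLogger(__name__).error("error in %r: %r", receiver, exc)
                responses.append((receiver, exc))
        return responses

spider_opened = object()  # any object can serve as a signal

sm = TinySignalManager()
sm.connect(lambda spider: f'opened {spider}', spider_opened)
sm.connect(lambda spider: 1 / 0, spider_opened)  # raises, but is caught
results = sm.send_catch_log(spider_opened, spider='quotes')
print(results[0][1])  # → 'opened quotes'
```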
Stats Collector API
There are several Stats Collectors available under the scrapy.statscollectors module and they all implement the Stats Collector API defined by the StatsCollector class (which they all inherit from).
- class scrapy.statscollectors.StatsCollector[source]
get_value(key, default=None)[source]- Return the value for the given stats key or default if it doesn't exist.
get_stats()[source]- Get all stats from the currently running spider as a dict.
set_value(key, value)[source]- Set the given value for the given stats key.
set_stats(stats)[source]- Override the current stats with the dict passed in stats argument.
inc_value(key, count=1, start=0)[source]- Increment the value of the given stats key, by the given count, assuming the start value given (when it's not set).
max_value(key, value)[source]- Set the given value for the given key only if current value for the same key is lower than value. If there is no current value for the given key, the value is always set.
min_value(key, value)[source]- Set the given value for the given key only if current value for the same key is greater than value. If there is no current value for the given key, the value is always set.
clear_stats()[source]- Clear all stats.
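The inc_value/max_value/min_value semantics above can be sketched with a plain dict-backed class (TinyStats is a hypothetical stand-in, not the real StatsCollector, which is tied to a running spider):

```python
class TinyStats:
    """Toy stats store mimicking the documented StatsCollector semantics."""

    def __init__(self):
        self._stats = {}

    def get_value(self, key, default=None):
        return self._stats.get(key, default)

    def inc_value(self, key, count=1, start=0):
        # Assume the start value when the key is not set yet.
        self._stats[key] = self._stats.get(key, count=0) if False else self._stats.get(key, start) + count

    def max_value(self, key, value):
        # Only keep the larger of the stored and the new value.
        self._stats[key] = max(self._stats.get(key, value), value)

    def min_value(self, key, value):
        # Only keep the smaller of the stored and the new value.
        self._stats[key] = min(self._stats.get(key, value), value)

stats = TinyStats()
stats.inc_value('item_scraped_count')     # starts at 0, then +1
stats.inc_value('item_scraped_count', 3)
stats.max_value('max_items_per_page', 10)
stats.max_value('max_items_per_page', 4)  # lower: ignored
print(stats.get_value('item_scraped_count'))  # → 4
print(stats.get_value('max_items_per_page'))  # → 10
```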
The following methods are not part of the stats collection API but are instead used when implementing custom stats collectors:
