Stats Collection

Scrapy provides a convenient facility for collecting stats in the form of key/values, where values are often counters. The facility is called the Stats Collector, and can be accessed through the stats attribute of the Crawler API, as illustrated by the examples in the Common Stats Collector uses section below.

However, the Stats Collector is always available, so you can always import it in your module and use its API (to increment or set new stat keys), regardless of whether the stats collection is enabled or not. If it’s disabled, the API will still work but it won’t collect anything. This is aimed at simplifying the stats collector usage: you should spend no more than one line of code for collecting stats in your spider, Scrapy extension, or whatever code you’re using the Stats Collector from.

Another feature of the Stats Collector is that it’s very efficient (when enabled) and extremely efficient (almost unnoticeable) when disabled.

The Stats Collector keeps a stats table per open spider which is automatically opened when the spider is opened, and closed when the spider is closed.
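
Because the stats table follows the spider’s lifetime, a spider can record stats at any point while it is running. Here is a minimal sketch (the spider name, URL, and stat key are illustrative, not from the original) that increments a custom counter from a callback through self.crawler.stats:

    import scrapy

    class StatsAwareSpider(scrapy.Spider):
        name = 'stats_aware'
        start_urls = ['http://example.com']

        def parse(self, response):
            # The stats table for this spider is already open here; it is
            # closed (and its contents dumped) when the spider closes.
            self.crawler.stats.inc_value('stats_aware/pages_parsed')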

Common Stats Collector uses

Access the stats collector through the stats attribute. Here is an example of an extension that accesses stats:

    class ExtensionThatAccessStats(object):

        def __init__(self, stats):
            self.stats = stats

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.stats)
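
Note that the extension still has to be enabled for from_crawler to be called; in a project this is typically done through the EXTENSIONS setting (the module path below is hypothetical):

    EXTENSIONS = {
        'myproject.extensions.ExtensionThatAccessStats': 500,
    }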

Set stat value:

    stats.set_value('hostname', socket.gethostname())

Increment stat value:

    stats.inc_value('custom_count')

Set stat value only if greater than previous:

    stats.max_value('max_items_scraped', value)

Set stat value only if lower than previous:

    stats.min_value('min_free_memory_percent', value)

Get stat value:

    >>> stats.get_value('custom_count')
    1
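
get_value also accepts a default to return when the key has not been set yet (the key below is illustrative):

    >>> stats.get_value('missing_key', 0)
    0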

Get all stats:

    >>> stats.get_stats()
    {'custom_count': 1, 'start_time': datetime.datetime(2009, 7, 14, 21, 47, 28, 977139)}

Available Stats Collectors

Besides the basic StatsCollector there are other Stats Collectors available in Scrapy which extend the basic Stats Collector. You can select which Stats Collector to use through the STATS_CLASS setting. The default Stats Collector used is the MemoryStatsCollector.
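
For example, selecting a different collector is done by pointing STATS_CLASS at its import path in your project settings (the custom class below is hypothetical):

    # settings.py
    STATS_CLASS = 'myproject.statscollectors.CustomStatsCollector'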

MemoryStatsCollector

  • class scrapy.statscollectors.MemoryStatsCollector[source]
  • A simple stats collector that keeps the stats of the last scraping run (for each spider) in memory, after they’re closed. The stats can be accessed through the spider_stats attribute, which is a dict keyed by spider name.

This is the default Stats Collector used in Scrapy.

  • spider_stats
  • A dict of dicts (keyed by spider name) containing the stats of the last scraping run for each spider.
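
As a sketch of how these stats can be read back after a crawl (reusing the hypothetical StatsAwareSpider from above, and assuming the default MemoryStatsCollector):

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    crawler = process.create_crawler(StatsAwareSpider)
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes

    # spider_stats maps each spider name to the stats of its last run
    print(crawler.stats.spider_stats['stats_aware'])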

DummyStatsCollector

  • class scrapy.statscollectors.DummyStatsCollector[source]
  • A Stats collector which does nothing but is very efficient (because it does nothing). This stats collector can be set via the STATS_CLASS setting, to disable stats collection in order to improve performance. However, the performance penalty of stats collection is usually marginal compared to other Scrapy workloads like parsing pages.
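
Disabling stats collection this way is a one-line settings change:

    # settings.py
    STATS_CLASS = 'scrapy.statscollectors.DummyStatsCollector'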