Benchmarking

Scrapy comes with a simple benchmarking suite that spawns a local HTTP server and crawls it at the maximum possible speed. The goal of this benchmark is to get an idea of how Scrapy performs on your hardware, so that you have a common baseline for comparisons. It uses a simple spider that does nothing but follow links.

To run it, use:

  scrapy bench

You should see output like this:

  2016-12-16 21:18:48 [scrapy.utils.log] INFO: Scrapy 1.2.2 started (bot: quotesbot)
  2016-12-16 21:18:48 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['quotesbot.spiders'], 'LOGSTATS_INTERVAL': 1, 'BOT_NAME': 'quotesbot', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'quotesbot.spiders'}
  2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled extensions:
  ['scrapy.extensions.closespider.CloseSpider',
   'scrapy.extensions.logstats.LogStats',
   'scrapy.extensions.telnet.TelnetConsole',
   'scrapy.extensions.corestats.CoreStats']
  2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
  ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
   'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
   'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
   'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
   'scrapy.downloadermiddlewares.retry.RetryMiddleware',
   'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
   'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
   'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
   'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
   'scrapy.downloadermiddlewares.stats.DownloaderStats']
  2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled spider middlewares:
  ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
   'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
   'scrapy.spidermiddlewares.referer.RefererMiddleware',
   'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
   'scrapy.spidermiddlewares.depth.DepthMiddleware']
  2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled item pipelines:
  []
  2016-12-16 21:18:49 [scrapy.core.engine] INFO: Spider opened
  2016-12-16 21:18:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:50 [scrapy.extensions.logstats] INFO: Crawled 70 pages (at 4200 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:51 [scrapy.extensions.logstats] INFO: Crawled 134 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:52 [scrapy.extensions.logstats] INFO: Crawled 198 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:53 [scrapy.extensions.logstats] INFO: Crawled 254 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:54 [scrapy.extensions.logstats] INFO: Crawled 302 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:55 [scrapy.extensions.logstats] INFO: Crawled 358 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:56 [scrapy.extensions.logstats] INFO: Crawled 406 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:57 [scrapy.extensions.logstats] INFO: Crawled 438 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:58 [scrapy.extensions.logstats] INFO: Crawled 470 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:59 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
  2016-12-16 21:18:59 [scrapy.extensions.logstats] INFO: Crawled 518 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:19:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
  {'downloader/request_bytes': 229995,
   'downloader/request_count': 534,
   'downloader/request_method_count/GET': 534,
   'downloader/response_bytes': 1565504,
   'downloader/response_count': 534,
   'downloader/response_status_count/200': 534,
   'finish_reason': 'closespider_timeout',
   'finish_time': datetime.datetime(2016, 12, 16, 16, 19, 0, 647725),
   'log_count/INFO': 17,
   'request_depth_max': 19,
   'response_received_count': 534,
   'scheduler/dequeued': 533,
   'scheduler/dequeued/memory': 533,
   'scheduler/enqueued': 10661,
   'scheduler/enqueued/memory': 10661,
   'start_time': datetime.datetime(2016, 12, 16, 16, 18, 49, 799869)}
  2016-12-16 21:19:00 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)

That tells you that Scrapy can crawl about 3000 pages per minute on the hardware where you run it. Note that this is a very simple spider intended only to follow links; any custom spider you write will probably do more work, which results in slower crawl rates. How much slower depends on how much your spider does and how well it’s written.
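The "about 3000 pages per minute" figure can be checked from the LogStats lines themselves. The following is a minimal sketch, not part of Scrapy: the helper name `average_pages_per_minute` and the regular expression are illustrative assumptions, and the sample text is three LogStats lines copied from the output above. It assumes the default Scrapy log format (timestamp, logger name in brackets, level, message).

```python
import re
from datetime import datetime

# Three LogStats lines copied from the benchmark output above.
SAMPLE_LOG = """\
2016-12-16 21:18:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:18:54 [scrapy.extensions.logstats] INFO: Crawled 302 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:18:59 [scrapy.extensions.logstats] INFO: Crawled 518 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
"""

# Hypothetical pattern: captures the timestamp and the cumulative page count.
LOGSTATS_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[scrapy\.extensions\.logstats\] "
    r"INFO: Crawled (\d+) pages"
)


def average_pages_per_minute(log_text):
    """Return the mean crawl rate between the first and last LogStats lines."""
    samples = []
    for line in log_text.splitlines():
        match = LOGSTATS_RE.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            samples.append((when, int(match.group(2))))
    if len(samples) < 2:
        raise ValueError("need at least two LogStats lines")
    (start, first_pages), (end, last_pages) = samples[0], samples[-1]
    elapsed = (end - start).total_seconds()
    if elapsed <= 0:
        raise ValueError("LogStats lines must span a positive time interval")
    # Pages crawled over the interval, scaled from seconds to minutes.
    return (last_pages - first_pages) / elapsed * 60


print(round(average_pages_per_minute(SAMPLE_LOG)))  # 518 pages in 10 s -> 3108
```

Averaging over the whole run, rather than reading a single per-second line, smooths out the fluctuation visible in the output (from 4200 down to 1920 pages/min) and gives a steadier number for comparing machines or Scrapy versions.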

In the future, more cases will be added to the benchmarking suite to cover other common scenarios.