Benchmarking

New in version 0.17.

Scrapy comes with a simple benchmarking suite that spawns a local HTTP server and crawls it at the maximum possible speed. The goal of this benchmarking is to get an idea of how Scrapy performs on your hardware, in order to have a common baseline for comparisons. It uses a simple spider that does nothing and just follows links.

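For illustration only, a spider of that kind could look roughly like the sketch below; this is not the actual spider shipped with the bench command, and the start URL is a placeholder rather than the real address of the benchmark server:

  import scrapy
  from scrapy.linkextractors import LinkExtractor

  class FollowAllSpider(scrapy.Spider):
      # Illustrative sketch: scrape nothing, just follow every link found,
      # similar in spirit to the spider used by the bench command.
      name = "followall"
      start_urls = ["http://localhost:8000/"]  # placeholder URL

      def parse(self, response):
          # Extract all links from the page and schedule them for crawling.
          for link in LinkExtractor().extract_links(response):
              yield scrapy.Request(link.url, callback=self.parse)
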
To run it, use:

  scrapy bench

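Like other Scrapy commands, bench accepts the usual -s option to override settings. For instance, assuming you want a longer run than the default 10 seconds, you could raise CLOSESPIDER_TIMEOUT (the sample output below is from a plain default run):

  scrapy bench -s CLOSESPIDER_TIMEOUT=30
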
You should see an output like this:

  2016-12-16 21:18:48 [scrapy.utils.log] INFO: Scrapy 1.2.2 started (bot: quotesbot)
  2016-12-16 21:18:48 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['quotesbot.spiders'], 'LOGSTATS_INTERVAL': 1, 'BOT_NAME': 'quotesbot', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'quotesbot.spiders'}
  2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled extensions:
  ['scrapy.extensions.closespider.CloseSpider',
   'scrapy.extensions.logstats.LogStats',
   'scrapy.extensions.telnet.TelnetConsole',
   'scrapy.extensions.corestats.CoreStats']
  2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
  ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
   'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
   'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
   'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
   'scrapy.downloadermiddlewares.retry.RetryMiddleware',
   'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
   'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
   'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
   'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
   'scrapy.downloadermiddlewares.stats.DownloaderStats']
  2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled spider middlewares:
  ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
   'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
   'scrapy.spidermiddlewares.referer.RefererMiddleware',
   'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
   'scrapy.spidermiddlewares.depth.DepthMiddleware']
  2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled item pipelines:
  []
  2016-12-16 21:18:49 [scrapy.core.engine] INFO: Spider opened
  2016-12-16 21:18:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:50 [scrapy.extensions.logstats] INFO: Crawled 70 pages (at 4200 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:51 [scrapy.extensions.logstats] INFO: Crawled 134 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:52 [scrapy.extensions.logstats] INFO: Crawled 198 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:53 [scrapy.extensions.logstats] INFO: Crawled 254 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:54 [scrapy.extensions.logstats] INFO: Crawled 302 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:55 [scrapy.extensions.logstats] INFO: Crawled 358 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:56 [scrapy.extensions.logstats] INFO: Crawled 406 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:57 [scrapy.extensions.logstats] INFO: Crawled 438 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:58 [scrapy.extensions.logstats] INFO: Crawled 470 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:18:59 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
  2016-12-16 21:18:59 [scrapy.extensions.logstats] INFO: Crawled 518 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
  2016-12-16 21:19:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
  {'downloader/request_bytes': 229995,
   'downloader/request_count': 534,
   'downloader/request_method_count/GET': 534,
   'downloader/response_bytes': 1565504,
   'downloader/response_count': 534,
   'downloader/response_status_count/200': 534,
   'finish_reason': 'closespider_timeout',
   'finish_time': datetime.datetime(2016, 12, 16, 16, 19, 0, 647725),
   'log_count/INFO': 17,
   'request_depth_max': 19,
   'response_received_count': 534,
   'scheduler/dequeued': 533,
   'scheduler/dequeued/memory': 533,
   'scheduler/enqueued': 10661,
   'scheduler/enqueued/memory': 10661,
   'start_time': datetime.datetime(2016, 12, 16, 16, 18, 49, 799869)}
  2016-12-16 21:19:00 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)

That tells you that Scrapy is able to crawl about 3000 pages per minute on the hardware where you run it. Note that this is a very simple spider intended only to follow links; any custom spider you write will probably do more work, which results in slower crawl rates. How much slower depends on how much your spider does and how well it's written.

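As a sanity check, that figure can be derived from the dumped stats by dividing the number of responses by the elapsed time between start_time and finish_time. The snippet below simply redoes that arithmetic with the values from the run above:

  from datetime import datetime

  # Values copied from the stats dump above.
  start = datetime(2016, 12, 16, 16, 18, 49, 799869)
  finish = datetime(2016, 12, 16, 16, 19, 0, 647725)
  responses = 534

  elapsed_minutes = (finish - start).total_seconds() / 60
  print(round(responses / elapsed_minutes))  # roughly 2950 pages/min
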
In the future, more cases will be added to the benchmarking suite to cover other common scenarios.