- Broad Crawls
- Use the right SCHEDULER_PRIORITY_QUEUE
- Increase concurrency
- Increase Twisted IO thread pool maximum size
- Setup your own DNS
- Reduce log level
- Disable cookies
- Disable retries
- Reduce download timeout
- Disable redirects
- Enable crawling of “Ajax Crawlable Pages”
- Crawl in BFO order
- Be mindful of memory leaks
- Install a specific Twisted reactor
Scrapy defaults are optimized for crawling specific sites. These sites are often handled by a single Scrapy spider, although this is not necessary or required (for example, there are generic spiders that handle any given site thrown at them).
In addition to this “focused crawl”, there is another common type of crawling which covers a large (potentially unlimited) number of domains, and is only limited by time or another arbitrary constraint, rather than stopping when the domain was crawled to completion or when there are no more requests to perform. These are called “broad crawls” and are the typical crawls performed by search engines.
These are some common properties often found in broad crawls:
- they crawl many domains (often, unbounded) instead of a specific set of sites
- they don’t necessarily crawl domains to completion, because it would be impractical (or impossible) to do so, and instead limit the crawl by time or number of pages crawled
- they are simpler in logic (as opposed to very complex spiders with many extraction rules) because data is often post-processed in a separate stage
- they crawl many domains concurrently, which allows them to achieve faster crawl speeds by not being limited by any particular site constraint (each site is crawled slowly to respect politeness, but many sites are crawled in parallel)
As said above, Scrapy default settings are optimized for focused crawls, not broad crawls. However, due to its asynchronous architecture, Scrapy is very well suited for performing fast broad crawls. This page summarizes some things you need to keep in mind when using Scrapy for doing broad crawls, along with concrete suggestions of Scrapy settings to tune in order to achieve an efficient broad crawl.
Scrapy’s default scheduler priority queue is 'scrapy.pqueues.ScrapyPriorityQueue'. It works best during single-domain crawls. It does not work well when crawling many different domains in parallel.
To apply the recommended priority queue use:
- SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
Concurrency is the number of requests that are processed in parallel. There is a global limit (CONCURRENT_REQUESTS) and an additional limit that can be set either per domain (CONCURRENT_REQUESTS_PER_DOMAIN) or per IP.
The default global concurrency limit in Scrapy is not suitable for crawling many different domains in parallel, so you will want to increase it. How much to increase it will depend on how much CPU and memory your crawler will have available.
A good starting point is
- CONCURRENT_REQUESTS = 100
But the best way to find out is by doing some trials and identifying at what concurrency your Scrapy process becomes CPU bound. For optimum performance, you should pick a concurrency where CPU usage is at 80-90%.
Increasing concurrency also increases memory usage. If memory usage is a concern, you might need to lower your global concurrency limit accordingly.
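The trial procedure described above can be sketched as a small helper. This is only an illustration: the function name `pick_concurrency`, the `cpu_ceiling` parameter, and the sample measurements are hypothetical, and the CPU figures would have to come from your own trial crawls.

```python
def pick_concurrency(trials, cpu_ceiling=90.0):
    """Return the highest trialled concurrency whose CPU usage is at or below cpu_ceiling.

    trials: iterable of (concurrency, cpu_percent) pairs measured in trial crawls.
    """
    candidates = [concurrency for concurrency, cpu in trials if cpu <= cpu_ceiling]
    if not candidates:
        raise ValueError("every trial exceeded the CPU ceiling; try a lower concurrency")
    return max(candidates)


# Hypothetical measurements: (CONCURRENT_REQUESTS used, observed CPU usage in %)
trials = [(50, 35.0), (100, 62.0), (200, 84.0), (400, 99.0)]

# Value to place in the project settings after the trials
CONCURRENT_REQUESTS = pick_concurrency(trials)
print(CONCURRENT_REQUESTS)
```

If memory rather than CPU turns out to be the limiting factor, the same idea applies with memory measurements in place of CPU usage.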
Currently Scrapy does DNS resolution in a blocking way, using a thread pool. With higher concurrency levels the crawling could be slow or even fail, hitting DNS resolver timeouts. A possible solution is to increase the number of threads handling DNS queries. The DNS queue will then be processed faster, speeding up connection establishment and crawling overall.
To increase maximum thread pool size use:
- REACTOR_THREADPOOL_MAXSIZE = 20
If you have multiple crawling processes and a single central DNS, it can act like a DoS attack on the DNS server, slowing down the entire network or even getting your machines blocked. To avoid this, set up your own DNS server with a local cache and upstream to some large DNS like OpenDNS or Verizon.
When doing broad crawls you are often only interested in the crawl rates you get and any errors found. These stats are reported by Scrapy when using the INFO log level. In order to save CPU (and log storage requirements) you should not use the DEBUG log level when performing large broad crawls in production. Using the DEBUG level when developing your (broad) crawler may be fine though.
To set the log level use:
- LOG_LEVEL = 'INFO'
Disable cookies unless you really need them. Cookies are often not needed when doing broad crawls (search engine crawlers ignore them), and disabling them improves performance by saving some CPU cycles and reducing the memory footprint of your Scrapy crawler.
To disable cookies use:
- COOKIES_ENABLED = False
Retrying failed HTTP requests can slow down the crawls substantially, especially when sites are very slow (or fail) to respond, thus causing a timeout error which gets retried many times, unnecessarily, preventing crawler capacity from being reused for other domains.
To disable retries use:
- RETRY_ENABLED = False
Unless you are crawling from a very slow connection (which shouldn’t be the case for broad crawls) reduce the download timeout so that stuck requests are discarded quickly and free up capacity to process the next ones.
To reduce the download timeout use:
- DOWNLOAD_TIMEOUT = 15
Consider disabling redirects, unless you are interested in following them. When doing broad crawls it’s common to save redirects and resolve them when revisiting the site at a later crawl. This also helps to keep the number of requests constant per crawl batch, otherwise redirect loops may cause the crawler to dedicate too many resources to any specific domain.
To disable redirects use:
- REDIRECT_ENABLED = False
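For convenience, the settings suggested so far can be collected in one place. The sketch below gathers them in a plain dictionary; the name `BROAD_CRAWL_SETTINGS` is hypothetical, and the values are just the starting points given on this page, not tuned recommendations for any particular crawl.

```python
# Starting values taken from the suggestions on this page; tune them for your crawl.
BROAD_CRAWL_SETTINGS = {
    "SCHEDULER_PRIORITY_QUEUE": "scrapy.pqueues.DownloaderAwarePriorityQueue",
    "CONCURRENT_REQUESTS": 100,
    "REACTOR_THREADPOOL_MAXSIZE": 20,
    "LOG_LEVEL": "INFO",
    "COOKIES_ENABLED": False,
    "RETRY_ENABLED": False,
    "DOWNLOAD_TIMEOUT": 15,
    "REDIRECT_ENABLED": False,
}

# Print the entries in the "NAME = value" form used in a settings module.
for name, value in BROAD_CRAWL_SETTINGS.items():
    print(f"{name} = {value!r}")
```

The printed lines can be copied into the project's settings module as they are.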
Some pages (up to 1%, based on empirical data from year 2013) declare themselves as ajax crawlable. This means they provide a plain HTML version of content that is usually available only via AJAX. Pages can indicate it in two ways:
- by using #! in the URL - this is the default way;
- by using a special meta tag - this way is used on “main”, “index” website pages.
Scrapy handles the first case automatically; to handle the second, enable AjaxCrawlMiddleware:
- AJAXCRAWL_ENABLED = True
When doing broad crawls it’s common to crawl a lot of “index” web pages; AjaxCrawlMiddleware helps to crawl them correctly. It is turned OFF by default because it has some performance overhead, and enabling it for focused crawls doesn’t make much sense.
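To make the #! convention more concrete, here is a minimal sketch of how such a URL maps to the URL of its plain HTML version under the AJAX crawling scheme. The function name `escape_hashbang` is hypothetical and the code is a standard-library illustration of the idea, not Scrapy's own implementation.

```python
from urllib.parse import quote, urldefrag


def escape_hashbang(url: str) -> str:
    """Rewrite a '#!' URL to its '_escaped_fragment_' form; return other URLs unchanged."""
    base, fragment = urldefrag(url)
    if not fragment.startswith("!"):
        return url
    # Append to an existing query string if there is one, otherwise start one.
    separator = "&" if "?" in base else "?"
    # Percent-encode everything after the '!' so it is safe as a query value.
    return f"{base}{separator}_escaped_fragment_={quote(fragment[1:], safe='')}"


print(escape_hashbang("http://www.example.com/ajax.html#!key=value"))
```

URLs without a #! fragment pass through untouched, so a function like this could be applied to every URL before it is requested.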
Scrapy crawls in DFO order by default. In broad crawls, however, page crawling tends to be faster than page processing. As a result, unprocessed early requests stay in memory until the final depth is reached, which can significantly increase memory usage.
Crawl in BFO order instead to save memory.
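The settings for this are not listed on this page. As a sketch, assuming the values the Scrapy FAQ gives for breadth-first order, the change amounts to switching the scheduler queues to FIFO variants and giving shallower requests priority:

```python
# Assumption: these are the BFO settings documented in the Scrapy FAQ,
# not something stated on this page.
DEPTH_PRIORITY = 1  # positive value: requests at lower depth are scheduled first
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```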
If the crawl is exceeding the system’s capabilities, you might want to try installing a specific Twisted reactor, via the