Scrapy at a glance

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.

Walk-through of an example spider

In order to show you what Scrapy brings to the table, we’ll walk you through an example of a Scrapy Spider using the simplest way to run a spider.

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

  import scrapy


  class QuotesSpider(scrapy.Spider):
      name = 'quotes'
      start_urls = [
          'http://quotes.toscrape.com/tag/humor/',
      ]

      def parse(self, response):
          for quote in response.css('div.quote'):
              yield {
                  'author': quote.xpath('span/small/text()').get(),
                  'text': quote.css('span.text::text').get(),
              }

          next_page = response.css('li.next a::attr("href")').get()
          if next_page is not None:
              yield response.follow(next_page, self.parse)

Put this in a text file, name it to something like quotes_spider.py and run the spider using the runspider command:

  scrapy runspider quotes_spider.py -o quotes.json

When this finishes you will have in the quotes.json file a list of the quotes in JSON format, containing text and author, looking like this (reformatted here for better readability):

  [{
      "author": "Jane Austen",
      "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
  },
  {
      "author": "Groucho Marx",
      "text": "\u201cOutside of a dog, a book is man's best friend. Inside of a dog it's too dark to read.\u201d"
  },
  {
      "author": "Steve Martin",
      "text": "\u201cA day without sunshine is like, you know, night.\u201d"
  },
  ...]

What just happened?

When you ran the command scrapy runspider quotes_spider.py, Scrapy looked for a Spider definition inside it and ran it through its crawler engine.

The crawl started by making requests to the URLs defined in the start_urls attribute (in this case, only the URL for quotes in the humor category) and called the default callback method parse, passing the response object as an argument. In the parse callback, we loop through the quote elements using a CSS Selector, yield a Python dict with the extracted quote text and author, look for a link to the next page and schedule another request using the same parse method as callback.

Here you notice one of the main advantages of Scrapy: requests are scheduled and processed asynchronously. This means that Scrapy doesn’t need to wait for a request to be finished and processed; it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.

While this enables you to do very fast crawls (sending multiple concurrent requests at the same time, in a fault-tolerant way), Scrapy also gives you control over the politeness of the crawl through a few settings. You can do things like setting a download delay between each request, limiting the amount of concurrent requests per domain or per IP, and even using an auto-throttling extension that tries to figure these out automatically.
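For instance, these knobs are ordinary Scrapy settings; the sketch below sets them per spider through the custom_settings attribute. The class name and the values chosen are purely illustrative, not recommendations for any particular site:

  import scrapy


  class PoliteQuotesSpider(scrapy.Spider):
      # Hypothetical spider, used only to illustrate the politeness settings below.
      name = 'polite_quotes'
      start_urls = ['http://quotes.toscrape.com/tag/humor/']

      # Per-spider overrides of the global settings (illustrative values).
      custom_settings = {
          'DOWNLOAD_DELAY': 2,                  # wait about 2 seconds between requests
          'CONCURRENT_REQUESTS_PER_DOMAIN': 4,  # cap parallel requests to the same domain
          'AUTOTHROTTLE_ENABLED': True,         # let the AutoThrottle extension adapt the delay
      }

      def parse(self, response):
          # Parsing logic omitted; this sketch only demonstrates the settings.
          pass

The same keys can also go in a project's settings.py to apply to every spider in the project.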

Note

This example uses feed exports to generate the JSON file; you can easily change the export format (XML or CSV, for example) or the storage backend (FTP or Amazon S3, for example). You can also write an item pipeline to store the items in a database.
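Changing the export format can be as simple as passing a different file extension to -o:

  scrapy runspider quotes_spider.py -o quotes.csv

A database-backed pipeline boils down to a class implementing process_item. The sketch below is only illustrative: QuotesWriterPipeline and its in-memory list stand in for whatever real storage code you would write, and the class would still need to be enabled through the ITEM_PIPELINES setting of a Scrapy project:

  # A minimal item pipeline sketch (hypothetical class name).
  class QuotesWriterPipeline:
      def open_spider(self, spider):
          # Called when the spider opens; a real pipeline would connect to its database here.
          self.items = []

      def process_item(self, item, spider):
          # Called for every item the spider yields; return the item so that
          # any later pipelines can keep processing it.
          self.items.append(item)
          return item

      def close_spider(self, spider):
          # Called when the spider closes; a real pipeline would commit and disconnect here.
          spider.logger.info('Collected %d items', len(self.items))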

What else?

You’ve seen how to extract and store items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient, such as:

  • Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.
  • An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very useful when writing or debugging your spiders (a short sample session is sketched after this list).
  • Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem).
  • Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.
  • Strong extensibility support, allowing you to plug in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines).
  • Wide range of built-in extensions and middlewares for handling:
    • cookies and session handling
    • HTTP features like compression, authentication, caching
    • user-agent spoofing
    • robots.txt
    • crawl depth restriction
    • and more
  • A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler
  • Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, a media pipeline for automatically downloading images (or any other media) associated with the scraped items, a caching DNS resolver, and much more!
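As a taste of the interactive shell mentioned above, a session against the same site might look roughly like this; the exact prompt, banner and output depend on your setup and on the live page, so treat the transcript as indicative only:

  scrapy shell 'http://quotes.toscrape.com/tag/humor/'
  ...
  >>> response.css('div.quote span.text::text').get()
  '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'
  >>> response.css('li.next a::attr("href")').get()
  '/tag/humor/page/2/'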

What’s next?

The next steps for you are to install Scrapy, follow through the tutorial to learn how to create a full-blown Scrapy project and join the community. Thanks for your interest!