Selecting dynamically-loaded content

Some webpages show the desired data when you load them in a web browser. However, when you download them using Scrapy, you cannot reach the desired data using selectors.

When this happens, the recommended approach is to find the data source and extract the data from it.

If you fail to do that, and you can nonetheless access the desired data through the DOM from your web browser, see Pre-rendering JavaScript.

Finding the data source

To extract the desired data, you must first find its source location.

If the data is in a non-text-based format, such as an image or a PDF document, use the network tool of your web browser to find the corresponding request, and reproduce it.
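For instance, once the network tool shows which request returns the image or PDF, a minimal sketch of reproducing it from a spider callback might look like the following (the page URL, file URL, and output path are hypothetical):

    import scrapy


    class FileSpider(scrapy.Spider):
        name = "file"
        start_urls = ["https://example.com/page"]  # hypothetical page

        def parse(self, response):
            # URL copied from the request shown in the browser's network tool
            yield scrapy.Request("https://example.com/report.pdf", callback=self.save_file)

        def save_file(self, response):
            # response.body holds the raw bytes of the non-text resource
            with open("report.pdf", "wb") as f:
                f.write(response.body)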

If your web browser lets you select the desired data as text, the data may be defined in embedded JavaScript code, or loaded from an external resource in a text-based format.

In that case, you can use a tool like wgrep to find the URL of that resource.

If the data turns out to come from the original URL itself, you must inspect the source code of the webpage to determine where the data is located.

If the data comes from a different URL, you will need to reproduce thecorresponding request.

Inspecting the source code of a webpage

Sometimes you need to inspect the source code of a webpage (not the DOM) to determine where some desired data is located.

Use Scrapy’s fetch command to download the webpage contents as seen by Scrapy:

    scrapy fetch --nolog https://example.com > response.html

If the desired data is in embedded JavaScript code within a <script/> element, see Parsing JavaScript code.

If you cannot find the desired data, first make sure it’s not just Scrapy: download the webpage with an HTTP client like curl or wget and see if the information can be found in the response they get.

If they get a response with the desired data, modify your Scrapy Request to match that of the other HTTP client. For example, try using the same user-agent string (USER_AGENT) or the same headers.
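As a hedged sketch, matching the user-agent string and headers of the working HTTP client could look like this (the URL, user-agent value, and header are placeholders for whatever the other client actually sent):

    import scrapy


    class MatchingSpider(scrapy.Spider):
        name = "matching"

        # Match the user-agent string that the working HTTP client sent
        custom_settings = {"USER_AGENT": "curl/8.4.0"}

        def start_requests(self):
            yield scrapy.Request(
                "https://example.com/data",  # hypothetical URL
                headers={"Accept": "application/json"},  # header copied from the other client
                callback=self.parse,
            )

        def parse(self, response):
            self.logger.info("Got %d bytes", len(response.body))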

If they also get a response without the desired data, you’ll need to take steps to make your request more similar to that of the web browser. See Reproducing requests.

Reproducing requests

Sometimes we need to reproduce a request the way our web browser performs it.

Use the network tool of your web browser to see how your web browser performs the desired request, and try to reproduce that request with Scrapy.

It might be enough to yield a Request with the same HTTP method and URL. However, you may also need to reproduce the body, headers and form parameters (see FormRequest) of that request.
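For example, a sketch of reproducing a POST request with form parameters observed in the network tool might look like this (the endpoint and field names are made up for illustration):

    import scrapy
    from scrapy import FormRequest


    class SearchSpider(scrapy.Spider):
        name = "search"

        def start_requests(self):
            # Reproduce the form parameters observed in the browser's network tool
            yield FormRequest(
                "https://example.com/api/search",          # hypothetical endpoint
                formdata={"query": "books", "page": "1"},  # hypothetical form fields
                callback=self.parse_results,
            )

        def parse_results(self, response):
            yield {"raw": response.text}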

As all major browsers allow exporting requests in cURL format, Scrapy incorporates the method from_curl() to generate an equivalent Request from a cURL command. For more information, see request from curl inside the network tool section.
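For instance, a cURL command copied from the network tool can be turned into a request roughly as follows (the command below is a placeholder):

    from scrapy import Request

    # cURL command copied from the browser's network tool (placeholder values)
    curl_command = (
        "curl 'https://example.com/api/items' "
        "-H 'Accept: application/json' "
        "-H 'Referer: https://example.com/'"
    )

    # Build an equivalent scrapy.Request from the cURL command
    request = Request.from_curl(curl_command)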

Once you get the expected response, you can extract the desired data from it.

You can reproduce any request with Scrapy. However, sometimes reproducing all necessary requests may not seem efficient in developer time. If that is your case, and crawling speed is not a major concern for you, you can alternatively consider JavaScript pre-rendering.

If you get the expected response sometimes, but not always, the issue is probably not your request, but the target server. The target server might be buggy, overloaded, or banning some of your requests.

Handling different response formats

Once you have a response with the desired data, how you extract the desired data from it depends on the type of response:

  • If the response is JSON, you can load the desired data from response.text:

    data = json.loads(response.text)

If the desired data is inside HTML or XML code embedded within JSON data, you can load that HTML or XML code into a Selector and then use it as usual:

    selector = Selector(data['html'])
  • If the response is HTML or XML, use selectors as usual.

  • If the response is JavaScript, or HTML with a <script/> element containing the desired data, see Parsing JavaScript code.

  • If the response is an image or another format based on images (e.g. PDF), read the response as bytes from response.body and use an OCR solution to extract the desired data as text.

For example, you can use pytesseract; a minimal sketch is shown after this list. To read a table from a PDF, tabula-py may be a better choice.

  • If the response is SVG, or HTML with embedded SVG containing the desired data, you may be able to extract the desired data using selectors, since SVG is based on XML.

Otherwise, you might need to convert the SVG code into a raster image, and handle that raster image.
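As mentioned above, a minimal sketch of the OCR approach with pytesseract could look like the following; it assumes the response body is an image that Pillow can open and that the Tesseract binary is installed, and ocr_from_response is a hypothetical helper you could call from a spider callback:

    import io

    import pytesseract
    from PIL import Image


    def ocr_from_response(response):
        """Extract text from an image response using OCR (hypothetical helper)."""
        # Load the raw image bytes from response.body into Pillow
        image = Image.open(io.BytesIO(response.body))
        # Run Tesseract OCR on the image and return the recognized text
        return pytesseract.image_to_string(image)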

Parsing JavaScript code

If the desired data is hardcoded in JavaScript, you first need to get the JavaScript code:

  • If the JavaScript code is in a JavaScript file, simply read response.text.
  • If the JavaScript code is within a <script/> element of an HTML page, use selectors to extract the text within that <script/> element.

Once you have a string with the JavaScript code, you can extract the desired data from it:

  • You might be able to use a regular expression to extract the desired data in JSON format, which you can then parse with json.loads().

    For example, if the JavaScript code contains a separate line like var data = {"field": "value"}; you can extract that data as follows:

    >>> import json
    >>> pattern = r'\bvar\s+data\s*=\s*(\{.*?\})\s*;\s*\n'
    >>> json_data = response.css('script::text').re_first(pattern)
    >>> json.loads(json_data)
    {'field': 'value'}
  • Otherwise, use js2xml to convert the JavaScript code into an XML document that you can parse using selectors.

For example, if the JavaScript code contains var data = {field: "value"}; you can extract that data as follows:

    >>> import js2xml
    >>> import lxml.etree
    >>> from parsel import Selector
    >>> javascript = response.css('script::text').get()
    >>> xml = lxml.etree.tostring(js2xml.parse(javascript), encoding='unicode')
    >>> selector = Selector(text=xml)
    >>> selector.css('var[name="data"]').get()
    '<var name="data"><object><property name="field"><string>value</string></property></object></var>'
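Building on the example above, you can drill further into the converted XML to get just the value. A small self-contained sketch, using a hard-coded JavaScript snippet instead of a response, might look like this:

    >>> import js2xml
    >>> import lxml.etree
    >>> from parsel import Selector
    >>> javascript = 'var data = {field: "value"};'
    >>> xml = lxml.etree.tostring(js2xml.parse(javascript), encoding='unicode')
    >>> Selector(text=xml).css('var[name="data"] string::text').get()
    'value'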

Pre-rendering JavaScript

On webpages that fetch data from additional requests, reproducing those requests that contain the desired data is the preferred approach. The effort is often worth the result: structured, complete data with minimum parsing time and network transfer.

However, sometimes it can be really hard to reproduce certain requests. Or you may need something that no request can give you, such as a screenshot of a webpage as seen in a web browser.

In these cases, use the Splash JavaScript-rendering service, along with scrapy-splash for seamless integration.

Splash returns as HTML the DOM of a webpage, so that you can parse it with selectors. It provides great flexibility through configuration or scripting.
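For illustration, a hedged sketch of a spider using scrapy-splash (assuming a Splash instance is running and the scrapy-splash settings from its documentation are configured) might look like this; the URL and wait time are placeholders:

    import scrapy
    from scrapy_splash import SplashRequest


    class RenderedSpider(scrapy.Spider):
        name = "rendered"

        def start_requests(self):
            # Ask Splash to render the page and wait briefly for JavaScript to run
            yield SplashRequest(
                "https://example.com",  # hypothetical URL
                callback=self.parse,
                args={"wait": 0.5},
            )

        def parse(self, response):
            # The response body is the rendered DOM, so normal selectors work
            yield {"title": response.css("title::text").get()}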

If you need something beyond what Splash offers, such as interacting with the DOM on-the-fly from Python code instead of using a previously-written script, or handling multiple web browser windows, you might need to use a headless browser instead.

Using a headless browser

A headless browser is a special web browser that provides an API for automation.

The easiest way to use a headless browser with Scrapy is to use Selenium, along with scrapy-selenium for seamless integration.
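As a hedged sketch, a spider using scrapy-selenium (assuming the driver-related settings described in the scrapy-selenium documentation are configured) might look like this; the URL is a placeholder:

    import scrapy
    from scrapy_selenium import SeleniumRequest


    class HeadlessSpider(scrapy.Spider):
        name = "headless"

        def start_requests(self):
            # Render the page in the headless browser before it reaches the callback
            yield SeleniumRequest(url="https://example.com", callback=self.parse)

        def parse(self, response):
            # The response contains the HTML as rendered by the browser
            yield {"title": response.css("title::text").get()}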