Requests and Responses

Scrapy uses Request and Response objects for crawling websites.

Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.

Both Request and Response classes have subclasses which add functionality not required in the base classes. These are described below in Request subclasses and Response subclasses.

Request objects

  • class scrapy.http.Request(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, flags=None, cb_kwargs=None)
  • A Request object represents an HTTP request, which is usually generated in the Spider and executed by the Downloader, thus generating a Response.

Parameters:

  • url (string) – the URL of this request

If the URL is invalid, a ValueError exception is raised.

  • callback (callable) – the function that will be called with the response of this request (once it’s downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
  • method (string) – the HTTP method of this request. Defaults to 'GET'.
  • meta (dict) – the initial values for the Request.meta attribute. If given, the dict passed in this parameter will be shallow copied.
  • body (bytes or str) – the request body. If a str is passed, it is encoded to bytes using the encoding passed (which defaults to utf-8). If body is not given, an empty bytes object is stored. Regardless of the type of this argument, the final value stored will be bytes (never str or None).
  • headers (dict) – the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers). If None is passed as value, the HTTP header will not be sent at all.
  • cookies (dict or list) – the request cookies. These can be sent in two forms.

    • Using a dict:

        request_with_cookies = Request(url="http://www.example.com",
                                       cookies={'currency': 'USD', 'country': 'UY'})

    • Using a list of dicts:

        request_with_cookies = Request(url="http://www.example.com",
                                       cookies=[{'name': 'currency',
                                                 'value': 'USD',
                                                 'domain': 'example.com',
                                                 'path': '/currency'}])

The latter form allows for customizing the domain and path attributes of the cookie. This is only useful if the cookies are saved for later requests.

When some site returns cookies (in a response) those are stored in the cookies for that domain and will be sent again in future requests. That’s the typical behaviour of any regular web browser.

To create a request that does not send stored cookies and does not store received cookies, set the dont_merge_cookies key to True in request.meta.

Example of a request that sends manually-defined cookies and ignores cookie storage:

    Request(
        url="http://www.example.com",
        cookies={'currency': 'USD', 'country': 'UY'},
        meta={'dont_merge_cookies': True},
    )

For more info see CookiesMiddleware.

  • encoding (string) – the encoding of this request (defaults to 'utf-8'). This encoding will be used to percent-encode the URL and to convert the body to bytes (if given as str).
  • priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low priority.
  • dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
  • errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Failure as first parameter. For more information, see Using errbacks to catch exceptions in request processing below.

Changed in version 2.0: The callback parameter is no longer required when the errback parameter is specified.

  • flags (list) – Flags sent to the request, can be used for logging or similar purposes.
  • cb_kwargs (dict) – A dict with arbitrary data that will be passed as keyword arguments to the Request’s callback.
  • url
  • A string containing the URL of this request. Keep in mind that this attribute contains the escaped URL, so it can differ from the URL passed in the __init__ method.

This attribute is read-only. To change the URL of a Request use replace().

  • method
  • A string representing the HTTP method in the request. This is guaranteed to be uppercase. Example: "GET", "POST", "PUT", etc.

  • headers

  • A dictionary-like object which contains the request headers.

  • body

  • The request body, as bytes.

This attribute is read-only. To change the body of a Request use replace().

  • meta
  • A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled.

See Request.meta special keys for a list of special meta keys recognized by Scrapy.

This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.meta attribute.

  • cb_kwargs
  • A dictionary that contains arbitrary metadata for this request. Its contents will be passed to the Request’s callback as keyword arguments. It is empty for new Requests, which means by default callbacks only get a Response object as argument.

This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.cb_kwargs attribute.

  • copy()
  • Return a new Request which is a copy of this Request. See also: Passing additional data to callback functions.

  • replace([url, method, headers, body, cookies, meta, flags, encoding, priority, dont_filter, callback, errback, cb_kwargs])

  • Return a Request object with the same members, except for those members given new values by whichever keyword arguments are specified. The Request.cb_kwargs and Request.meta attributes are shallow copied by default (unless new values are given as arguments). See also Passing additional data to callback functions.

  • classmethod from_curl(curl_command, ignore_unknown_options=True, **kwargs)

  • Create a Request object from a string containing a cURL command. It populates the HTTP method, the URL, the headers, the cookies and the body. It accepts the same arguments as the Request class, taking preference and overriding the values of the same arguments contained in the cURL command.

Unrecognized options are ignored by default. To raise an error when finding unknown options call this method by passing ignore_unknown_options=False.

Caution

Using from_curl() from Request subclasses, such as JsonRequest or XmlRpcRequest, as well as having downloader middlewares and spider middlewares enabled, such as DefaultHeadersMiddleware, UserAgentMiddleware, or HttpCompressionMiddleware, may modify the Request object.

Passing additional data to callback functions

The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument.

Example:

    def parse_page1(self, response):
        return scrapy.Request("http://www.example.com/some_page.html",
                              callback=self.parse_page2)

    def parse_page2(self, response):
        # this would log http://www.example.com/some_page.html
        self.logger.info("Visited %s", response.url)

In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. The following example shows how to achieve this by using the Request.cb_kwargs attribute:

    def parse(self, response):
        request = scrapy.Request('http://www.example.com/index.html',
                                 callback=self.parse_page2,
                                 cb_kwargs=dict(main_url=response.url))
        request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
        yield request

    def parse_page2(self, response, main_url, foo):
        yield dict(
            main_url=main_url,
            other_url=response.url,
            foo=foo,
        )

Caution

Request.cb_kwargs was introduced in version 1.7. Prior to that, using Request.meta was recommended for passing information around callbacks. After 1.7, Request.cb_kwargs became the preferred way for handling user information, leaving Request.meta for communication with components like middlewares and extensions.

Using errbacks to catch exceptions in request processing

The errback of a request is a function that will be called when an exception is raised while processing it.

It receives a Failure as first parameter and can be used to track connection establishment timeouts, DNS errors, etc.

Here’s an example spider logging all errors and catching some specific errors if needed:

    import scrapy

    from scrapy.spidermiddlewares.httperror import HttpError
    from twisted.internet.error import DNSLookupError
    from twisted.internet.error import TimeoutError, TCPTimedOutError

    class ErrbackSpider(scrapy.Spider):
        name = "errback_example"
        start_urls = [
            "http://www.httpbin.org/",              # HTTP 200 expected
            "http://www.httpbin.org/status/404",    # Not found error
            "http://www.httpbin.org/status/500",    # server issue
            "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
            "http://www.httphttpbinbin.org/",       # DNS error expected
        ]

        def start_requests(self):
            for u in self.start_urls:
                yield scrapy.Request(u, callback=self.parse_httpbin,
                                     errback=self.errback_httpbin,
                                     dont_filter=True)

        def parse_httpbin(self, response):
            self.logger.info('Got successful response from {}'.format(response.url))
            # do something useful here...

        def errback_httpbin(self, failure):
            # log all failures
            self.logger.error(repr(failure))

            # in case you want to do something special for some errors,
            # you may need the failure's type:

            if failure.check(HttpError):
                # these exceptions come from HttpError spider middleware
                # you can get the non-200 response
                response = failure.value.response
                self.logger.error('HttpError on %s', response.url)

            elif failure.check(DNSLookupError):
                # this is the original request
                request = failure.request
                self.logger.error('DNSLookupError on %s', request.url)

            elif failure.check(TimeoutError, TCPTimedOutError):
                request = failure.request
                self.logger.error('TimeoutError on %s', request.url)

Request.meta special keys

The Request.meta attribute can contain any arbitrary data, but there are some special keys recognized by Scrapy and its built-in extensions.

Those are:

bindaddress

The outgoing IP address to use for performing the request.

download_timeout

The amount of time (in secs) that the downloader will wait before timing out. See also: DOWNLOAD_TIMEOUT.

download_latency

The amount of time spent to fetch the response, since the request has been started, i.e. HTTP message sent over the network. This meta key only becomes available when the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.

download_fail_on_dataloss

Whether or not to fail on broken responses. See: DOWNLOAD_FAIL_ON_DATALOSS.

max_retry_times

This meta key is used to set the retry times per request. When initialized, the max_retry_times meta key takes higher precedence over the RETRY_TIMES setting.

Request subclasses

Here is the list of built-in Request subclasses. You can also subclass it to implement your own custom functionality.

FormRequest objects

The FormRequest class extends the base Request with functionality for dealing with HTML forms. It uses lxml.html forms to pre-populate form fields with form data from Response objects.

  • class scrapy.http.FormRequest(url[, formdata, ...])
  • The FormRequest class adds a new keyword parameter to the __init__ method. The remaining arguments are the same as for the Request class and are not documented here.

Parameters: formdata (dict or iterable of tuples) – a dictionary (or iterable of (key, value) tuples) containing HTML Form data which will be url-encoded and assigned to the body of the request.

The FormRequest objects support the following class method in addition to the standard Request methods:

  • classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])
  • Returns a new FormRequest object with its form field values pre-populated with those found in the HTML <form> element contained in the given response. For an example see Using FormRequest.from_response() to simulate a user login.

The policy is to automatically simulate a click, by default, on any form control that looks clickable, like an <input type="submit">. Even though this is quite convenient, and often the desired behaviour, sometimes it can cause problems which could be hard to debug. For example, when working with forms that are filled and/or submitted using javascript, the default from_response() behaviour may not be the most appropriate. To disable this behaviour you can set the dont_click argument to True. Also, if you want to change the control clicked (instead of disabling it) you can use the clickdata argument.

Caution

Using this method with select elements which have leading or trailing whitespace in the option values will not work due to a bug in lxml, which should be fixed in lxml 3.8 and above.

Parameters:

  • response (Response object) – the response containing an HTML form which will be used to pre-populate the form fields
  • formname (string) – if given, the form with name attribute set to this value will be used.
  • formid (string) – if given, the form with id attribute set to this value will be used.
  • formxpath (string) – if given, the first form that matches the xpath will be used.
  • formcss (string) – if given, the first form that matches the css selector will be used.
  • formnumber (integer) – the number of the form to use, when the response contains multiple forms. The first one (and also the default) is 0.
  • formdata (dict) – fields to override in the form data. If a field was already present in the response <form> element, its value is overridden by the one passed in this parameter. If a value passed in this parameter is None, the field will not be included in the request, even if it was present in the response <form> element.
  • clickdata (dict) – attributes to look up the control clicked. If it’s not given, the form data will be submitted simulating a click on the first clickable element. In addition to html attributes, the control can be identified by its zero-based index relative to other submittable inputs inside the form, via the nr attribute.
  • dont_click (boolean) – if True, the form data will be submitted without clicking on any element.

The other parameters of this class method are passed directly to the FormRequest __init__ method.

New in version 0.10.3: The formname parameter.

New in version 0.17: The formxpath parameter.

New in version 1.1.0: The formcss parameter.

New in version 1.1.0: The formid parameter.

Request usage examples

Using FormRequest to send data via HTTP POST

If you want to simulate an HTML Form POST in your spider and send a couple of key-value fields, you can return a FormRequest object (from your spider) like this:

    return [FormRequest(url="http://www.example.com/post/action",
                        formdata={'name': 'John Doe', 'age': '27'},
                        callback=self.after_post)]
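
What "url-encoded" means for the request body can be shown with the standard library alone. This is only an illustration of the encoding applied to formdata, not Scrapy's own code path.

```python
from urllib.parse import parse_qs, urlencode

formdata = {'name': 'John Doe', 'age': '27'}

# Keys and values are percent-encoded and joined with "&";
# spaces become "+" in this form encoding.
body = urlencode(formdata)

# Decoding the body recovers the original fields (values come back as lists).
decoded = parse_qs(body)
```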

Using FormRequest.from_response() to simulate a user login

It is usual for web sites to provide pre-populated form fields through <input type="hidden"> elements, such as session related data or authentication tokens (for login pages). When scraping, you’ll want these fields to be automatically pre-populated and only override a couple of them, such as the user name and password. You can use the FormRequest.from_response() method for this job. Here’s an example spider which uses it:

    import scrapy

    def authentication_failed(response):
        # TODO: Check the contents of the response and return True if it failed
        # or False if it succeeded.
        pass

    class LoginSpider(scrapy.Spider):
        name = 'example.com'
        start_urls = ['http://www.example.com/users/login.php']

        def parse(self, response):
            return scrapy.FormRequest.from_response(
                response,
                formdata={'username': 'john', 'password': 'secret'},
                callback=self.after_login
            )

        def after_login(self, response):
            if authentication_failed(response):
                self.logger.error("Login failed")
                return

            # continue scraping with authenticated session...

JsonRequest

The JsonRequest class extends the base Request class with functionality for dealing with JSON requests.

  • class scrapy.http.JsonRequest(url[, ... data, dumps_kwargs])
  • The JsonRequest class adds two new keyword parameters to the __init__ method. The remaining arguments are the same as for the Request class and are not documented here.

Using the JsonRequest will set the Content-Type header to application/json and the Accept header to application/json, text/javascript, */*; q=0.01

Parameters:

  • data (JSON serializable object) – any JSON serializable object that needs to be JSON encoded and assigned to body. If the Request.body argument is provided, this parameter will be ignored. If the Request.body argument is not provided and the data argument is provided, Request.method will be set to 'POST' automatically.
  • dumps_kwargs (dict) – parameters that will be passed to the underlying json.dumps method, which is used to serialize data into JSON format.

JsonRequest usage example

Sending a JSON POST request with a JSON payload:

    data = {
        'name1': 'value1',
        'name2': 'value2',
    }
    yield JsonRequest(url='http://www.example.com/post/action', data=data)

Response objects

  • class scrapy.http.Response(url, status=200, headers=None, body=b'', flags=None, request=None, certificate=None)
  • A Response object represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the Spiders for processing.

Parameters:

  • url (string) – the URL of this response
  • status (integer) – the HTTP status of the response. Defaults to 200.
  • headers (dict) – the headers of this response. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).
  • body (bytes) – the response body. To access the decoded text as str you can use response.text from an encoding-aware Response subclass, such as TextResponse.
  • flags (list) – a list containing the initial values for the Response.flags attribute. If given, the list will be shallow copied.
  • request (scrapy.http.Request) – the initial value of the Response.request attribute. This represents the Request that generated this response.
  • certificate (twisted.internet.ssl.Certificate) – an object representing the server’s SSL certificate.
  • url
  • A string containing the URL of the response.

This attribute is read-only. To change the URL of a Response use replace().

  • status
  • An integer representing the HTTP status of the response. Example: 200, 404.

  • headers

  • A dictionary-like object which contains the response headers. Values can be accessed using get() to return the first header value with the specified name or getlist() to return all header values with the specified name. For example, this call will give you all cookies in the headers:

        response.headers.getlist('Set-Cookie')

  • body
  • The body of this Response. Keep in mind that Response.body is always a bytes object. If you want the decoded str version use TextResponse.text (only available in TextResponse and subclasses).

This attribute is read-only. To change the body of a Response use replace().

  • request
  • The Request object that generated this response. This attribute is assigned in the Scrapy engine, after the response and the request have passed through all Downloader Middlewares. In particular, this means that:

    • HTTP redirections will cause the original request (to the URL before redirection) to be assigned to the redirected response (with the final URL after redirection).
    • Response.request.url doesn’t always equal Response.url
    • This attribute is only available in the spider code, and in the Spider Middlewares, but not in Downloader Middlewares (although you have the Request available there by other means) and handlers of the response_downloaded signal.
  • meta
  • A shortcut to the Request.meta attribute of the Response.request object (i.e. self.request.meta).

Unlike the Response.request attribute, the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.

See also

Request.meta attribute

  • cb_kwargs

New in version 2.0.

A shortcut to the Request.cb_kwargs attribute of the Response.request object (i.e. self.request.cb_kwargs).

Unlike the Response.request attribute, the Response.cb_kwargs attribute is propagated along redirects and retries, so you will get the original Request.cb_kwargs sent from your spider.

See also

Request.cb_kwargs attribute

  • flags
  • A list that contains flags for this response. Flags are labels used for tagging Responses. For example: 'cached', 'redirected', etc. They are shown in the string representation of the Response (__str__ method), which is used by the engine for logging.

  • certificate

  • A twisted.internet.ssl.Certificate object representing the server’s SSL certificate.

Only populated for https responses, None otherwise.

  • copy()
  • Returns a new Response which is a copy of this Response.

  • replace([url, status, headers, body, request, flags, cls])

  • Returns a Response object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Response.meta is copied by default.

  • urljoin(url)

  • Constructs an absolute url by combining the Response’s url with a possible relative url.

This is a wrapper over urllib.parse.urljoin; it’s merely an alias for making this call:

        urllib.parse.urljoin(response.url, url)
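
The joining rules can be tried with the standard library function directly (urllib.parse.urljoin in Python 3). The base URL below stands in for response.url and is only a placeholder.

```python
from urllib.parse import urljoin

base = "http://www.example.com/catalog/page1.html"  # plays the role of response.url

sibling = urljoin(base, "page2.html")                   # relative to the current directory
rooted = urljoin(base, "/about")                        # relative to the site root
absolute = urljoin(base, "http://other.example.org/x")  # already absolute, kept as-is
```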
  • follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None)
  • Return a Request instance to follow a link url. It accepts the same arguments as the Request.__init__ method, but url can be a relative URL or a scrapy.link.Link object, not only an absolute URL.

TextResponse provides a follow() method which supports selectors in addition to absolute/relative URLs and Link objects.

New in version 2.0: The flags parameter.

  • follow_all(urls, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None)

New in version 2.0.

Return an iterable of Request instances to follow all links in urls. It accepts the same arguments as the Request.__init__ method, but elements of urls can be relative URLs or Link objects, not only absolute URLs.

TextResponse provides a follow_all() method which supports selectors in addition to absolute/relative URLs and Link objects.

Response subclasses

Here is the list of available built-in Response subclasses. You can also subclass the Response class to implement your own functionality.

TextResponse objects

  • class scrapy.http.TextResponse(url[, encoding[, ...]])
  • TextResponse objects add encoding capabilities to the base Response class, which is meant to be used only for binary data, such as images, sounds or any media file.

TextResponse objects support a new __init__ method argument, in addition to the base Response objects. The remaining functionality is the same as for the Response class and is not documented here.

Parameters: encoding (string) – a string which contains the encoding to use for this response. If you create a TextResponse object with a str body, it will be encoded to bytes using this encoding (remember the body attribute is always bytes). If encoding is None (default value), the encoding will be looked up in the response headers and body instead.

TextResponse objects support the following attributes in addition to the standard Response ones:

  • text
  • Response body, as str.

The same as response.body.decode(response.encoding), but the result is cached after the first call, so you can access response.text multiple times without extra overhead.

Note

str(response.body) is not a correct way to convert the response body to text: it returns the printable representation of the bytes object (such as "b'...'") instead of decoding it with the response encoding.
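
The pitfall described in this note can be reproduced with plain Python 3, no Scrapy required. The bytes below stand in for response.body, and UTF-8 is assumed as the response encoding.

```python
body = "café".encode("utf-8")  # what response.body would hold: bytes

# str() on bytes gives the repr of the bytes object, not the decoded text.
wrong = str(body)

# Decoding with the response encoding gives the actual text.
right = body.decode("utf-8")
```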

  • encoding
  • A string with the encoding of this response. The encoding is resolved by trying the following mechanisms, in order:

    • the encoding passed in the __init__ method encoding argument
    • the encoding declared in the Content-Type HTTP header. If this encoding is not valid (i.e. unknown), it is ignored and the next resolution mechanism is tried.
    • the encoding declared in the response body. The TextResponse class doesn’t provide any special functionality for this. However, the HtmlResponse and XmlResponse classes do.
    • the encoding inferred by looking at the response body. This is the most fragile method but also the last one tried.
  • selector
  • A Selector instance using the response as target. The selector is lazily instantiated on first access.

TextResponse objects support the following methods in addition to the standard Response ones:

  • xpath(query)
  • A shortcut to TextResponse.selector.xpath(query):

        response.xpath('//p')

  • css(query)
  • A shortcut to TextResponse.selector.css(query):

        response.css('p')

  • follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding=None, priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None)
  • Return a Request instance to follow a link url. It accepts the same arguments as the Request.__init__ method, but url can be not only an absolute URL, but also any of the forms listed for elements of urls in follow_all() below: a relative URL, a Link object, a Selector object for a <link> or <a> element, or an attribute Selector.

  • follow_all(urls=None, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding=None, priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None, css=None, xpath=None)

  • A generator that produces Request instances to follow all links in urls. It accepts the same arguments as the Request’s __init__ method, except that each urls element does not need to be an absolute URL, it can be any of the following:

    • a relative URL
    • a Link object, e.g. the result of Link Extractors
    • a Selector object for a <link> or <a> element, e.g. response.css('a.my_link')[0]
    • an attribute Selector (not SelectorList), e.g. response.css('a::attr(href)')[0] or response.xpath('//img/@src')[0]

In addition, css and xpath arguments are accepted to perform the link extraction within the follow_all method (only one of urls, css and xpath is accepted).

Note that when passing a SelectorList as argument for the urls parameter or using the css or xpath parameters, this method will not produce requests for selectors from which links cannot be obtained (for instance, anchor tags without an href attribute).

  • body_as_unicode()
  • The same as text, but available as a method. This method is kept for backward compatibility; please prefer response.text.

HtmlResponse objects

XmlResponse objects