A link extractor is an object that extracts links from responses.

The __init__ method of LxmlLinkExtractor takes settings that determine which links may be extracted. LxmlLinkExtractor.extract_links returns a list of matching scrapy.link.Link objects from a Response object.

The link extractor class is scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor. For convenience it can also be imported as scrapy.linkextractors.LinkExtractor:

from scrapy.linkextractors import LinkExtractor
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href',), canonicalize=False, unique=True, process_value=None, strip=True)
LxmlLinkExtractor is the recommended link extractor with handy filtering options. It is implemented using lxml’s robust HTMLParser.
allow (str or list) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be extracted. If not given (or empty), it will match all links.
deny (str or list) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter. If not given (or empty), it won't exclude any links.
deny_extensions (list) – a single value or list of strings containing extensions that should be ignored when extracting links. If not given, it defaults to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractors package.

Changed in version 2.0: the default IGNORED_EXTENSIONS list was extended.
restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links. See examples below.
restrict_text (str or list) – a single regular expression (or list of regular expressions) that the link's text must match in order to be extracted. If not given (or empty), it will match all links. If a list of regular expressions is given, the link will be extracted if it matches at least one.
attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',).
canonicalize (bool) – canonicalize each extracted URL (using w3lib.url.canonicalize_url). Defaults to False. Note that canonicalize_url is meant for duplicate checking; it can change the URL visible at server side, so the response can be different for requests with canonicalized and raw URLs. If you're using LinkExtractor to follow links it is more robust to keep the default canonicalize=False.

unique (bool) – whether duplicate filtering should be applied to extracted links.
process_value (callable) – a function which receives each value extracted from the scanned tags and attributes and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.
For example, a link whose href is a JavaScript call such as javascript:goToPage('../other/page.html'); return false can be reduced to its embedded URL by a process_value function.
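A sketch of such a process_value function; the goToPage pattern is taken from the example above, and the choice to pass non-matching values through unchanged is an assumption of this sketch:

```python
import re


def process_value(value):
    # Pull the real URL out of a javascript:goToPage('...') handler.
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)
    # Pass ordinary href values through unchanged so normal links
    # are still extracted; returning None instead would drop them.
    return value
```

Passing this callable as LinkExtractor(process_value=process_value) makes the extractor yield '../other/page.html' for the link above.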
strip (bool) – whether to strip whitespace from extracted attributes. According to the HTML5 standard, leading and trailing whitespace must be stripped from the href attribute of <a>, <area> and many other elements, the src attribute of <img>, <iframe> and other elements, etc., so LinkExtractor strips space characters by default. Set strip=False to turn it off (e.g. if you're extracting URLs from elements or attributes that allow leading/trailing whitespace).
extract_links(response)

Returns a list of Link objects from the specified response.

Only links that match the settings passed to the __init__ method of the link extractor are returned. Duplicate links are omitted.