Item Loaders

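Item Loaders provide a convenient mechanism for populating scraped Items. Even though Items can be populated using their own dictionary-like API, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks such as parsing the raw extracted data before assigning it. In other words, Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container.

Item Loaders are designed to provide a flexible, efficient and easy mechanism for extending and overriding field parsing rules, either by spider or by source format (HTML, XML, etc.), without becoming a nightmare to maintain.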

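Using Item Loaders to populate items

To use an Item Loader, you must first instantiate it. You can instantiate it either with a dict-like object (e.g. an Item or a dict) or without one, in which case an Item is automatically instantiated in the Item Loader constructor using the Item class specified in the ItemLoader.default_item_class attribute.

Then, you start collecting values into the Item Loader, typically using Selectors. You can add more than one value to the same item field; the Item Loader will know how to "join" those values later using a proper processing function.

Here is a typical Item Loader usage in a Spider, using the Product item declared in the Items chapter: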

    from scrapy.loader import ItemLoader
    from myproject.items import Product

    def parse(self, response):
        l = ItemLoader(item=Product(), response=response)
        l.add_xpath('name', '//div[@class="product_name"]')
        l.add_xpath('name', '//div[@class="product_title"]')
        l.add_xpath('price', '//p[@id="price"]')
        l.add_css('stock', 'p#stock')
        l.add_value('last_updated', 'today')  # you can also use literal values
        return l.load_item()

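By quickly looking at that code, we can see the name field is being extracted from two different XPath locations in the page:

  • //div[@class="product_name"]
  • //div[@class="product_title"]

In other words, data is being collected by extracting it from two XPath locations, using the add_xpath() method. This is the data that will be assigned to the name field later.

Afterwards, similar calls are used for the price and stock fields (the latter using a CSS selector with the add_css() method), and finally the last_updated field is populated directly with a literal value (today) using a different method: add_value().

Finally, when all data is collected, the ItemLoader.load_item() method is called, which actually returns the item populated with the data previously extracted and collected with the add_xpath(), add_css(), and add_value() calls.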

Input and Output processors

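An Item Loader contains one input processor and one output processor for each (item) field. The input processor processes the extracted data as soon as it’s received (through the add_xpath(), add_css() or add_value() methods), and the result of the input processor is collected and kept inside the ItemLoader. After collecting all data, the ItemLoader.load_item() method is called to populate and get the populated Item object. That’s when the output processor is called with the data previously collected (and processed using the input processor). The result of the output processor is the final value that gets assigned to the item.

Let’s see an example to illustrate how the input and output processors are called for a particular field (the same applies to any other field):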

    l = ItemLoader(Product(), some_selector)
    l.add_xpath('name', xpath1)  # (1)
    l.add_xpath('name', xpath2)  # (2)
    l.add_css('name', css)  # (3)
    l.add_value('name', 'test')  # (4)
    return l.load_item()  # (5)

So what happens is:

  • (1) Data from xpath1 is extracted and passed through the input processor of the name field. The result of the input processor is collected and kept in the Item Loader (but not yet assigned to the item).
  • (2) Data from xpath2 is extracted and passed through the same input processor used in (1). The result of the input processor is appended to the data collected in (1) (if any).
  • (3) This case is similar to the previous ones, except that the data is extracted from the css CSS selector, and passed through the same input processor used in (1) and (2). The result of the input processor is appended to the data collected in (1) and (2) (if any).
  • (4) This case is also similar to the previous ones, except that the value to be collected is assigned directly, instead of being extracted from an XPath expression or a CSS selector. However, the value is still passed through the input processor. In this case, since the value is not iterable, it is converted to an iterable of a single element before being passed to the input processor, because input processors always receive iterables.
  • (5) The data collected in steps (1), (2), (3) and (4) is passed through the output processor of the name field. The result of the output processor is the value assigned to the name field in the item.

It’s worth noticing that processors are just callable objects, which are called with the data to be parsed and return a parsed value. So you can use any function as an input or output processor. The only requirement is that it must accept one (and only one) positional argument, which will be an iterable.

Note

Both input and output processors must receive an iterable as their first argument. The output of those functions can be anything. The result of input processors will be appended to an internal list (in the Loader) containing the collected values (for that field). The result of the output processors is the value that will be finally assigned to the item.

The other thing you need to keep in mind is that the values returned by input processors are collected internally (in lists) and then passed to output processors to populate the fields.

Last, but not least, Scrapy comes with some commonly used processors built-in for convenience.
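The collect-then-output flow described in this section can be sketched in plain Python, without Scrapy. The TinyLoader class, the to_list helper and the strip_all processor below are hypothetical names used only for illustration; they mimic the behaviour described above and are not Scrapy’s implementation.

```python
from collections import defaultdict


def to_list(value):
    """Wrap a single value into a one-element list; keep lists as they are."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


class TinyLoader:
    """Hypothetical stand-in for an Item Loader (not Scrapy's code)."""

    def __init__(self, input_processor, output_processor):
        self.input_processor = input_processor
        self.output_processor = output_processor
        self._values = defaultdict(list)  # collected values, one list per field

    def add_value(self, field, value):
        # The input processor runs immediately; its result is appended.
        processed = self.input_processor(to_list(value))
        self._values[field].extend(to_list(processed))

    def load_item(self):
        # The output processor runs once per field, on everything collected.
        return {field: self.output_processor(values)
                for field, values in self._values.items()}


def strip_all(values):
    return [v.strip() for v in values]


loader = TinyLoader(input_processor=strip_all, output_processor=" ".join)
loader.add_value("name", ["  Plasma "])  # like step (1)
loader.add_value("name", ["TV  "])       # like step (2)
loader.add_value("name", "test")         # like step (4): single value is wrapped
item = loader.load_item()                # like step (5)
print(item)  # {'name': 'Plasma TV test'}
```

Note that nothing is assigned to the resulting item until load_item() runs; before that, values only accumulate in the internal per-field lists.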

Declaring Item Loaders

Item Loaders are declared like Items, by using a class definition syntax. Here is an example:

    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import TakeFirst, MapCompose, Join

    class ProductLoader(ItemLoader):

        default_output_processor = TakeFirst()

        name_in = MapCompose(str.title)
        name_out = Join()

        price_in = MapCompose(str.strip)

        # ...

As you can see, input processors are declared using the _in suffix while output processors are declared using the _out suffix. You can also declare default input/output processors using the ItemLoader.default_input_processor and ItemLoader.default_output_processor attributes.

Declaring Input and Output Processors

As seen in the previous section, input and output processors can be declared in the Item Loader definition, and it’s very common to declare input processors this way. However, there is one more place where you can specify the input and output processors to use: in the Item Field metadata. Here is an example:

    import scrapy

    from scrapy.loader.processors import Join, MapCompose, TakeFirst
    from w3lib.html import remove_tags

    def filter_price(value):
        # Keep only purely numeric strings; anything else is dropped (None).
        if value.isdigit():
            return value

    class Product(scrapy.Item):
        name = scrapy.Field(
            input_processor=MapCompose(remove_tags),
            output_processor=Join(),
        )
        price = scrapy.Field(
            input_processor=MapCompose(remove_tags, filter_price),
            output_processor=TakeFirst(),
        )

    >>> from scrapy.loader import ItemLoader
    >>> il = ItemLoader(item=Product())
    >>> il.add_value('name', ['Welcome to my', '<strong>website</strong>'])
    >>> il.add_value('price', ['&euro;', '<span>1000</span>'])
    >>> il.load_item()
    {'name': 'Welcome to my website', 'price': '1000'}

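The precedence order, for both input and output processors, is as follows:

  • Item Loader field-specific attributes: field_in and field_out (highest precedence)
  • Field metadata (input_processor and output_processor keys)
  • Item Loader defaults: ItemLoader.default_input_processor and ItemLoader.default_output_processor (lowest precedence)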
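As a rough illustration of how such a lookup could resolve an input processor, the plain-Python sketch below checks the field-specific loader attribute first, then the field metadata, then the loader default. The resolve_input_processor function, the DemoLoader class and the plain dict used as metadata are hypothetical stand-ins, not Scrapy’s code.

```python
def identity(values):
    return values


def upper_all(values):
    return [v.upper() for v in values]


def strip_all(values):
    return [v.strip() for v in values]


def resolve_input_processor(loader_cls, field_name, field_meta):
    """Pick the input processor for a field (simplified, hypothetical)."""
    # 1. Field-specific loader attribute, e.g. `name_in`.
    specific = getattr(loader_cls, field_name + "_in", None)
    if specific is not None:
        return specific
    # 2. Field metadata key `input_processor`.
    if "input_processor" in field_meta:
        return field_meta["input_processor"]
    # 3. Loader-wide default.
    return loader_cls.default_input_processor


class DemoLoader:
    # staticmethod keeps the functions unbound when read from the class.
    default_input_processor = staticmethod(identity)
    name_in = staticmethod(upper_all)


meta = {"input_processor": strip_all}
by_attr = resolve_input_processor(DemoLoader, "name", meta)    # name_in wins
by_meta = resolve_input_processor(DemoLoader, "price", meta)   # metadata wins
by_default = resolve_input_processor(DemoLoader, "stock", {})  # falls back
```

The same three-step lookup would apply to output processors with the _out suffix and the output_processor key.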

Item Loader Context

The Item Loader Context is a dict of arbitrary key/values which is shared among all input and output processors in the Item Loader. It can be passed when declaring, instantiating or using an Item Loader. Its values are used to modify the behaviour of the input/output processors.

For example, suppose you have a function parse_length which receives a text value and extracts a length from it:

    def parse_length(text, loader_context):
        unit = loader_context.get('unit', 'm')
        # ... length parsing code goes here ...
        return parsed_length

By accepting a loader_context argument, the function is explicitly telling the Item Loader that it’s able to receive an Item Loader context, so the Item Loader passes the currently active context when calling it, and the processor function (parse_length in this case) can thus use its values.

There are several ways to modify Item Loader context values:

  • By modifying the currently active Item Loader context (context attribute):

        loader = ItemLoader(product)
        loader.context['unit'] = 'cm'

  • On Item Loader instantiation (the keyword arguments of the Item Loader constructor are stored in the Item Loader context):

        loader = ItemLoader(product, unit='cm')

  • On Item Loader declaration, for those input/output processors that support instantiating them with an Item Loader context. MapCompose is one of them:

        class ProductLoader(ItemLoader):
            length_out = MapCompose(parse_length, unit='cm')
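One way a loader could detect that a function accepts a loader_context argument is by inspecting its signature. The sketch below is a plain-Python illustration under that assumption: bind_context and shout are hypothetical helpers, the body of parse_length is an invented parsing rule, and Scrapy’s own wrapper may work differently.

```python
import inspect
import re


def bind_context(func, context):
    """Return a one-argument callable; pass `context` only if `func` asks for it."""
    if "loader_context" in inspect.signature(func).parameters:
        return lambda value: func(value, loader_context=context)
    return func


def parse_length(text, loader_context):
    # Hypothetical rule: take the first number in the text and attach the unit.
    unit = loader_context.get("unit", "m")
    number = re.search(r"\d+(?:\.\d+)?", text).group()
    return f"{number} {unit}"


def shout(text):
    # A processor that does not declare loader_context is left untouched.
    return text.upper()


in_cm = bind_context(parse_length, {"unit": "cm"})
in_default = bind_context(parse_length, {})

print(in_cm("length: 100"))        # 100 cm
print(in_default("12.5 long"))     # 12.5 m
print(bind_context(shout, {"unit": "cm"})("tv"))  # TV
```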

ItemLoader objects

class scrapy.loader.ItemLoader([item, selector, response, ]**kwargs)

Return a new Item Loader for populating the given Item. If no item is given, one is instantiated automatically using the class in default_item_class.

When instantiated with a selector or a response parameter, the ItemLoader class provides convenient mechanisms for extracting data from web pages using selectors.

Parameters:
- item (Item object) – The item instance to populate using subsequent calls to add_xpath(), add_css(), or add_value().
- selector (Selector object) – The selector to extract data from, when using the add_xpath() (resp. add_css()) or replace_xpath() (resp. replace_css()) method.
- response (Response object) – The response used to construct the selector using the default_selector_class, unless the selector argument is given, in which case this argument is ignored.

The item, selector, response and the remaining keyword arguments are assigned to the Loader context (accessible through the context attribute).

ItemLoader instances have the following methods:
get_value(value, *processors, **kwargs)

Process the given value by the given processors and keyword arguments.

Available keyword arguments:

Parameters: re (str or compiled regex) – a regular expression to use for extracting data from the given value using the extract_regex() method, applied before processors

Examples:

    >>> from scrapy.loader.processors import TakeFirst
    >>> loader.get_value('name: foo', TakeFirst(), str.upper, re='name: (.+)')
    'FOO'
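The order of operations in get_value() (regular expression first, then each processor in turn) can be approximated in plain Python. The get_value_sketch and first functions below are hypothetical simplifications: the regex keyword stands in for re, the pattern is assumed to contain exactly one capture group, and stopping on None is an assumption of this sketch rather than a documented guarantee.

```python
import re


def get_value_sketch(value, *processors, regex=None):
    """Simplified: apply `regex` first, then each processor in order."""
    if regex is not None:
        values = value if isinstance(value, (list, tuple)) else [value]
        # Collect the first capture group of every match in every value.
        value = [m.group(1) for v in values for m in re.finditer(regex, v)]
    for proc in processors:
        if value is None:  # nothing left to process
            break
        value = proc(value)
    return value


def first(values):
    # Minimal helper for the demo: return the first collected element.
    return values[0]


result = get_value_sketch("name: foo", first, str.upper, regex=r"name: (.+)")
print(result)  # FOO
```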



add_value(field_name, value, *processors, **kwargs)

Process and then add the given value for the given field.

The value is first passed through get_value() by giving the processors and kwargs, and then passed through the field input processor and its result appended to the data collected for that field. If the field already contains collected data, the new data is added.

The given field_name can be None, in which case values for multiple fields may be added. In that case the processed value should be a dict with field names mapped to values.

Examples:

    loader.add_value('name', 'Color TV')
    loader.add_value('colours', ['white', 'blue'])
    loader.add_value('length', '100')
    loader.add_value('name', 'name: foo', TakeFirst(), re='name: (.+)')
    loader.add_value(None, {'name': 'foo', 'sex': 'male'})



replace_value(field_name, value, *processors, **kwargs)

Similar to add_value() but replaces the collected data with the new value instead of adding it.
get_xpath(xpath, *processors, **kwargs)

Similar to ItemLoader.get_value() but receives an XPath instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.

Parameters:
- xpath (str) – the XPath to extract data from
- re (str or compiled regex) – a regular expression to use for extracting data from the selected XPath region

Examples:

    # HTML snippet: <p class="product-name">Color TV</p>
    loader.get_xpath('//p[@class="product-name"]')
    # HTML snippet: <p id="price">the price is $1200</p>
    loader.get_xpath('//p[@id="price"]', TakeFirst(), re='the price is (.*)')



add_xpath(field_name, xpath, *processors, **kwargs)

Similar to ItemLoader.add_value() but receives an XPath instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.

See get_xpath() for kwargs.

Parameters: xpath (str) – the XPath to extract data from

Examples:

    # HTML snippet: <p class="product-name">Color TV</p>
    loader.add_xpath('name', '//p[@class="product-name"]')
    # HTML snippet: <p id="price">the price is $1200</p>
    loader.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')



replace_xpath(field_name, xpath, *processors, **kwargs)

Similar to add_xpath() but replaces collected data instead of adding it.
get_css(css, *processors, **kwargs)

Similar to ItemLoader.get_value() but receives a CSS selector instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.

Parameters:
- css (str) – the CSS selector to extract data from
- re (str or compiled regex) – a regular expression to use for extracting data from the selected CSS region

Examples:

    # HTML snippet: <p class="product-name">Color TV</p>
    loader.get_css('p.product-name')
    # HTML snippet: <p id="price">the price is $1200</p>
    loader.get_css('p#price', TakeFirst(), re='the price is (.*)')



add_css(field_name, css, *processors, **kwargs)

Similar to ItemLoader.add_value() but receives a CSS selector instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.

See get_css() for kwargs.

Parameters: css (str) – the CSS selector to extract data from

Examples:

    # HTML snippet: <p class="product-name">Color TV</p>
    loader.add_css('name', 'p.product-name')
    # HTML snippet: <p id="price">the price is $1200</p>
    loader.add_css('price', 'p#price', re='the price is (.*)')



replace_css(field_name, css, *processors, **kwargs)

Similar to add_css() but replaces collected data instead of adding it.
load_item()

Populate the item with the data collected so far, and return it. The data collected is first passed through the output processors to get the final value to assign to each item field.
get_collected_values(field_name)

Return the collected values for the given field.
get_output_value(field_name)

Return the collected values parsed using the output processor, for the given field. This method doesn’t populate or modify the item at all.
get_input_processor(field_name)

Return the input processor for the given field.
get_output_processor(field_name)

Return the output processor for the given field.

ItemLoader instances have the following attributes:
item

The Item object being parsed by this Item Loader.
context

The currently active Context of this Item Loader.
default_item_class

An Item class (or factory), used to instantiate items when not given in the constructor.
default_input_processor

The default input processor to use for those fields which don’t specify one.
default_output_processor

The default output processor to use for those fields which don’t specify one.
default_selector_class

The class used to construct the selector of this ItemLoader, if only a response is given in the constructor. If a selector is given in the constructor this attribute is ignored. This attribute is sometimes overridden in subclasses.
selector

The Selector object to extract data from. It’s either the selector given in the constructor or one created from the response given in the constructor using the default_selector_class. This attribute is meant to be read-only.

Reusing and extending Item Loaders

As your project grows bigger and acquires more and more spiders, maintenance becomes a fundamental problem, especially when you have to deal with many different parsing rules for each spider, with a lot of exceptions, while still wanting to reuse the common processors.

Item Loaders are designed to ease the maintenance burden of parsing rules without losing flexibility, while at the same time providing a convenient mechanism for extending and overriding them. For this reason Item Loaders support traditional Python class inheritance for dealing with differences between specific spiders (or groups of spiders).

Suppose, for example, that some particular site encloses its product names in three dashes (e.g. ---Plasma TV---) and you don’t want to end up scraping those dashes in the final product names.

Here’s how you can remove those dashes by reusing and extending the default Product Item Loader (ProductLoader):

    from scrapy.loader.processors import MapCompose
    from myproject.ItemLoaders import ProductLoader

    def strip_dashes(x):
        return x.strip('-')

    class SiteSpecificLoader(ProductLoader):
        name_in = MapCompose(strip_dashes, ProductLoader.name_in)

Another case where extending Item Loaders can be very helpful is when you have multiple source formats, for example XML and HTML. In the XML version you may want to remove CDATA occurrences. Here’s an example of how to do it:

    from scrapy.loader.processors import MapCompose
    from myproject.ItemLoaders import ProductLoader
    from myproject.utils.xml import remove_cdata

    class XmlProductLoader(ProductLoader):
        name_in = MapCompose(remove_cdata, ProductLoader.name_in)

And that’s how you typically extend input processors.
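The inheritance pattern itself can be tried out without Scrapy. In the sketch below, map_each is a deliberately tiny, hypothetical stand-in for MapCompose (no flattening, no None handling), and BaseLoader and SiteLoader are illustrative classes rather than real Item Loaders. The subclass overrides only name_in and reuses the parent’s processor after its own site-specific step.

```python
def map_each(*functions):
    """Tiny stand-in for MapCompose: apply each function to every value."""
    def processor(values):
        for func in functions:
            values = [func(v) for v in values]
        return values
    return processor


def strip_dashes(x):
    return x.strip("-")


class BaseLoader:
    # staticmethod keeps the processors callable directly from the class.
    name_in = staticmethod(map_each(str.title))
    price_in = staticmethod(map_each(str.strip))


class SiteLoader(BaseLoader):
    @staticmethod
    def name_in(values):
        # Site-specific step first, then whatever the parent declares.
        return BaseLoader.name_in([strip_dashes(v) for v in values])


print(BaseLoader.name_in(["---plasma tv---"]))  # ['---Plasma Tv---']
print(SiteLoader.name_in(["---plasma tv---"]))  # ['Plasma Tv']
print(SiteLoader.price_in is BaseLoader.price_in)  # True: inherited unchanged
```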

As for output processors, it is more common to declare them in the field metadata, as they usually depend only on the field and not on each specific site parsing rule (as input processors do). See also: Declaring Input and Output Processors.

There are many other possible ways to extend, inherit and override your Item Loaders, and different Item Loader hierarchies may fit better for different projects. Scrapy only provides the mechanism; it doesn’t impose any specific organization of your Loaders collection - that’s up to you and your project’s needs.

Available built-in processors

Even though you can use any callable function as input and output processors, Scrapy provides some commonly used processors, which are described below. Some of them, like MapCompose (which is typically used as an input processor), compose the output of several functions executed in order, to produce the final parsed value.

Here is a list of all built-in processors:

class scrapy.loader.processors.Identity

The simplest processor, which doesn’t do anything. It returns the original values unchanged. It doesn’t receive any constructor arguments, nor does it accept Loader contexts.

Example:

    >>> from scrapy.loader.processors import Identity
    >>> proc = Identity()
    >>> proc(['one', 'two', 'three'])
    ['one', 'two', 'three']



class scrapy.loader.processors.TakeFirst

Returns the first non-null/non-empty value from the values received, so it’s typically used as an output processor for single-valued fields. It doesn’t receive any constructor arguments, nor does it accept Loader contexts.

Example:

    >>> from scrapy.loader.processors import TakeFirst
    >>> proc = TakeFirst()
    >>> proc(['', 'one', 'two', 'three'])
    'one'
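A plain-Python approximation of this behaviour is shown below. The take_first function is a hypothetical equivalent written for illustration; it assumes that "non-null/non-empty" means "neither None nor the empty string", so other falsy values such as 0 are still returned.

```python
def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None  # nothing usable was collected


print(take_first(["", "one", "two", "three"]))  # one
print(take_first([None, "", 0, "x"]))           # 0
print(take_first([None, ""]))                   # None
```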



class scrapy.loader.processors.Join(separator=' ')

Returns the values joined with the separator given in the constructor, which defaults to ' '. It doesn’t accept Loader contexts.

When using the default separator, this processor is equivalent to the function: ' '.join

Examples:

    >>> from scrapy.loader.processors import Join
    >>> proc = Join()
    >>> proc(['one', 'two', 'three'])
    'one two three'
    >>> proc = Join('<br>')
    >>> proc(['one', 'two', 'three'])
    'one<br>two<br>three'



class scrapy.loader.processors.Compose(*functions, **default_loader_context)

A processor which is constructed from the composition of the given functions. This means that each input value of this processor is passed to the first function, and the result of that function is passed to the second function, and so on, until the last function returns the output value of this processor.

By default, processing stops on a None value. This behaviour can be changed by passing the keyword argument stop_on_none=False.

Example:

    >>> from scrapy.loader.processors import Compose
    >>> proc = Compose(lambda v: v[0], str.upper)
    >>> proc(['hello', 'world'])
    'HELLO'




Each function can optionally receive a loader_context parameter. For those which do, this processor will pass the currently active Loader context through that parameter.

The keyword arguments passed in the constructor are used as the default Loader context values passed to each function call. However, the final Loader context values passed to functions are overridden with the currently active Loader context accessible through the ItemLoader.context attribute.
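The chaining and the stop-on-None behaviour of Compose can be sketched in plain Python as follows. The compose function is a hypothetical, simplified equivalent that ignores Loader contexts; it is meant to show the control flow only.

```python
def compose(*functions, stop_on_none=True):
    """Chain functions: each one receives the previous function's result."""
    def processor(value):
        for func in functions:
            if value is None and stop_on_none:
                break  # halt the chain instead of calling func(None)
            value = func(value)
        return value
    return processor


proc = compose(lambda v: v[0], str.upper)
print(proc(["hello", "world"]))  # HELLO

# The first function returns None, so str.upper is never called.
safe = compose(lambda v: None, str.upper)
print(safe(["hello"]))  # None
```

With stop_on_none=False the second chain would call str.upper(None) and raise a TypeError, which is why stopping early is a sensible default for this kind of pipeline.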
class scrapy.loader.processors.MapCompose(*functions, **default_loader_context)

A processor which is constructed from the composition of the given functions, similar to the Compose processor. The difference with this processor is the way internal results are passed among functions, which is as follows:

The input value of this processor is iterated and the first function is applied to each element. The results of these function calls (one for each element) are concatenated to construct a new iterable, which is then used to apply the second function, and so on, until the last function is applied to each value of the list of values collected so far. The output values of the last function are concatenated together to produce the output of this processor.

Each particular function can return a value or a list of values, which is flattened with the list of values returned by the same function applied to the other input values. The functions can also return None, in which case the output of that function is ignored for further processing over the chain.

This processor provides a convenient way to compose functions that only work with single values (instead of iterables). For this reason the MapCompose processor is typically used as an input processor, since data is often extracted using the extract() method of selectors, which returns a list of unicode strings.

The example below should clarify how it works:




    >>> def filter_world(x):
    ...     return None if x == 'world' else x
    ...
    >>> from scrapy.loader.processors import MapCompose
    >>> proc = MapCompose(filter_world, str.upper)
    >>> proc(['hello', 'world', 'this', 'is', 'scrapy'])
    ['HELLO', 'THIS', 'IS', 'SCRAPY']




As with the Compose processor, functions can receive Loader contexts, and constructor keyword arguments are used as default context values. See the Compose processor for more info.
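The per-element mapping, flattening and None-dropping described for MapCompose can be approximated in plain Python. The map_compose function below is a hypothetical, simplified equivalent that ignores Loader contexts; it only illustrates how values flow from one function to the next.

```python
def map_compose(*functions):
    """Apply each function to every value, flattening lists and dropping None."""
    def processor(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                if result is None:
                    continue  # ignored for the rest of the chain
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)  # flattened into the output
                else:
                    next_values.append(result)
            values = next_values
        return list(values)
    return processor


def filter_world(x):
    return None if x == "world" else x


proc = map_compose(filter_world, str.upper)
print(proc(["hello", "world", "this", "is", "scrapy"]))
# ['HELLO', 'THIS', 'IS', 'SCRAPY']

# A function returning a list contributes several values to the next step.
split_words = map_compose(str.split, str.title)
print(split_words(["plasma tv", "color tv"]))
# ['Plasma', 'Tv', 'Color', 'Tv']
```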
class scrapy.loader.processors.SelectJmes(json_path)

Queries the value using the JSON path provided to the constructor and returns the output. Requires jmespath (https://github.com/jmespath/jmespath.py) to run. This processor takes only one input at a time.

Example:




    >>> from scrapy.loader.processors import SelectJmes, Compose, MapCompose
    >>> proc = SelectJmes("foo")  # for direct use on lists and dictionaries
    >>> proc({'foo': 'bar'})
    'bar'
    >>> proc({'foo': {'bar': 'baz'}})
    {'bar': 'baz'}




Working with JSON:

    >>> import json
    >>> proc_single_json_str = Compose(json.loads, SelectJmes("foo"))
    >>> proc_single_json_str('{"foo": "bar"}')
    'bar'
    >>> proc_json_list = Compose(json.loads, MapCompose(SelectJmes('foo')))
    >>> proc_json_list('[{"foo":"bar"}, {"baz":"tar"}]')
    ['bar']