Items

The main goal in scraping is to extract structured data from unstructured sources, typically web pages. Scrapy spiders can return the extracted data as Python dicts. While convenient and familiar, Python dicts lack structure: it is easy to make a typo in a field name or return inconsistent data, especially in a larger project with many spiders.

To define a common output data format, Scrapy provides the Item class. Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.

Various Scrapy components use extra information provided by Items: exporters look at declared fields to figure out columns to export, serialization can be customized using Item field metadata, trackref tracks Item instances to help find memory leaks (see Debugging memory leaks with trackref), etc.

Declaring Items

Items are declared using a simple class definition syntax and Field objects. Here is an example:

    import scrapy

    class Product(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field()
        stock = scrapy.Field()
        tags = scrapy.Field()
        last_updated = scrapy.Field(serializer=str)

Note

Those familiar with Django will notice that Scrapy Items are declared similarly to Django Models, except that Scrapy Items are much simpler, as there is no concept of different field types.

Item Fields

Field objects are used to specify metadata for each field. For example, the last_updated field in the example above specifies a serializer function.

You can specify any kind of metadata for each field. There is no restriction on the values accepted by Field objects. For this same reason, there is no reference list of all available metadata keys. Each key defined in Field objects could be used by a different component, and only those components know about it. You can also define and use any other Field key in your project, for your own needs. The main goal of Field objects is to provide a way to define all field metadata in one place. Typically, the components whose behaviour depends on each field use certain field keys to configure that behaviour. Refer to each component's documentation to see which metadata keys it uses.
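As a rough illustration of how a component might consume field metadata, the sketch below applies an optional serializer key to each value. It uses a stand-in Field class (a plain dict subclass) rather than Scrapy itself, and the serialize_values helper is a hypothetical name invented for this example, not a Scrapy function.

```python
class Field(dict):
    """Stand-in for scrapy.Field: a plain dict of metadata."""


# Declared fields, shaped like a mapping of field name -> metadata.
fields = {
    "name": Field(),
    "price": Field(),
    "last_updated": Field(serializer=str),
}


def serialize_values(fields, values):
    """Hypothetical component: apply each field's 'serializer' key, if any."""
    result = {}
    for key, value in values.items():
        # Fall back to the identity function when no serializer is declared.
        serializer = fields[key].get("serializer", lambda v: v)
        result[key] = serializer(value)
    return result


serialized = serialize_values(
    fields,
    {"name": "Desktop PC", "price": 1000, "last_updated": 20240101},
)
print(serialized)
```

Only last_updated is converted to a string here, because it is the only field whose metadata declares a serializer.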

It’s important to note that the Field objects used to declare the item do not stay assigned as class attributes. Instead, they can be accessed through the Item.fields attribute.
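One way such behaviour can be achieved is with a metaclass that moves Field attributes into a fields mapping when the class is created. The following is a simplified sketch under that assumption, not Scrapy's actual implementation; DeclarativeMeta and SketchItem are hypothetical names, and Field is a stand-in dict subclass.

```python
class Field(dict):
    """Stand-in for scrapy.Field: a plain dict of metadata."""


class DeclarativeMeta(type):
    """Hypothetical metaclass that collects Field attributes into `fields`."""

    def __new__(mcs, name, bases, attrs):
        fields = {}
        # Start from the fields declared on parent classes.
        for base in bases:
            fields.update(getattr(base, "fields", {}))
        # Move Field objects out of the class namespace into `fields`.
        for key in list(attrs):
            if isinstance(attrs[key], Field):
                fields[key] = attrs.pop(key)
        attrs["fields"] = fields
        return super().__new__(mcs, name, bases, attrs)


class SketchItem(metaclass=DeclarativeMeta):
    """Hypothetical base class playing the role of Item."""


class Product(SketchItem):
    name = Field()
    last_updated = Field(serializer=str)


print(hasattr(Product, "name"))        # False
print(Product.fields["last_updated"])  # {'serializer': <class 'str'>}
```

The declared names no longer exist as class attributes, but their metadata remains available through the fields mapping.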

Working with Items

Here are some examples of common tasks performed with items, using the Product item declared above. You will notice the API is very similar to the dict API.

Creating items

    >>> product = Product(name='Desktop PC', price=1000)
    >>> print(product)
    Product(name='Desktop PC', price=1000)

Getting field values

    >>> product['name']
    'Desktop PC'
    >>> product.get('name')
    'Desktop PC'

    >>> product['price']
    1000

    >>> product['last_updated']
    Traceback (most recent call last):
        ...
    KeyError: 'last_updated'

    >>> product.get('last_updated', 'not set')
    'not set'

    >>> product['lala']  # getting unknown field
    Traceback (most recent call last):
        ...
    KeyError: 'lala'

    >>> product.get('lala', 'unknown field')
    'unknown field'

    >>> 'name' in product  # is name field populated?
    True

    >>> 'last_updated' in product  # is last_updated populated?
    False

    >>> 'last_updated' in product.fields  # is last_updated a declared field?
    True

    >>> 'lala' in product.fields  # is lala a declared field?
    False

Setting field values

    >>> product['last_updated'] = 'today'
    >>> product['last_updated']
    'today'

    >>> product['lala'] = 'test'  # setting unknown field
    Traceback (most recent call last):
        ...
    KeyError: 'Product does not support field: lala'

Accessing all populated values

To access all populated values, just use the typical dict API:

    >>> product.keys()
    ['price', 'name']

    >>> product.items()
    [('price', 1000), ('name', 'Desktop PC')]

Copying items

To copy an item, you must first decide whether you want a shallow copy or a deep copy.

If your item contains mutable values like lists or dictionaries, a shallow copy will keep references to the same mutable values across all different copies.

For example, if you have an item with a list of tags, and you create a shallow copy of that item, both the original item and the copy have the same list of tags. Adding a tag to the list of one of the items will add the tag to the other item as well.

If that is not the desired behavior, use a deep copy instead.

See the documentation of the copy module for more information.

To create a shallow copy of an item, you can either call copy() on an existing item (product2 = product.copy()) or instantiate your item class from an existing item (product2 = Product(product)).

To create a deep copy, call deepcopy() instead (product2 = product.deepcopy()).
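The difference between the two kinds of copy can be seen with the standard copy module. In this minimal sketch a plain dict stands in for an item with a list of tags; that substitution is an assumption made so the example runs without Scrapy.

```python
import copy

# A plain dict stands in for an item holding a mutable "tags" value.
original = {"name": "Desktop PC", "tags": ["sale"]}

shallow = copy.copy(original)    # shares the same tags list
deep = copy.deepcopy(original)   # gets its own tags list

# Mutate the list through the original.
original["tags"].append("new")

print(shallow["tags"])  # ['sale', 'new']
print(deep["tags"])     # ['sale']
```

The shallow copy sees the appended tag because both objects reference one list, while the deep copy is unaffected.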

Other common tasks

Creating dicts from items:

    >>> dict(product)  # create a dict from all populated values
    {'price': 1000, 'name': 'Desktop PC'}

Creating items from dicts:

    >>> Product({'name': 'Laptop PC', 'price': 1500})
    Product(price=1500, name='Laptop PC')

    >>> Product({'name': 'Laptop PC', 'lala': 1500})  # warning: unknown field in dict
    Traceback (most recent call last):
        ...
    KeyError: 'Product does not support field: lala'

Extending Items

You can extend Items (to add more fields or to change some metadata for some fields) by declaring a subclass of your original Item.

For example:

    class DiscountedProduct(Product):
        discount_percent = scrapy.Field(serializer=str)
        discount_expiration_date = scrapy.Field()

You can also extend field metadata by using the previous field metadata and appending more values, or changing existing values, like this:

    class SpecificProduct(Product):
        name = scrapy.Field(Product.fields['name'], serializer=my_serializer)

That adds (or replaces) the serializer metadata key for the name field, keeping all the previously existing metadata values.
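Because Field objects behave like plain dicts, this merge follows the usual dict constructor semantics: the existing mapping is copied and the keyword arguments override or extend it. The sketch below shows this with a stand-in Field class; my_serializer and the required key are hypothetical, chosen only for illustration.

```python
class Field(dict):
    """Stand-in for scrapy.Field: a plain dict of metadata."""


def my_serializer(value):
    # Hypothetical serializer used only for this illustration.
    return str(value).upper()


# 'required' is a made-up metadata key, not one Scrapy defines.
base_name = Field(serializer=str, required=True)

# Copy the existing metadata, replacing only the serializer key.
specific_name = Field(base_name, serializer=my_serializer)

print(specific_name["serializer"] is my_serializer)  # True
print(specific_name["required"])                     # True
print(base_name["serializer"] is str)                # True
```

The original metadata is left unchanged, so the parent item keeps its own serializer.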

Item objects

  • class scrapy.item.Item([arg])
  • Return a new Item optionally initialized from the given argument.

Items replicate the standard dict API, including its __init__ method, and also provide the following additional API members:

  • copy()
  • Return a shallow copy of this item.
  • deepcopy()
  • Return a deep copy of this item.

  • fields

  • A dictionary containing all declared fields for this Item, not only those populated. The keys are the field names and the values are the Field objects used in the Item declaration.

Field objects

  • class scrapy.item.Field([arg])
  • The Field class is just an alias to the built-in dict class and doesn’t provide any extra functionality or attributes. In other words, Field objects are plain-old Python dicts. A separate class is used to support the item declaration syntax based on class attributes.
  • class scrapy.item.BaseItem
  • Base class for all scraped items.

In Scrapy, an object is considered an item if it is an instance of either BaseItem or dict. For example, when the output of a spider callback is evaluated, only instances of BaseItem or dict are passed to item pipelines.

If you need instances of a custom class to be considered items by Scrapy, you must inherit from either BaseItem or dict.

Unlike instances of dict, instances of BaseItem may be tracked to debug memory leaks.