Data block


High-level API to quickly get your data into a DataLoaders


class TransformBlock[source]

TransformBlock(type_tfms=None, item_tfms=None, batch_tfms=None, dl_type=None, dls_kwargs=None)

A basic wrapper that links default transforms for the data block API

CategoryBlock[source]

CategoryBlock(vocab=None, sort=True, add_na=False)

TransformBlock for single-label categorical targets

MultiCategoryBlock[source]

MultiCategoryBlock(encoded=False, vocab=None, add_na=False)

TransformBlock for multi-label categorical targets

RegressionBlock[source]

RegressionBlock(n_out=None)

TransformBlock for float targets
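To give a concrete picture of what CategoryBlock's underlying Categorize transform does, here is a minimal pure-Python sketch (illustrative only, not fastai code): build a vocab from the labels, optionally sorted and with an `#na#` entry, then map each label to its integer index.

```python
# Toy sketch of single-label categorization, in the spirit of
# CategoryBlock's Categorize transform (not the fastai implementation).
def build_vocab(labels, sort=True, add_na=False):
    # Unique labels, sorted by default (mirroring CategoryBlock's sort=True)
    vocab = sorted(set(labels)) if sort else list(dict.fromkeys(labels))
    if add_na:
        vocab = ['#na#'] + vocab  # reserve index 0 for "not available"
    return vocab

def categorize(label, vocab):
    # Encode a label as its index in the vocab
    return vocab.index(label)

vocab = build_vocab(['cat', 'dog', 'cat', 'bird'])
print(vocab)                      # ['bird', 'cat', 'dog']
print(categorize('dog', vocab))   # 2
```

MultiCategoryBlock works analogously but encodes a set of labels per item (e.g. as a one-hot vector when encoded=False input lists are given).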

General API

```python
from fastai.vision.core import *
from fastai.vision.data import *
```

class DataBlock[source]

DataBlock(blocks=None, dl_type=None, getters=None, n_inp=None, item_tfms=None, batch_tfms=None, get_items=None, splitter=None, get_y=None, get_x=None)

Generic container to quickly build Datasets and DataLoaders

To build a DataBlock you need to give the library four things: the types of your inputs and labels, and at least two functions: get_items and splitter. You may also need to include get_x and get_y, or a more generic list of getters that are applied to the results of get_items.
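To make the role of the splitter concrete, here is a plain-Python sketch in the spirit of GrandparentSplitter (illustrative only, not the fastai implementation): given the items returned by get_items, it returns two lists of indices, one for the training set and one for the validation set, chosen by each path's grandparent folder name.

```python
# Sketch of a GrandparentSplitter-style splitter (not fastai code):
# a splitter maps a list of items to (train_idxs, valid_idxs).
def grandparent_splitter(train_name='train', valid_name='valid'):
    def _split(items):
        train = [i for i, p in enumerate(items) if p.split('/')[-3] == train_name]
        valid = [i for i, p in enumerate(items) if p.split('/')[-3] == valid_name]
        return train, valid
    return _split

files = ['train/3/a.png', 'valid/3/b.png', 'train/7/c.png']
split = grandparent_splitter()
print(split(files))  # ([0, 2], [1])
```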

Once those are provided, you automatically get a Datasets or a DataLoaders:

DataBlock.datasets[source]

DataBlock.datasets(source, verbose=False)

Create a Datasets object from source

DataBlock.dataloaders[source]

DataBlock.dataloaders(source, path='.', verbose=False, bs=64, shuffle=False, num_workers=None, do_setup=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, wif=None, before_iter=None, after_item=None, before_batch=None, after_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None)

Create a DataLoaders object from source

You can create a DataBlock by passing functions:

```python
mnist = DataBlock(blocks = (ImageBlock(cls=PILImageBW), CategoryBlock),
                  get_items = get_image_files,
                  splitter = GrandparentSplitter(),
                  get_y = parent_label)
```

Each type comes with default transforms that will be applied:

  • at the base level to create items in a tuple (usually input,target) from the base elements (like filenames)
  • at the item level of the datasets
  • at the batch level

They are called, respectively, type transforms, item transforms, and batch transforms. In the case of MNIST, the type transforms are the method to create a PILImageBW (for the input) and the Categorize transform (for the target), the item transform is ToTensor, and the batch transforms are Cuda and IntToFloatTensor. You can add any other transforms by passing them to DataBlock.datasets or DataBlock.dataloaders.
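The three levels can be pictured as a staged pipeline; here is a toy plain-Python sketch of the idea (illustrative only, not fastai internals): type transforms build the (input, target) tuple from a raw element, item transforms run on each tuple, and batch transforms run once on the collated batch.

```python
# Toy sketch of the three transform levels (not fastai internals).
def apply_type_tfms(filename):
    # Type level: build (input, target) from the raw element,
    # e.g. "load" an image and read the label from the parent folder name
    return ('img:' + filename, filename.split('/')[-2])

def apply_item_tfms(item):
    # Item level: per-item conversion (stands in for e.g. ToTensor)
    x, y = item
    return (x.upper(), y)

def apply_batch_tfms(batch):
    # Batch level: operate on the whole collated batch
    # (stands in for e.g. IntToFloatTensor)
    return [(x, y) for x, y in batch]

files = ['train/3/a.png', 'train/7/b.png']
items = [apply_item_tfms(apply_type_tfms(f)) for f in files]
batch = apply_batch_tfms(items)
print(batch[0])  # ('IMG:TRAIN/3/A.PNG', '3')
```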

```python
test_eq(mnist.type_tfms[0], [PILImageBW.create])
test_eq(mnist.type_tfms[1].map(type), [Categorize])
test_eq(mnist.default_item_tfms.map(type), [ToTensor])
test_eq(mnist.default_batch_tfms.map(type), [IntToFloatTensor])
```

```python
dsets = mnist.datasets(untar_data(URLs.MNIST_TINY))
test_eq(dsets.vocab, ['3', '7'])
x,y = dsets.train[0]
test_eq(x.size, (28,28))
show_at(dsets.train, 0, cmap='Greys', figsize=(2,2));
```

[Image: output of show_at, the first training digit rendered in greyscale]

```python
test_fail(lambda: DataBlock(wrong_kwarg=42, wrong_kwarg2='foo'))
```

We can pass any number of blocks to DataBlock, and then define which blocks are inputs and which are targets by setting n_inp. For example, n_inp=2 means the first two blocks passed are considered inputs and the others targets.

```python
mnist = DataBlock((ImageBlock, ImageBlock, CategoryBlock), get_items=get_image_files, splitter=GrandparentSplitter(),
                  get_y=parent_label)
dsets = mnist.datasets(untar_data(URLs.MNIST_TINY))
test_eq(mnist.n_inp, 2)
test_eq(len(dsets.train[0]), 3)
```

```python
test_fail(lambda: DataBlock((ImageBlock, ImageBlock, CategoryBlock), get_items=get_image_files, splitter=GrandparentSplitter(),
                            get_y=[parent_label, noop],
                            n_inp=2), msg='get_y contains 2 functions, but must contain 1 (one for each output)')
```

```python
mnist = DataBlock((ImageBlock, ImageBlock, CategoryBlock), get_items=get_image_files, splitter=GrandparentSplitter(),
                  n_inp=1,
                  get_y=[noop, Pipeline([noop, parent_label])])
dsets = mnist.datasets(untar_data(URLs.MNIST_TINY))
test_eq(len(dsets.train[0]), 3)
```

Debugging

DataBlock.summary[source]

DataBlock.summary(source, bs=4, show_batch=False, **kwargs)

Steps through the transform pipeline for one batch, and optionally calls show_batch(**kwargs) on the transient DataLoaders.

Besides stepping through the transforms, summary() provides a shortcut to dls.show_batch(...) for inspecting the data. E.g.

```python
pets.summary(path/"images", bs=8, show_batch=True, unique=True,...)
```

is a shortcut to:

```python
pets.summary(path/"images", bs=8)
dls = pets.dataloaders(path/"images", bs=8)
dls.show_batch(unique=True,...)  # See different tfms effect on the same image.
```


©2021 fast.ai. All rights reserved.
Site last generated: Mar 31, 2021