External data


Helper functions to download the fastai datasets


A complete list of the datasets available by default inside the library is:

Main datasets:

  1. ADULT_SAMPLE: A small sample of the adults dataset to predict whether income exceeds $50K/yr based on census data.
  2. BIWI_SAMPLE: A BIWI kinect headpose database. The dataset contains over 15K images of 20 people (6 females and 14 males - 4 people were recorded twice). For each frame, a depth image, the corresponding rgb image (both 640x480 pixels), and the annotation are provided. The head pose range covers about +-75 degrees yaw and +-60 degrees pitch.
  3. CIFAR: The famous CIFAR-10 dataset, which consists of 60000 32x32 colour images in 10 classes, with 6000 images per class.
  4. COCO_SAMPLE: A sample of the COCO dataset for object detection.
  5. COCO_TINY: A tiny version of the COCO dataset for object detection.
  6. HUMAN_NUMBERS: A synthetic dataset consisting of human number counts in text such as one, two, three, four. Useful for experimenting with language models.
  7. IMDB: The full IMDB sentiment analysis dataset.
  8. IMDB_SAMPLE: A sample of the full IMDB sentiment analysis dataset.
  9. ML_SAMPLE: A MovieLens sample dataset for recommendation engines to recommend movies to users.
  10. ML_100k: The MovieLens 100k dataset for recommendation engines to recommend movies to users.
  11. MNIST_SAMPLE: A sample of the famous MNIST dataset consisting of handwritten digits.
  12. MNIST_TINY: A tiny version of the famous MNIST dataset consisting of handwritten digits.
  13. MNIST_VAR_SIZE_TINY:
  14. PLANET_SAMPLE: A sample of the planets dataset from the Kaggle competition Planet: Understanding the Amazon from Space.
  15. PLANET_TINY: A tiny version of the planets dataset from the Kaggle competition Planet: Understanding the Amazon from Space, for faster experimentation and prototyping.
  16. IMAGENETTE: A smaller version of the ImageNet dataset, pronounced just like 'Imagenet', except with a corny inauthentic French accent.
  17. IMAGENETTE_160: The 160px version of the Imagenette dataset.
  18. IMAGENETTE_320: The 320px version of the Imagenette dataset.
  19. IMAGEWOOF: Imagewoof is a subset of 10 classes from ImageNet that aren't so easy to classify, since they're all dog breeds.
  20. IMAGEWOOF_160: The 160px version of the Imagewoof dataset.
  21. IMAGEWOOF_320: The 320px version of the Imagewoof dataset.
  22. IMAGEWANG: Imagewang contains Imagenette and Imagewoof combined, but with some twists that make it into a tricky semi-supervised unbalanced classification problem.
  23. IMAGEWANG_160: The 160px version of the Imagewang dataset.
  24. IMAGEWANG_320: The 320px version of the Imagewang dataset.

Kaggle competition datasets:

  1. DOGS: Image dataset consisting of dog and cat images from the Dogs vs Cats Kaggle competition.

Image Classification datasets:

  1. CALTECH_101: Pictures of objects belonging to 101 categories. About 40 to 800 images per category. Most categories have about 50 images. Collected in September 2003 by Fei-Fei Li, Marco Andreetto, and Marc'Aurelio Ranzato.
  2. CARS: The Cars dataset contains 16,185 images of 196 classes of cars.
  3. CIFAR_100: The CIFAR-100 dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class.
  4. CUB_200_2011: Caltech-UCSD Birds-200-2011 (CUB-200-2011) is an extended version of the CUB-200 dataset, with roughly double the number of images per class and new part location annotations.
  5. FLOWERS: A 17 category flower dataset gathered from images on various websites.
  6. FOOD:
  7. MNIST: MNIST dataset consisting of handwritten digits.
  8. PETS: A 37 category pet dataset with roughly 200 images for each class.

NLP datasets:

  1. AG_NEWS: The AG News corpus consists of news articles from the AG’s corpus of news articles on the web pertaining to the 4 largest classes. The dataset contains 30,000 training and 1,900 testing examples for each class.
  2. AMAZON_REVIEWS: This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
  3. AMAZON_REVIEWS_POLARITY: Amazon reviews dataset for sentiment analysis.
  4. DBPEDIA: The DBpedia ontology dataset contains 560,000 training samples and 70,000 testing samples for each of 14 nonoverlapping classes from DBpedia.
  5. MT_ENG_FRA: Machine translation dataset from English to French.
  6. SOGOU_NEWS: The Sogou-SRR (Search Result Relevance) dataset was constructed to support research on search engine relevance estimation and ranking tasks.
  7. WIKITEXT: The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
  8. WIKITEXT_TINY: A tiny version of the WIKITEXT dataset.
  9. YAHOO_ANSWERS: YAHOO’s question answers dataset.
  10. YELP_REVIEWS: The Yelp dataset is a subset of YELP businesses, reviews, and user data for use in personal, educational, and academic purposes.
  11. YELP_REVIEWS_POLARITY: For sentiment classification on YELP reviews.

Image localization datasets:

  1. BIWI_HEAD_POSE: A BIWI kinect headpose database. The dataset contains over 15K images of 20 people (6 females and 14 males - 4 people were recorded twice). For each frame, a depth image, the corresponding rgb image (both 640x480 pixels), and the annotation are provided. The head pose range covers about +-75 degrees yaw and +-60 degrees pitch.
  2. CAMVID: A labelled driving dataset for segmentation models.
  3. CAMVID_TINY: A tiny CamVid dataset for segmentation models.
  4. LSUN_BEDROOMS: Large-scale Image Dataset using Deep Learning with Humans in the Loop
  5. PASCAL_2007: Pascal 2007 dataset to recognize objects from a number of visual object classes in realistic scenes.
  6. PASCAL_2012: Pascal 2012 dataset to recognize objects from a number of visual object classes in realistic scenes.

Audio classification:

  1. MACAQUES: 7285 macaque coo calls across 8 individuals from Distributed acoustic cues for caller identity in macaque vocalization.
  2. ZEBRA_FINCH: 3405 zebra finch calls classified across 11 call types. Additional labels include name of individual making the vocalization and its age.

Medical imaging datasets:

  1. SIIM_SMALL: A smaller version of the SIIM dataset where the objective is to classify pneumothorax from a set of chest radiographic images.
  2. TCGA_SMALL: A smaller version of the TCGA-OV dataset with subcutaneous and visceral fat segmentations. Citations:

    Holback, C., Jarosz, R., Prior, F., Mutch, D. G., Bhosale, P., Garcia, K., … Erickson, B. J. (2016). Radiology Data from The Cancer Genome Atlas Ovarian Cancer [TCGA-OV] collection. The Cancer Imaging Archive. http://doi.org/10.7937/K9/TCIA.2016.NDO1MDFQ

    Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L, Prior F. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository, Journal of Digital Imaging, Volume 26, Number 6, December, 2013, pp 1045-1057. https://link.springer.com/article/10.1007/s10278-013-9622-7

Pretrained models:

  1. OPENAI_TRANSFORMER: The GPT2 Transformer pretrained weights.
  2. WT103_FWD: The WikiText-103 forward language model weights.
  3. WT103_BWD: The WikiText-103 backward language model weights.

To download any of the datasets or pretrained weights, simply run untar_data, passing any dataset name mentioned above, like so:

```python
path = untar_data(URLs.PETS)
path.ls()
```

```
(#7393) [Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/keeshond_34.jpg'),...]
```

To download model pretrained weights:

```python
path = untar_data(URLs.WT103_BWD)
path.ls()
```

```
(#2) [Path('/home/ubuntu/.fastai/data/wt103-bwd/itos_wt103.pkl'),Path('/home/ubuntu/.fastai/data/wt103-bwd/lstm_bwd.pth')]
```

Config[source]

Config(cfg_name='settings.ini')

Reading and writing settings.ini

If a config file doesn't already exist, it is created at ~/.fastai/config.yml by default whenever an instance of the Config class is created. Here is a quick example to explain:

```python
config_file = Path("~/.fastai/config.yml").expanduser()
if config_file.exists(): os.remove(config_file)
assert not config_file.exists()

config = Config()
assert config_file.exists()
```

The config is now available as config.d:

```python
config.d
```

```
{'archive_path': '/home/jhoward/.fastai/archive',
 'data_path': '/home/jhoward/.fastai/data',
 'model_path': '/home/jhoward/.fastai/models',
 'storage_path': '/tmp',
 'version': 2}
```

As can be seen, this is a basic config file consisting of data_path, model_path, storage_path and archive_path. All future downloads occur at the paths defined in the config file based on the type of download. For example, all future fastai datasets are downloaded to data_path, while all pretrained model weights are downloaded to model_path, unless the default download location is updated.

Please note that it is possible to update the default path locations in the config file. Let's first back up the config file, then update the config to show the changes, and finally restore the original config from the backup.

```python
config_bak = Path("~/.fastai/config.bak").expanduser() # assumed backup path for this example
if config_file.exists(): shutil.move(config_file, config_bak)
config['archive_path'] = Path(".")
config.save()
```

```python
config = Config()
config.d
```

```
{'archive_path': '.',
 'data_archive_path': '/home/jhoward/.fastai/data',
 'data_path': '/home/jhoward/.fastai/data',
 'model_path': '/home/jhoward/.fastai/models',
 'storage_path': '/tmp',
 'version': 2}
```

The archive_path has been updated to ".". Now let's revert the changes we made for the purpose of this example.

```python
if config_bak.exists(): shutil.move(config_bak, config_file)
config = Config()
config.d
```

```
{'archive_path': '/home/jhoward/.fastai/archive',
 'data_archive_path': '/home/jhoward/.fastai/data',
 'data_path': '/home/jhoward/.fastai/data',
 'model_path': '/home/jhoward/.fastai/models',
 'storage_path': '/tmp',
 'version': 2}
```

class URLs[source]

URLs()

Global constants for dataset and model URLs.

The default local path is at ~/.fastai/archive/ but this can be updated by passing a different c_key. Note: c_key should be one of 'archive_path', 'data_archive_path', 'data_path', 'model_path', 'storage_path'.

```python
url = URLs.PETS
local_path = URLs.path(url)
test_eq(local_path.parent, Config()['archive']);
local_path
```

```
Path('/home/jhoward/.fastai/archive/oxford-iiit-pet.tgz')
```

```python
local_path = URLs.path(url, c_key='model')
test_eq(local_path.parent, Config()['model'])
local_path
```

```
Path('/home/jhoward/.fastai/models/oxford-iiit-pet.tgz')
```

Downloading

download_url[source]

download_url(url, dest, overwrite=False, pbar=None, show_progress=True, chunk_size=1048576, timeout=4, retries=5)

Download url to dest unless it exists and not overwrite

download_url is a very handy function inside fastai! It can be used to download any file from the internet to the location passed in the dest argument - it is not limited to downloading fastai files. As an example, let's download an image from the web:

```python
fname = Path("./dog.jpg")
if fname.exists(): os.remove(fname)
url = "https://i.insider.com/569fdd9ac08a80bd448b7138?width=1100&format=jpeg&auto=webp"
download_url(url, fname)
assert fname.exists()
```

Let’s confirm that the file was indeed downloaded correctly.

```python
from PIL import Image
im = Image.open(fname)
plt.imshow(im);
```


As can be seen, the file has been downloaded to the local path provided in the dest argument. Calling the function again doesn't trigger a download since the file is already there. This can be confirmed by checking that the file's last modified time doesn't get updated.

```python
if fname.exists(): last_modified_time = os.path.getmtime(fname)
download_url(url, fname)
test_eq(os.path.getmtime(fname), last_modified_time)
if fname.exists(): os.remove(fname)
```

We can also use the download_url function to download the pets dataset straight from the source by simply passing https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz in url.

download_data[source]

download_data(url, fname=None, c_key='archive', force_download=False, timeout=4)

Download url to fname.

download_data is a convenience function and a wrapper around download_url that downloads fastai files to the appropriate local path based on the c_key.

If fname is None, it will default to the archive folder you have in your config file (or data, model if you specify a different c_key) followed by the last part of the url: for instance URLs.MNIST_SAMPLE is http://files.fast.ai/data/examples/mnist_sample.tgz and the default value for fname will be ~/.fastai/archive/mnist_sample.tgz.
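The default fname derivation described above can be sketched roughly as follows; default_archive_fname is a hypothetical helper written for this illustration, not part of fastai:

```python
from pathlib import Path

def default_archive_fname(url, archive_dir="~/.fastai/archive"):
    # Hypothetical helper illustrating the default: the archive folder
    # from the config, joined with the last part of the url.
    return Path(archive_dir).expanduser()/url.split("/")[-1]

default_archive_fname("http://files.fast.ai/data/examples/mnist_sample.tgz")
```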

If force_download=True, the file is always downloaded. Otherwise, the download is only triggered when the file doesn't exist.

Extract

file_extract[source]

file_extract(fname, dest=None)

Extract fname to dest using tarfile or zipfile.

file_extract is used by default in untar_data to decompress the downloaded file.
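A minimal sketch of this tar/zip dispatch might look like the following; simple_extract is an illustrative stand-in, not fastai's actual implementation:

```python
import tarfile, zipfile
from pathlib import Path

def simple_extract(fname, dest=None):
    # Illustrative stand-in for file_extract: dispatch on archive type,
    # extracting next to the archive when no dest is given.
    fname = Path(fname)
    if dest is None: dest = fname.parent
    if tarfile.is_tarfile(fname):
        with tarfile.open(fname, "r:*") as tf: tf.extractall(dest)
    elif zipfile.is_zipfile(fname):
        with zipfile.ZipFile(fname) as zf: zf.extractall(dest)
    else:
        raise Exception(f"Unrecognized archive type: {fname}")
```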

newest_folder[source]

newest_folder(path)

Return newest folder on path
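The idea can be sketched as below; newest_subfolder is a hypothetical equivalent written for illustration, not the library function itself:

```python
import os
from pathlib import Path

def newest_subfolder(path):
    # Hypothetical equivalent of newest_folder: the most recently
    # modified entry directly under path.
    return max(Path(path).iterdir(), key=os.path.getmtime)
```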

rename_extracted[source]

rename_extracted(dest)

Rename file if different from dest

Let's rename the extracted data if the dest name is different from fname.
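That logic can be sketched like this; rename_newest is illustrative only, not the actual implementation:

```python
import os
from pathlib import Path

def rename_newest(dest):
    # Illustrative sketch: find the most recently modified entry next to
    # dest; if its name differs from dest, rename it to dest so downstream
    # code finds the expected path.
    dest = Path(dest)
    extracted = max(dest.parent.iterdir(), key=os.path.getmtime)
    if extracted != dest: extracted.rename(dest)
    return dest
```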

untar_data[source]

untar_data(url, fname=None, dest=None, c_key='data', force_download=False, extract_func=file_extract, timeout=4)

Download url to fname if dest doesn’t exist, and un-tgz or unzip to folder dest.

untar_data is a very powerful convenience function to download files from url to dest. The url can be a default url from the URLs class or a custom url. If dest is not passed, files are downloaded to the default_dest, which defaults to ~/.fastai/data/.

This convenience function extracts the downloaded files to dest by default. To simply download the files without extracting them, pass the noop function as extract_func.

Note, it is also possible to pass a custom extract_func to untar_data if the filetype doesn't end with .tgz or .zip. Tgz and zip files are supported by default, and there is no need to pass a custom extract_func for these types of files.
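For instance, a custom extract_func for a plain .gz file (which the default handler doesn't cover) might look like this; gz_extract is an assumption modeled on file_extract's signature, not a fastai function:

```python
import gzip, shutil
from pathlib import Path

def gz_extract(fname, dest):
    # Hypothetical extract_func for a single gzip-compressed file
    # (e.g. data.csv.gz -> dest/data.csv); mirrors file_extract's signature.
    fname = Path(fname)
    out = Path(dest)/fname.with_suffix("").name
    with gzip.open(fname, "rb") as f_in, open(out, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return out
```

Such a function could then be passed as extract_func so untar_data calls it in place of file_extract.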

Internally, if the files are not already available at the fname location (which defaults to ~/.fastai/archive/), they are downloaded to ~/.fastai/archive and then extracted at the dest location. If no dest is passed, the default_dest used to extract the files is ~/.fastai/data. If the files are already available at the fname location but not at dest, a symbolic link is created for each file from fname to dest.

Also, if force_download is set to True, files are re-downloaded even if they already exist.

```python
from tempfile import TemporaryDirectory
```

```python
test_eq(untar_data(URLs.MNIST_SAMPLE), config.data/'mnist_sample')

with TemporaryDirectory() as d:
    d = Path(d)
    dest = untar_data(URLs.MNIST_TINY, fname='mnist_tiny.tgz', dest=d, force_download=True)
    assert Path('mnist_tiny.tgz').exists()
    assert (d/'mnist_tiny').exists()
    os.unlink('mnist_tiny.tgz')

#Test c_key
tst_model = config.model/'mnist_sample'
test_eq(untar_data(URLs.MNIST_SAMPLE, c_key='model'), tst_model)
assert not tst_model.with_suffix('.tgz').exists() #Archive wasn't downloaded in the models path
assert (config.archive/'mnist_sample.tgz').exists() #Archive was downloaded there
shutil.rmtree(tst_model)
```

Sometimes the extracted folder does not have the same name as the downloaded file.

```python
with TemporaryDirectory() as d:
    d = Path(d)
    untar_data(URLs.MNIST_TINY, fname='mnist_tiny.tgz', dest=d, force_download=True)
    Path('mnist_tiny.tgz').rename('nims_tini.tgz')
    p = Path('nims_tini.tgz')
    dest = Path('nims_tini')
    assert p.exists()
    file_extract(p, dest.parent)
    rename_extracted(dest)
    p.unlink()
    shutil.rmtree(dest)
```


©2021 fast.ai. All rights reserved.
Site last generated: Mar 31, 2021