Datasets

As you get familiar with machine learning and neural networks, you will want to use datasets that have been provided by academia, industry, government, and even other users of Caffe2. Many of these datasets have already been used to train models with Caffe and/or Caffe2, so you can jump right in and start using these pre-trained models. You can also fine-tune or even do “mashups” with pre-trained models by adding additional data, models, parameters, or combinations thereof to train a new custom model for your experiments. If you think you’ve found something great, then don’t hesitate to share! This is an open source project, and we really hope to foster innovation and collaboration.

For further information on datasets and how to prepare them, take a look at the Models and Datasets tutorial. You can also check out a Caffe2 Python tutorial that downloads the MNIST handwriting dataset, unzips it, and calls a Caffe2-provided binary to extract/transform/load (ETL) the data into a database of key-value pairs (KVPs); in this case it uses LevelDB to store the images. The tutorial goes on to show how the dataset is used to train a neural network that can identify handwritten digits. This tutorial is also available as a Jupyter notebook.

You may also want to check out the pre-trained models at Caffe2’s Model Zoo! There you might find examples where these datasets have been used to train models, open source project code you can draw from, and dataset-specific best practices for training models.

| Name | Type | Download |
| --- | --- | --- |
| AlexNet-Places205 | images > places recognition | download |
| AN4: 948 training and 130 test utterances | speech | download |
| BSDS (300/500): 12k labeled segmentations | image segmentation | download images, download segmentations |
| Celeb-A: 200k+ celebrity images, 10k+ identities | celebrity images | download |
| CIFAR-10: 60k tiny (32x32) tagged images | tiny images | download |
| COCO: a large image dataset designed for object detection, segmentation, and caption generation | coco | download |
| CompCars: 136k+ car images & 1716 car models | cars | download |
| Oxford 102 Flowers: 102 flower categories | flowers | download images, download segmentations |
| ImageNet: 14,197,122 images, 21,841 synsets indexed | large range of images | download |
| ImageNet ILSVRC: competition datasets | large range of images | download |
| Iris | flowers > classification | download |
| LSUN Scenes: millions of indoor/outdoor building scenes | scene classification | download |
| LSUN Room Layout: 4000 indoor scenes | scene classification | download |
| MNIST: 60k handwriting training set, 10k test images | handwriting | download |
| Multi-Salient-Object (MSO): 1224 tagged salient object images | tagged objects | download |
| OUI-Adience Face Image: 26,580 age & gender labeled images | age, gender | download |
| PASCAL VOC 2012: 11,530 images w/ 27,450 ROI annotated objects and 6,929 segmentations | images > object recognition | download |
| PCAP: network captures of regular internet traffic and attack scenario traffic | network capture | download |
| Penn Treebank (PTB): statistical language modeling | language | download |
| UCF11/YouTube Action: 11 action categories (basketball shooting, biking/cycling, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog) | video > action | download |
| UCI Datasets | variety | download |
| US Census: demographic data | line graph | download |
| VGG-Face: millions of faces | faces | download |
| LibriSpeech: 1000-hour free speech recognition training dataset | language | download |