From Dogs and Cats to Pet Breeds

In our very first model we learned how to classify dogs versus cats. Just a few years ago this was considered a very challenging task—but today, it’s far too easy! We will not be able to show you the nuances of training models with this problem, because we get a nearly perfect result without worrying about any of the details. But it turns out that the same dataset also allows us to work on a much more challenging problem: figuring out what breed of pet is shown in each image.

In <> we presented the applications as already-solved problems. But this is not how things work in real life. We start with some dataset that we know nothing about. We then have to figure out how it is put together, how to extract the data we need from it, and what that data looks like. For the rest of this book we will be showing you how to solve these problems in practice, including all of the intermediate steps necessary to understand the data that you are working with and test your modeling as you go.

We already downloaded the Pet dataset, and we can get a path to this dataset using the same code as in <>:

In [ ]:

    from fastai.vision.all import *
    path = untar_data(URLs.PETS)

To extract the breed of each pet from its image, we first need to understand how this data is laid out. Such details of data layout are a vital piece of the deep learning puzzle. Data is usually provided in one of these two ways:

  • Individual files representing items of data, such as text documents or images, possibly organized into folders or with filenames representing information about those items
  • A table of data, such as in CSV format, where each row is an item which may include filenames providing a connection between the data in the table and data in other formats, such as text documents and images

There are exceptions to these rules—particularly in domains such as genomics, where there can be binary database formats or even network streams—but overall the vast majority of the datasets you’ll work with will use some combination of these two formats.
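The two common layouts described above can be sketched with nothing but the Python standard library. Everything here is a hypothetical toy dataset built in a temporary directory (the folder names, filenames, and `labels.csv` are assumptions made up for illustration); the point is only that the same (filename, label) pairs can be recovered from either layout.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical toy dataset built in a temporary directory, for illustration only.
root = Path(tempfile.mkdtemp())

# Layout 1: individual files, with the label encoded in the folder name.
for label, name in [("cat", "a.jpg"), ("dog", "b.jpg")]:
    (root / "by_folder" / label).mkdir(parents=True, exist_ok=True)
    (root / "by_folder" / label / name).touch()

folder_items = sorted(
    (p.name, p.parent.name) for p in (root / "by_folder").rglob("*.jpg")
)

# Layout 2: a CSV table whose rows name the files and carry the label.
table_path = root / "labels.csv"
with table_path.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "label"])
    writer.writerows([["a.jpg", "cat"], ["b.jpg", "dog"]])

with table_path.open(newline="") as f:
    table_items = sorted((row["filename"], row["label"]) for row in csv.DictReader(f))

print(folder_items)
print(table_items)
```

Both lists come out the same, which is why the first job with any new dataset is working out which of these layouts (or which mixture of them) holds the labels.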

To see what is in our dataset we can use the ls method:

In [ ]:

    #hide
    Path.BASE_PATH = path

In [ ]:

    path.ls()

Out[ ]:

    (#3) [Path('annotations'),Path('images'),Path('models')]
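If you want to explore a directory without fastai at hand, a similar listing can be produced with plain `pathlib`. This is a rough sketch, not how `ls` is implemented: `demo_root` is a hypothetical stand-in for the dataset root, created in a temporary directory so the example runs anywhere.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the dataset root; the real one comes from untar_data.
demo_root = Path(tempfile.mkdtemp())
for sub in ("annotations", "images", "models"):
    (demo_root / sub).mkdir()

# iterdir() yields the entries of a directory; sorting makes the order stable.
entries = sorted(p.relative_to(demo_root) for p in demo_root.iterdir())
print(f"(#{len(entries)}) {[str(e) for e in entries]}")
```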

We can see that this dataset provides us with images and annotations directories. The website for the dataset tells us that the annotations directory contains information about where the pets are rather than what they are. In this chapter, we will be doing classification, not localization, which is to say that we care about what the pets are, not where they are. Therefore, we will ignore the annotations directory for now. So, let’s have a look inside the images directory:

In [ ]:

    (path/"images").ls()

Out[ ]:

    (#7394) [Path('images/great_pyrenees_173.jpg'),Path('images/wheaten_terrier_46.jpg'),Path('images/Ragdoll_262.jpg'),Path('images/german_shorthaired_3.jpg'),Path('images/american_bulldog_196.jpg'),Path('images/boxer_188.jpg'),Path('images/staffordshire_bull_terrier_173.jpg'),Path('images/basset_hound_71.jpg'),Path('images/staffordshire_bull_terrier_37.jpg'),Path('images/yorkshire_terrier_18.jpg')...]

Most functions and methods in fastai that return a collection use a class called L. L can be thought of as an enhanced version of the ordinary Python list type, with added conveniences for common operations. For instance, when we display an object of this class in a notebook, it appears in the format shown in the preceding output. The first thing shown is the number of items in the collection, prefixed with a #. You’ll also see that the list is suffixed with an ellipsis. This means that only the first few items are displayed, which is a good thing, because we would not want more than 7,000 filenames on our screen!
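To make that display format concrete, here is a toy function that mimics it. It is not the L class or its implementation: `summarize` and its `max_shown` parameter are hypothetical names, and the cutoff of 10 is simply what the output above appears to use.

```python
def summarize(items, max_shown=10):
    """Format a collection as '(#count) [first items...]' (toy imitation only)."""
    items = list(items)
    shown = ",".join(repr(x) for x in items[:max_shown])
    # The ellipsis signals that some items were left out of the display.
    suffix = "..." if len(items) > max_shown else ""
    return f"(#{len(items)}) [{shown}{suffix}]"

print(summarize(range(3)))     # (#3) [0,1,2]
print(summarize(range(7394)))  # (#7394) [0,1,2,3,4,5,6,7,8,9...]
```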

By examining these filenames, we can see how they appear to be structured. Each filename contains the pet breed, and then an underscore (_), a number, and finally the file extension. We need to create a piece of code that extracts the breed from a single Path. Jupyter notebooks make this easy, because we can gradually build up something that works, and then use it for the entire dataset. We do have to be careful to not make too many assumptions at this point. For instance, if you look carefully you may notice that some of the pet breeds contain multiple words, so we cannot simply break at the first _ character that we find. To allow us to test our code, let’s pick out one of these filenames:

In [ ]:

    fname = (path/"images").ls()[0]
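Before reaching for anything more powerful, it is worth seeing why breaking at the first underscore fails. The short sketch below uses three filenames copied from the listing above and compares splitting at the first underscore with splitting at the last one, using only built-in string methods.

```python
names = [
    "great_pyrenees_173.jpg",
    "Ragdoll_262.jpg",
    "staffordshire_bull_terrier_37.jpg",
]

# Breaking at the first underscore truncates multi-word breeds.
first_split = [n.split("_", 1)[0] for n in names]

# Breaking at the last underscore keeps the whole breed name.
last_split = [n.rsplit("_", 1)[0] for n in names]

print(first_split)  # ['great', 'Ragdoll', 'staffordshire']
print(last_split)   # ['great_pyrenees', 'Ragdoll', 'staffordshire_bull_terrier']
```

Splitting at the last underscore gives the right answer for these names, but it never checks that what follows really is a number and a file extension, so a stray file with a different naming scheme would slip through unnoticed.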

The most powerful and flexible way to extract information from strings like this is to use a regular expression, also known as a regex. A regular expression is a special string, written in the regular expression language, which specifies a general rule for deciding if another string passes a test (i.e., “matches” the regular expression), and also possibly for plucking a particular part or parts out of that other string.
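As a small illustration of both uses (testing whether a string matches, and plucking part of it out), here is a sketch with Python's built-in `re` module. The pattern and the strings `item-42` and `item42` are made-up examples, not anything from the Pet dataset.

```python
import re

# One capture group (lowercase letters), then a dash, then one or more digits.
pattern = re.compile(r"([a-z]+)-\d+")

passes = pattern.fullmatch("item-42") is not None  # matches the rule
fails = pattern.fullmatch("item42") is not None    # no dash, so no match
captured = pattern.fullmatch("item-42").group(1)   # the part inside the parentheses

print(passes, fails, captured)  # True False item
```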

In this case, we need a regular expression that extracts the pet breed from the filename.

We do not have the space to give you a complete regular expression tutorial here, but there are many excellent ones online and we know that many of you will already be familiar with this wonderful tool. If you’re not, that is totally fine—this is a great opportunity for you to rectify that! We find that regular expressions are one of the most useful tools in our programming toolkit, and many of our students tell us that this is one of the things they are most excited to learn about. So head over to Google and search for “regular expressions tutorial” now, and then come back here after you’ve had a good look around. The book’s website also provides a list of our favorites.

a: Not only are regular expressions dead handy, but they also have interesting roots. They are “regular” because they were originally examples of a “regular” language, the lowest rung within the Chomsky hierarchy, a grammar classification developed by linguist Noam Chomsky, who also wrote Syntactic Structures, the pioneering work searching for the formal grammar underlying human language. This is one of the charms of computing: it may be that the hammer you reach for every day in fact came from a spaceship.

When you are writing a regular expression, the best way to start is just to try it against one example at first. Let’s use the findall method to try a regular expression against the filename of the fname object:

In [ ]:

    re.findall(r'(.+)_\d+.jpg$', fname.name)

Out[ ]:

    ['great_pyrenees']

This regular expression plucks out all the characters leading up to the last underscore character, as long as the subsequent characters are numerical digits followed by the JPEG file extension.
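One example is a good start, but it is cheap to try the same pattern on a few more names, including one that should not match. The sketch below does that with the standard `re` module. The `strict` variant, which escapes the dot before `jpg`, is our own optional tightening rather than something the text uses: an unescaped `.` matches any character, so the original pattern would also accept a name such as the made-up `great_pyrenees_173xjpg`.

```python
import re

pattern = r'(.+)_\d+.jpg$'   # the pattern used in the text
strict = r'(.+)_\d+\.jpg$'   # optional variant: the dot must be a literal '.'

names = [
    "great_pyrenees_173.jpg",
    "Ragdoll_262.jpg",
    "staffordshire_bull_terrier_37.jpg",
]

# The greedy (.+) runs as far as it can, so it stops at the last underscore.
breeds = [re.findall(pattern, n)[0] for n in names]
print(breeds)

# A name that does not fit the scheme produces an empty list.
no_match = re.findall(pattern, "notes.txt")

# Hypothetical malformed name: accepted by the original, rejected by the strict variant.
loose_hit = re.findall(pattern, "great_pyrenees_173xjpg")
strict_hit = re.findall(strict, "great_pyrenees_173xjpg")
print(no_match, loose_hit, strict_hit)
```

For the filenames in this dataset the two patterns behave identically, so the difference only matters if you reuse the pattern on messier data.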

Now that we have confirmed the regular expression works for the example, let’s use it to label the whole dataset. fastai comes with many classes to help with labeling. For labeling with regular expressions, we can use the RegexLabeller class. In this example we use the data block API we saw in <> (in fact, we nearly always use the data block API, because it is so much more flexible than the simple factory methods we saw in <>):

In [ ]:

    pets = DataBlock(blocks=(ImageBlock, CategoryBlock),
                     get_items=get_image_files,
                     splitter=RandomSplitter(seed=42),
                     get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
                     item_tfms=Resize(460),
                     batch_tfms=aug_transforms(size=224, min_scale=0.75))
    dls = pets.dataloaders(path/"images")
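To build some intuition for what the labelling and splitting arguments are asked to do, here is a rough plain-Python analogue. It is not fastai's implementation: `make_regex_labeller`, `random_split`, the `valid_fraction=0.2` hold-out, and the `beagle`/`Ragdoll` file list are all assumptions made up for this sketch.

```python
import random
import re
from pathlib import Path

def make_regex_labeller(pattern):
    """Return a function mapping a string to the first capture group of `pattern`."""
    compiled = re.compile(pattern)
    def label(text):
        match = compiled.search(text)
        if match is None:
            raise ValueError(f"no label found in {text!r}")
        return match.group(1)
    return label

def random_split(items, valid_fraction=0.2, seed=42):
    """Shuffle indices reproducibly and return (train_indices, valid_indices)."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = int(len(items) * valid_fraction)
    return indices[cut:], indices[:cut]

# Hypothetical file list standing in for the real images directory.
files = [Path(f"images/beagle_{i}.jpg") for i in range(1, 6)]
files += [Path(f"images/Ragdoll_{i}.jpg") for i in range(1, 6)]

# Label from the `name` attribute of each path, as the get_y argument does above.
label_from_name = make_regex_labeller(r'(.+)_\d+.jpg$')
labels = [label_from_name(f.name) for f in files]

train_idx, valid_idx = random_split(files)
print(labels[0], labels[-1], len(train_idx), len(valid_idx))
```

Fixing the seed means the same items land in the validation set on every run, which keeps results comparable as you change other parts of the model.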

One important piece of this DataBlock call that we haven’t seen before is in these two lines:

    item_tfms=Resize(460),
    batch_tfms=aug_transforms(size=224, min_scale=0.75)

These lines implement a fastai data augmentation strategy which we call presizing. Presizing is a particular way to do image augmentation that is designed to minimize data destruction while maintaining good performance.