Your First Model

As we said before, we will teach you how to do things before we explain why they work. Following this top-down approach, we will begin by actually training an image classifier to recognize dogs and cats with almost 100% accuracy. To train this model and run our experiments, you will need to do some initial setup. Don’t worry, it’s not as hard as it looks.

S: Do not skip the setup part even if it looks intimidating at first, especially if you have little or no experience using things like a terminal or the command line. Most of that is not actually necessary, and you will find that the easiest servers can be set up with just your usual web browser. To learn, it is crucial that you run your own experiments in parallel with this book.

Getting a GPU Deep Learning Server

To do nearly everything in this book, you’ll need access to a computer with an NVIDIA GPU (unfortunately other brands of GPU are not fully supported by the main deep learning libraries). However, we don’t recommend you buy one; in fact, even if you already have one, we don’t suggest you use it just yet! Setting up a computer takes time and energy, and you want all your energy to focus on deep learning right now. Therefore, we instead suggest you rent access to a computer that already has everything you need preinstalled and ready to go. Costs can be as little as US$0.25 per hour while you’re using it, and some options are even free.

jargon: Graphics Processing Unit (GPU): Also known as a graphics card. A special kind of processor in your computer that can handle thousands of single tasks at the same time, especially designed for displaying 3D environments on a computer for playing games. These same basic tasks are very similar to what neural networks do, such that GPUs can run neural networks hundreds of times faster than regular CPUs. All modern computers contain a GPU, but few contain the right kind of GPU necessary for deep learning.

The best choice of GPU servers to use with this book will change over time, as companies come and go and prices change. We maintain a list of our recommended options on the book’s website, so go there now and follow the instructions to get connected to a GPU deep learning server. Don’t worry, it only takes about two minutes to get set up on most platforms, and many don’t even require any payment, or even a credit card, to get started.

A: My two cents: heed this advice! If you like computers you will be tempted to set up your own box. Beware! It is feasible but surprisingly involved and distracting. There is a good reason this book is not titled, Everything You Ever Wanted to Know About Ubuntu System Administration, NVIDIA Driver Installation, apt-get, conda, pip, and Jupyter Notebook Configuration. That would be a book of its own. Having designed and deployed our production machine learning infrastructure at work, I can testify it has its satisfactions, but it is as unrelated to modeling as maintaining an airplane is to flying one.

Each option shown on the website includes a tutorial; after completing the tutorial, you will end up with a screen looking like <>.

Initial view of Jupyter Notebook

You are now ready to run your first Jupyter notebook!

jargon: Jupyter Notebook: A piece of software that allows you to include formatted text, code, images, videos, and much more, all within a single interactive document. Jupyter received the highest honor for software, the ACM Software System Award, thanks to its wide use and enormous impact in many academic fields and in industry. Jupyter Notebook is the software most widely used by data scientists for developing and interacting with deep learning models.

Running Your First Notebook

The notebooks are labeled by chapter and then by notebook number, so that they are in the same order as they are presented in this book. So, the very first notebook you will see listed is the notebook that you need to use now. You will be using this notebook to train a model that can recognize dog and cat photos. To do this, you’ll be downloading a dataset of dog and cat photos, and using that to train a model. A dataset is simply a bunch of data—it could be images, emails, financial indicators, sounds, or anything else. There are many datasets made freely available that are suitable for training models. Many of these datasets are created by academics to help advance research, many are made available for competitions (there are competitions where data scientists can compete to see who has the most accurate model!), and some are by-products of other processes (such as financial filings).

note: Full and Stripped Notebooks: There are two folders containing different versions of the notebooks. The full folder contains the exact notebooks used to create the book you’re reading now, with all the prose and outputs. The stripped version has the same headings and code cells, but all outputs and prose have been removed. After reading a section of the book, we recommend working through the stripped notebooks, with the book closed, and seeing if you can figure out what each cell will show before you execute it. Also try to recall what the code is demonstrating.

To open a notebook, just click on it. The notebook will open, and it will look something like <> (note that there may be slight differences in details across different platforms; you can ignore those differences).

An example of a notebook

A notebook consists of cells. There are two main types of cell:

  • Cells containing formatted text, images, and so forth. These use a format called markdown, which you will learn about soon.
  • Cells containing code that can be executed, and outputs will appear immediately underneath (which could be plain text, tables, images, animations, sounds, or even interactive applications).

Jupyter notebooks can be in one of two modes: edit mode or command mode. In edit mode, typing on your keyboard enters the letters into the cell in the usual way. However, in command mode, you will not see any flashing cursor, and the keys on your keyboard will each have a special function.

Before continuing, press the Escape key on your keyboard to switch to command mode (if you are already in command mode, this does nothing, so press it now just in case). To see a complete list of all of the functions available, press H; press Escape to remove this help screen. Notice that in command mode, unlike most programs, commands do not require you to hold down Control, Alt, or similar—you simply press the required letter key.

You can make a copy of a cell by pressing C (the cell needs to be selected first, indicated with an outline around it; if it is not already selected, click on it once). Then press V to paste a copy of it.

Click on the cell that begins with the line “# CLICK ME” to select it. The first character in that line indicates that what follows is a comment in Python, so it is ignored when executing the cell. The rest of the cell is, believe it or not, a complete system for creating and training a state-of-the-art model for recognizing cats versus dogs. So, let’s train it now! To do so, just press Shift-Enter on your keyboard, or press the Play button on the toolbar. Then wait a few minutes while the following things happen:

  1. A dataset called the Oxford-IIIT Pet Dataset that contains 7,349 images of cats and dogs from 37 different breeds will be downloaded from the fast.ai datasets collection to the GPU server you are using, and will then be extracted.
  2. A pretrained model that has already been trained on 1.3 million images, using a competition-winning model, will be downloaded from the internet.
  3. The pretrained model will be fine-tuned using the latest advances in transfer learning, to create a model that is specially customized for recognizing dogs and cats.

The first two steps only need to be run once on your GPU server. If you run the cell again, it will use the dataset and model that have already been downloaded, rather than downloading them again. Let’s take a look at the contents of the cell, and the results (<>):

In [ ]:

    #id first_training
    #caption Results from the first training
    # CLICK ME
    from fastai.vision.all import *
    path = untar_data(URLs.PETS)/'images'
    def is_cat(x): return x[0].isupper()
    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2, seed=42,
        label_func=is_cat, item_tfms=Resize(224))
    learn = cnn_learner(dls, resnet34, metrics=error_rate)
    learn.fine_tune(1)

epoch  train_loss  valid_loss  error_rate  time
0      0.180385    0.023942    0.006766    00:16

epoch  train_loss  valid_loss  error_rate  time
0      0.056023    0.007580    0.004060    00:20

You will probably not see exactly the same results that are in the book. There are a lot of sources of small random variation involved in training models. We generally see an error rate of well under 0.02 in this example, however.

important: Training Time: Depending on your network speed, it might take a few minutes to download the pretrained model and dataset. Running fine_tune might take a minute or so. Often models in this book take a few minutes to train, as will your own models, so it’s a good idea to come up with good techniques to make the most of this time. For instance, keep reading the next section while your model trains, or open up another notebook and use it for some coding experiments.

Sidebar: This Book Was Written in Jupyter Notebooks

We wrote this book using Jupyter notebooks, so for nearly every chart, table, and calculation in this book, we’ll be showing you the exact code required to replicate it yourself. That’s why very often in this book, you will see some code immediately followed by a table, a picture or just some text. If you go on the book’s website you will find all the code, and you can try running and modifying every example yourself.

You just saw how a cell that outputs a table looks inside the book. Here is an example of a cell that outputs text:

In [ ]:

    1+1

Out[ ]:

    2

Jupyter will always print or show the result of the last line (if there is one). For instance, here is an example of a cell that outputs an image:

In [ ]:

    img = PILImage.create(image_cat())
    img.to_thumb(192)

Out[ ]:

[Image: thumbnail of a cat photo]

End sidebar

So, how do we know if this model is any good? In the last column of the table you can see the error rate, which is the proportion of images that were incorrectly identified. The error rate serves as our metric—our measure of model quality, chosen to be intuitive and comprehensible. As you can see, the model is nearly perfect, even though the training time was only a few seconds (not including the one-time downloading of the dataset and the pretrained model). In fact, the accuracy you’ve achieved already is far better than anybody had ever achieved just 10 years ago!
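As a rough illustration of what the error rate measures, the sketch below computes it by hand in plain Python. The function name `error_rate_sketch` and the toy labels are made up for this example; fastai's own `error_rate` metric works on tensors of predictions and targets rather than on Python lists.

```python
def error_rate_sketch(predictions, targets):
    """Proportion of predictions that differ from the correct labels."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    wrong = sum(p != t for p, t in zip(predictions, targets))
    return wrong / len(targets)

# Toy data (hypothetical): True means "cat", False means "dog".
targets     = [True, False, True, True,  False, False, True, False]
predictions = [True, False, True, False, False, False, True, False]

rate = error_rate_sketch(predictions, targets)
accuracy = 1 - rate   # accuracy is simply the complement of the error rate
print(f"error rate: {rate:.3f}, accuracy: {accuracy:.3f}")
```

Here one of the eight toy predictions is wrong, so the error rate is 1/8 and the accuracy is 7/8.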

Finally, let’s check that this model actually works. Go and get a photo of a dog, or a cat; if you don’t have one handy, just search Google Images and download an image that you find there. Now execute the cell with uploader defined. It will output a button you can click, so you can select the image you want to classify:

In [ ]:

    #hide_output
    uploader = widgets.FileUpload()
    uploader

An upload button

Now you can pass the uploaded file to the model. Make sure that it is a clear photo of a single dog or a cat, and not a line drawing, cartoon, or similar. The notebook will tell you whether it thinks it is a dog or a cat, and how confident it is. Hopefully, you’ll find that your model did a great job:

In [ ]:

    #hide
    # For the book, we can't actually click an upload button, so we fake it
    uploader = SimpleNamespace(data = ['images/chapter1_cat_example.jpg'])

In [ ]:

    img = PILImage.create(uploader.data[0])
    is_cat,_,probs = learn.predict(img)
    print(f"Is this a cat?: {is_cat}.")
    print(f"Probability it's a cat: {probs[1].item():.6f}")

Is this a cat?: True.
Probability it's a cat: 1.000000

Congratulations on your first classifier!

But what does this mean? What did you actually do? In order to explain this, let’s zoom out again to take in the big picture.

What Is Machine Learning?

Your classifier is a deep learning model. As was already mentioned, deep learning models use neural networks, which date from the 1950s but have only become powerful in recent years thanks to new advances.

Another key piece of context is that deep learning is just a modern area in the more general discipline of machine learning. To understand the essence of what you did when you trained your own classification model, you don’t need to understand deep learning. It is enough to see how your model and your training process are examples of the concepts that apply to machine learning in general.

So in this section, we will describe what machine learning is. We will look at the key concepts, and show how they can be traced back to the original essay that introduced them.

Machine learning is, like regular programming, a way to get computers to complete a specific task. But how would we use regular programming to do what we just did in the last section: recognize dogs versus cats in photos? We would have to write down for the computer the exact steps necessary to complete the task.

Normally, it’s easy enough for us to write down the steps to complete a task when we’re writing a program. We just think about the steps we’d take if we had to do the task by hand, and then we translate them into code. For instance, we can write a function that sorts a list. In general, we’d write a function that looks something like <> (where inputs might be an unsorted list, and results a sorted list).

In [ ]:

    #hide_input
    #caption A traditional program
    #id basic_program
    #alt Pipeline inputs, program, results
    gv('''program[shape=box3d width=1 height=0.7]
    inputs->program->results''')

Out[ ]:

[Figure: A traditional program (inputs -> program -> results)]

But for recognizing objects in a photo that’s a bit tricky; what are the steps we take when we recognize an object in a picture? We really don’t know, since it all happens in our brain without us being consciously aware of it!

Right back at the dawn of computing, in 1949, an IBM researcher named Arthur Samuel started working on a different way to get computers to complete tasks, which he called machine learning. In his classic 1962 essay “Artificial Intelligence: A Frontier of Automation”, he wrote:

: Programming a computer for such computations is, at best, a difficult task, not primarily because of any inherent complexity in the computer itself but, rather, because of the need to spell out every minute step of the process in the most exasperating detail. Computers, as any programmer will tell you, are giant morons, not giant brains.

His basic idea was this: instead of telling the computer the exact steps required to solve a problem, show it examples of the problem to solve, and let it figure out how to solve it itself. This turned out to be very effective: by 1961 his checkers-playing program had learned so much that it beat the Connecticut state champion! Here’s how he described his idea (from the same essay as above):

: Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize the performance. We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would “learn” from its experience.

There are a number of powerful concepts embedded in this short statement:

  • The idea of a “weight assignment”
  • The fact that every weight assignment has some “actual performance”
  • The requirement that there be an “automatic means” of testing that performance
  • The need for a “mechanism” (i.e., another automatic process) for improving the performance by changing the weight assignments

Let us take these concepts one by one, in order to understand how they fit together in practice. First, we need to understand what Samuel means by a weight assignment.

Weights are just variables, and a weight assignment is a particular choice of values for those variables. The program’s inputs are values that it processes in order to produce its results—for instance, taking image pixels as inputs, and returning the classification “dog” as a result. The program’s weight assignments are other values that define how the program will operate.

Since they will affect the program, they are in a sense another kind of input, so we will update our basic picture in <> and replace it with <> in order to take this into account.

In [ ]:

    #hide_input
    #caption A program using weight assignment
    #id weight_assignment
    gv('''model[shape=box3d width=1 height=0.7]
    inputs->model->results; weights->model''')

Out[ ]:

[Figure: A program using weight assignment (inputs and weights both feed the model, which produces results)]

We’ve changed the name of our box from program to model. This is to follow modern terminology and to reflect that the model is a special kind of program: it’s one that can do many different things, depending on the weights. It can be implemented in many different ways. For instance, in Samuel’s checkers program, different values of the weights would result in different checkers-playing strategies.
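To see what "depending on the weights" means in code, here is a minimal sketch of our own, not Samuel's program. The function `model`, the feature values, and both weight assignments are invented for illustration. The same code and the same inputs give different results purely because the weights differ.

```python
def model(inputs, weights):
    """A tiny 'model': a weighted sum of the inputs, thresholded into a label."""
    score = sum(x * w for x, w in zip(inputs, weights))
    return "cat" if score > 0 else "dog"

inputs = [0.5, -1.0, 2.0]        # made-up feature values

weights_a = [1.0, 1.0, 1.0]      # one weight assignment
weights_b = [1.0, 1.0, -1.0]     # a different weight assignment

result_a = model(inputs, weights_a)   # score is 0.5 - 1.0 + 2.0 = 1.5
result_b = model(inputs, weights_b)   # score is 0.5 - 1.0 - 2.0 = -2.5
print(result_a, result_b)
```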

(By the way, what Samuel called “weights” are most generally referred to as model parameters these days, in case you have encountered that term. The term weights is reserved for a particular type of model parameter.)

Next, Samuel said we need an automatic means of testing the effectiveness of any current weight assignment in terms of actual performance. In the case of his checkers program, the “actual performance” of a model would be how well it plays. And you could automatically test the performance of two models by setting them to play against each other, and seeing which one usually wins.

Finally, he says we need a mechanism for altering the weight assignment so as to maximize the performance. For instance, we could look at the difference in weights between the winning model and the losing model, and adjust the weights a little further in the winning direction.

We can now see why he said that such a procedure could be made entirely automatic and… a machine so programmed would “learn” from its experience. Learning would become entirely automatic when the adjustment of the weights was also automatic—when instead of us improving a model by adjusting its weights manually, we relied on an automated mechanism that produced adjustments based on performance.

<> shows the full picture of Samuel’s idea of training a machine learning model.

In [ ]:

    #hide_input
    #caption Training a machine learning model
    #id training_loop
    #alt The basic training loop
    gv('''ordering=in
    model[shape=box3d width=1 height=0.7]
    inputs->model->results; weights->model; results->performance
    performance->weights[constraint=false label=update]''')

Out[ ]:

[Figure: Training a machine learning model (inputs and weights feed the model, which produces results; results determine performance, which is used to update the weights)]

Notice the distinction between the model’s results (e.g., the moves in a checkers game) and its performance (e.g., whether it wins the game, or how quickly it wins).
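The loop Samuel describes can be sketched in a few lines of plain Python. This is only a toy under stated assumptions: the one-weight `model`, the `performance` function, the example data, and the "try a small random change and keep it if performance improves" mechanism are all invented here for illustration. It is not how Samuel's checkers program or a modern neural network is trained.

```python
import random

def model(x, weight):
    """The model: its result depends on both the input and the weight."""
    return weight * x

def performance(weight, examples):
    """Automatic test of a weight: higher is better (negative squared error)."""
    return -sum((model(x, weight) - y) ** 2 for x, y in examples)

# Hypothetical examples generated from the rule y = 2 * x.
examples = [(x, 2 * x) for x in range(1, 6)]

rng = random.Random(0)
weight = 0.0                                      # initial weight assignment
best = performance(weight, examples)

for step in range(2000):
    candidate = weight + rng.uniform(-0.1, 0.1)   # mechanism: try a small change
    score = performance(candidate, examples)
    if score > best:                              # keep it only if performance improved
        weight, best = candidate, score

print(f"learned weight: {weight:.3f}")
```

Nobody tells the program that the answer is 2. It only ever sees examples and a performance score, yet the weight drifts toward the value that explains the examples.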

Also note that once the model is trained—that is, once we’ve chosen our final, best, favorite weight assignment—then we can think of the weights as being part of the model, since we’re not varying them any more.

Therefore, actually using a model after it’s trained looks like <>.

In [ ]:

    #hide_input
    #caption Using a trained model as a program
    #id using_model
    gv('''model[shape=box3d width=1 height=0.7]
    inputs->model->results''')

Out[ ]:

[Figure: Using a trained model as a program (inputs -> model -> results)]

This looks identical to our original diagram in <>, just with the word program replaced with model. This is an important insight: a trained model can be treated just like a regular computer program.
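One way to picture this in code is to "freeze" the weights into the model once training is over. The sketch below uses `functools.partial` from the Python standard library to do that. The `model` function and the weight values are made up, and this only illustrates the idea; it is not how fastai stores a trained model.

```python
from functools import partial

def model(inputs, weights):
    """Generic model: the output depends on the inputs and on the weights."""
    return sum(x * w for x, w in zip(inputs, weights))

# Pretend training has finished and produced these weights (made-up values).
trained_weights = [0.5, -0.25, 1.0]

# Freeze the weights into the model: what remains takes only inputs.
trained_model = partial(model, weights=trained_weights)

result = trained_model([1.0, 2.0, 3.0])   # 0.5 - 0.5 + 3.0
print(result)
```

From the caller's point of view, `trained_model` is now just an ordinary function from inputs to results.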

jargon: Machine Learning: The training of programs developed by allowing a computer to learn from its experience, rather than through manually coding the individual steps.

What Is a Neural Network?

It’s not too hard to imagine what the model might look like for a checkers program. There might be a range of checkers strategies encoded, and some kind of search mechanism, and then the weights could vary how strategies are selected, what parts of the board are focused on during a search, and so forth. But it’s not at all obvious what the model might look like for an image recognition program, or for understanding text, or for many other interesting problems we might imagine.

What we would like is some kind of function that is so flexible that it could be used to solve any given problem, just by varying its weights. Amazingly enough, this function actually exists! It’s the neural network, which we already discussed. That is, if you regard a neural network as a mathematical function, it turns out to be a function which is extremely flexible depending on its weights. A mathematical proof called the universal approximation theorem shows that this function can solve any problem to any level of accuracy, in theory. The fact that neural networks are so flexible means that, in practice, they are often a suitable kind of model, and you can focus your effort on the process of training them—that is, of finding good weight assignments.

But what about that process? One could imagine that you might need to find a new “mechanism” for automatically updating weights for every problem. This would be laborious. What we’d like here as well is a completely general way to update the weights of a neural network, to make it improve at any given task. Conveniently, this also exists!

This is called stochastic gradient descent (SGD). We’ll see how neural networks and SGD work in detail in <>, as well as explaining the universal approximation theorem. For now, however, we will instead use Samuel’s own words: We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would “learn” from its experience.

J: Don’t worry, neither SGD nor neural nets are mathematically complex. Both nearly entirely rely on addition and multiplication to do their work (but they do a lot of addition and multiplication!). The main reaction we hear from students when they see the details is: “Is that all it is?”

In other words, to recap, a neural network is a particular kind of machine learning model, which fits right in to Samuel’s original conception. Neural networks are special because they are highly flexible, which means they can solve an unusually wide range of problems just by finding the right weights. This is powerful, because stochastic gradient descent provides us a way to find those weight values automatically.
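To give a first taste of how simple the arithmetic is, here is a minimal sketch of stochastic gradient descent on a model with just two weights. It assumes a straight-line model, a squared-error measure of performance, and made-up data, learning rate, and step count. It is a hand-written illustration, not the implementation used by fastai or PyTorch.

```python
import random

# Hypothetical training data drawn from y = 3 * x + 1 (no noise, for clarity).
data = [(x, 3 * x + 1) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

w, b = 0.0, 0.0              # weights, starting from an arbitrary guess
learning_rate = 0.05
rng = random.Random(0)

for step in range(2000):
    x, y = rng.choice(data)          # "stochastic": one random example per step
    prediction = w * x + b
    error = prediction - y
    # Gradients of the squared error (prediction - y)**2 with respect to w and b.
    grad_w = 2 * error * x
    grad_b = 2 * error
    # Move each weight a little in the direction that reduces the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")
```

Every step is nothing more than a few multiplications, additions, and subtractions, repeated many times.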

Having zoomed out, let’s now zoom back in and revisit our image classification problem using Samuel’s framework.

Our inputs are the images. Our weights are the weights in the neural net. Our model is a neural net. Our results are the values that are calculated by the neural net, like “dog” or “cat.”

What about the next piece, an automatic means of testing the effectiveness of any current weight assignment in terms of actual performance? Determining “actual performance” is easy enough: we can simply define our model’s performance as its accuracy at predicting the correct answers.

Putting this all together, and assuming that SGD is our mechanism for updating the weight assignments, we can see how our image classifier is a machine learning model, much like Samuel envisioned.

A Bit of Deep Learning Jargon

Samuel was working in the 1960s, and since then terminology has changed. Here is the modern deep learning terminology for all the pieces we have discussed:

  • The functional form of the model is called its architecture (but be careful—sometimes people use model as a synonym of architecture, so this can get confusing).
  • The weights are called parameters.
  • The predictions are calculated from the independent variable, which is the data not including the labels.
  • The results of the model are called predictions.
  • The measure of performance is called the loss.
  • The loss depends not only on the predictions, but also the correct labels (also known as targets or the dependent variable); e.g., “dog” or “cat.”

After making these changes, our diagram in <> looks like <>.

In [ ]:

    #hide_input
    #caption Detailed training loop
    #id detailed_loop
    gv('''ordering=in
    model[shape=box3d width=1 height=0.7 label=architecture]
    inputs->model->predictions; parameters->model; labels->loss; predictions->loss
    loss->parameters[constraint=false label=update]''')

Out[ ]:

[Figure: Detailed training loop (inputs and parameters feed the architecture, which produces predictions; predictions and labels determine the loss, which is used to update the parameters)]
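The vocabulary above maps directly onto variable names. In this sketch the names follow the jargon list, while the straight-line `architecture`, the mean-squared-error `loss_fn`, and all the numbers are assumptions chosen only to keep the example tiny.

```python
def architecture(x, parameters):
    """Functional form of the model: here, a straight line."""
    slope, intercept = parameters
    return slope * x + intercept

def loss_fn(predictions, labels):
    """Mean squared error: depends on the predictions AND the correct labels."""
    return sum((p - t) ** 2 for p, t in zip(predictions, labels)) / len(labels)

independent_variable = [1.0, 2.0, 3.0]   # the data, not including the labels
labels = [2.0, 4.0, 6.0]                 # targets, i.e. the dependent variable
parameters = (1.5, 0.0)                  # current parameter values (made up)

predictions = [architecture(x, parameters) for x in independent_variable]
loss = loss_fn(predictions, labels)
print(predictions, loss)
```

Training would then consist of repeatedly adjusting `parameters` so that `loss` goes down.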

Limitations Inherent To Machine Learning

From this picture we can now see some fundamental things about training a deep learning model:

  • A model cannot be created without data.
  • A model can only learn to operate on the patterns seen in the input data used to train it.
  • This learning approach only creates predictions, not recommended actions.
  • It’s not enough to just have examples of input data; we need labels for that data too (e.g., pictures of dogs and cats aren’t enough to train a model; we need a label for each one, saying which ones are dogs, and which are cats).

Generally speaking, we’ve seen that most organizations that say they don’t have enough data actually mean they don’t have enough labeled data. If any organization is interested in doing something in practice with a model, then presumably they have some inputs they plan to run their model against. And presumably they’ve been doing that some other way for a while (e.g., manually, or with some heuristic program), so they have data from those processes! For instance, a radiology practice will almost certainly have an archive of medical scans (since they need to be able to check how their patients are progressing over time), but those scans may not have structured labels containing a list of diagnoses or interventions (since radiologists generally create free-text natural language reports, not structured data). We’ll be discussing labeling approaches a lot in this book, because it’s such an important issue in practice.

Since these kinds of machine learning models can only make predictions (i.e., attempt to replicate labels), this can result in a significant gap between organizational goals and model capabilities. For instance, in this book you’ll learn how to create a recommendation system that can predict what products a user might purchase. This is often used in e-commerce, such as to customize products shown on a home page by showing the highest-ranked items. But such a model is generally created by looking at a user and their buying history (inputs) and what they went on to buy or look at (labels), which means that the model is likely to tell you about products the user already has or already knows about, rather than new products that they are most likely to be interested in hearing about. That’s very different to what, say, an expert at your local bookseller might do, where they ask questions to figure out your taste, and then tell you about authors or series that you’ve never heard of before.

Another critical insight comes from considering how a model interacts with its environment. This can create feedback loops, as described here:

  • A predictive policing model is created based on where arrests have been made in the past. In practice, this is not actually predicting crime, but rather predicting arrests, and is therefore partially simply reflecting biases in existing policing processes.
  • Law enforcement officers then might use that model to decide where to focus their police activity, resulting in increased arrests in those areas.
  • Data on these additional arrests would then be fed back in to retrain future versions of the model.

This is a positive feedback loop, where the more the model is used, the more biased the data becomes, making the model even more biased, and so forth.
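A deliberately simplified simulation can make this loop tangible. Everything in it is an assumption invented for illustration: two districts with identical underlying incident rates, a slightly skewed arrest history, a "model" that just picks the district with the most recorded arrests, and a policy that sends 80% of patrols there. It is a toy, not a description of any real system.

```python
# Both districts have the SAME underlying incident rate (an assumption of this toy).
true_rate = {"A": 0.1, "B": 0.1}

# Historical arrest records start slightly skewed toward district A (made-up numbers).
arrests = {"A": 55.0, "B": 45.0}
patrols_per_round = 100

share_a_history = []
for round_number in range(10):
    total = sum(arrests.values())
    share_a_history.append(arrests["A"] / total)

    # "Model": predict the hot spot as the district with the most past arrests.
    hot_spot = max(arrests, key=arrests.get)

    # Policy (assumed): send 80% of patrols to the predicted hot spot.
    for district in arrests:
        patrol_share = 0.8 if district == hot_spot else 0.2
        # New arrests scale with patrol presence, not with any true difference.
        arrests[district] += patrols_per_round * patrol_share * true_rate[district]

print([round(s, 3) for s in share_a_history])
```

Although the two districts are identical by construction, district A's share of recorded arrests rises every round, because the data being collected reflects where the patrols were sent rather than where incidents actually occur.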

Feedback loops can also create problems in commercial settings. For instance, a video recommendation system might be biased toward recommending content consumed by the biggest watchers of video (e.g., conspiracy theorists and extremists tend to watch more online video content than the average), resulting in those users increasing their video consumption, resulting in more of those kinds of videos being recommended. We’ll consider this topic more in detail in <>.

Now that you have seen the base of the theory, let’s go back to our code example and see in detail how the code corresponds to the process we just described.

How Our Image Recognizer Works

Let’s see just how our image recognizer code maps to these ideas. We’ll put each line into a separate cell, and look at what each one is doing (we won’t explain every detail of every parameter yet, but will give a description of the important bits; full details will come later in the book).

The first line imports all of the fastai.vision library.

    from fastai.vision.all import *

This gives us all of the functions and classes we will need to create a wide variety of computer vision models.

J: A lot of Python coders recommend avoiding importing a whole library like this (using the import * syntax), because in large software projects it can cause problems. However, for interactive work such as in a Jupyter notebook, it works great. The fastai library is specially designed to support this kind of interactive use, and it will only import the necessary pieces into your environment.

The second line downloads a standard dataset from the fast.ai datasets collection (if not previously downloaded) to your server, extracts it (if not previously extracted), and returns a Path object with the extracted location:

    path = untar_data(URLs.PETS)/'images'

S: Throughout my time studying at fast.ai, and even still today, I’ve learned a lot about productive coding practices. The fastai library and fast.ai notebooks are full of great little tips that have helped make me a better programmer. For instance, notice that the fastai library doesn’t just return a string containing the path to the dataset, but a Path object. This is a really useful class from the Python 3 standard library that makes accessing files and directories much easier. If you haven’t come across it before, be sure to check out its documentation or a tutorial and try it out. Note that the website (https://book.fast.ai) contains links to recommended tutorials for each chapter. I’ll keep letting you know about little coding tips I’ve found useful as we come across them.
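If you have not used path objects before, this short sketch shows a few of the conveniences from the standard library's `pathlib` module. The directory `/data/oxford-iiit-pet` is a made-up location (untar_data decides the real one on your machine), and we use `PurePosixPath` here only so the output looks the same on every operating system; in a notebook you would normally work with `Path` itself.

```python
from pathlib import PurePosixPath

# Hypothetical dataset location, chosen only for this example.
base = PurePosixPath("/data/oxford-iiit-pet")

images = base / "images"                      # the / operator joins path components
example = images / "great_pyrenees_173.jpg"

print(images)            # the joined path
print(example.name)      # file name with extension
print(example.stem)      # file name without extension
print(example.suffix)    # extension only
print(example.parent == images)
```

This is why the second line of our training code can write `untar_data(URLs.PETS)/'images'`: dividing a path by a string yields the path of the subdirectory.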

In the third line we define a function, is_cat, which labels cats based on a filename rule provided by the dataset creators:

    def is_cat(x): return x[0].isupper()

We use that function in the fourth line, which tells fastai what kind of dataset we have, and how it is structured:

    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2, seed=42,
        label_func=is_cat, item_tfms=Resize(224))

There are various classes for different kinds of deep learning datasets and problems—here we’re using ImageDataLoaders. The first part of the class name will generally be the type of data you have, such as image, or text.

The other important piece of information that we have to tell fastai is how to get the labels from the dataset. Computer vision datasets are normally structured in such a way that the label for an image is part of the filename, or path—most commonly the parent folder name. fastai comes with a number of standardized labeling methods, and ways to write your own. Here we’re telling fastai to use the is_cat function we just defined.

Finally, we define the Transforms that we need. A Transform contains code that is applied automatically during training; fastai includes many predefined Transforms, and adding new ones is as simple as creating a Python function. There are two kinds: item_tfms are applied to each item (in this case, each item is resized to a 224-pixel square), while batch_tfms are applied to a batch of items at a time using the GPU, so they’re particularly fast (we’ll see many examples of these throughout this book).

Why 224 pixels? This is the standard size for historical reasons (old pretrained models require this size exactly), but you can pass pretty much anything. If you increase the size, you’ll often get a model with better results (since it will be able to focus on more details), but at the price of speed and memory consumption; the opposite is true if you decrease the size.
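The speed and memory trade-off is easier to see with a little arithmetic. The sketch below only counts the input values in a square RGB image. It is not a measurement of any particular model, and the helper name is our own. Because the count grows with the square of the side length, doubling the size quadruples the amount of data flowing into the network.

```python
def values_per_image(side, channels=3):
    """Number of input values in a square image with the given side length."""
    return side * side * channels

base = values_per_image(224)
for side in (112, 224, 448):
    n = values_per_image(side)
    print(f"{side:>3} px: {n:>7,} values ({n / base:.2f}x the 224 px input)")
```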

Note: Classification and Regression: classification and regression have very specific meanings in machine learning. These are the two main types of model that we will be investigating in this book. A classification model is one which attempts to predict a class, or category. That is, it’s predicting from a number of discrete possibilities, such as “dog” or “cat.” A regression model is one which attempts to predict one or more numeric quantities, such as a temperature or a location. Sometimes people use the word regression to refer to a particular kind of model called a linear regression model; this is a bad practice, and we won’t be using that terminology in this book!

The Pet dataset contains 7,390 pictures of dogs and cats, consisting of 37 different breeds. Each image is labeled using its filename: for instance the file great_pyrenees_173.jpg is the 173rd example of an image of a Great Pyrenees breed dog in the dataset. The filenames start with an uppercase letter if the image is a cat, and a lowercase letter otherwise. We have to tell fastai how to get labels from the filenames, which we do by calling from_name_func (which means that labels can be extracted using a function applied to the filename), and passing is_cat, which returns x[0].isupper(), which evaluates to True if the first letter is uppercase (i.e., it’s a cat).
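You can try the labeling rule on its own, without any images. In this small sketch the first filename is the one mentioned above; the second is a made-up example of a name starting with an uppercase letter.

```python
def is_cat(x):
    # Rule from the dataset creators: cat filenames start with an uppercase letter.
    return x[0].isupper()

# The second filename is hypothetical, chosen only to show the uppercase case.
for fname in ("great_pyrenees_173.jpg", "Birman_12.jpg"):
    print(fname, "->", "cat" if is_cat(fname) else "dog")
```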

The most important parameter to mention here is valid_pct=0.2. This tells fastai to hold out 20% of the data and not use it for training the model at all. This 20% of the data is called the validation set; the remaining 80% is called the training set. The validation set is used to measure the accuracy of the model. By default, the 20% that is held out is selected randomly. The parameter seed=42 sets the random seed to the same value every time we run this code, which means we get the same validation set every time we run it—this way, if we change our model and retrain it, we know that any differences are due to the changes to the model, not due to having a different random validation set.
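To see why a fixed seed gives the same validation set on every run, here is a plain-Python sketch of a seeded random split. It illustrates the idea only and is not fastai’s own implementation; the function name and the stand-in filenames are assumptions made for the example.

```python
import random

def random_split(items, valid_pct=0.2, seed=None):
    """Shuffle indices with a seeded generator and hold out valid_pct of them."""
    idxs = list(range(len(items)))
    random.Random(seed).shuffle(idxs)
    n_valid = int(valid_pct * len(items))
    valid = [items[i] for i in idxs[:n_valid]]
    train = [items[i] for i in idxs[n_valid:]]
    return train, valid

files = [f"img_{i}.jpg" for i in range(10)]  # stand-in filenames
train_a, valid_a = random_split(files, valid_pct=0.2, seed=42)
train_b, valid_b = random_split(files, valid_pct=0.2, seed=42)

print(len(train_a), len(valid_a))  # sizes of the two sets
print(valid_a == valid_b)          # same seed, same validation set
```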

fastai will always show you your model’s accuracy using only the validation set, never the training set. This is absolutely critical, because if you train a large enough model for a long enough time, it will eventually memorize the label of every item in your dataset! The result will not actually be a useful model, because what we care about is how well our model works on previously unseen images. That is always our goal when creating a model: for it to be useful on data that the model only sees in the future, after it has been trained.
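A toy example makes the danger of memorization concrete. In the sketch below, which uses made-up data and has nothing to do with images, one "model" simply stores every training label, while another learns a single threshold. Both are perfect on the training set, but only the second does well on held-out items.

```python
# Toy data: the true rule is "label is True when x >= 50".
data = [(x, x >= 50) for x in range(100)]
valid = [(x, y) for x, y in data if x % 5 == 0]  # 20 held-out items
train = [(x, y) for x, y in data if x % 5 != 0]  # 80 training items

# Model A memorizes every training label and guesses False for anything else.
memory = dict(train)
def memorizer(x):
    return memory.get(x, False)

# Model B learns one threshold halfway between the two classes it has seen.
threshold = (max(x for x, y in train if not y) + min(x for x, y in train if y)) / 2
def rule(x):
    return x >= threshold

def accuracy(model, items):
    """Fraction of items for which the model's output matches the label."""
    return sum(model(x) == y for x, y in items) / len(items)

for name, model in (("memorizer", memorizer), ("rule", rule)):
    print(name, "train:", accuracy(model, train), "valid:", accuracy(model, valid))
```

Judged on training accuracy alone, the two models look identical; the difference only shows up on the validation set.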

Even when your model has not fully memorized all your data, earlier on in training it may have memorized certain parts of it. As a result, the longer you train for, the better your accuracy will get on the training set; the validation set accuracy will also improve for a while, but eventually it will start getting worse as the model starts to memorize the training set, rather than finding generalizable underlying patterns in the data. When this happens, we say that the model is overfitting.

The following figure shows what happens when you overfit, using a simplified example where we have just one parameter, and some randomly generated data based on the function x**2. As you can see, although the predictions in the overfit model are accurate for data near the observed data points, they are way off when outside of that range.

Example of overfitting

Overfitting is the single most important and challenging issue that machine learning practitioners face when training, whatever the algorithm. As you will see, it is very easy to create a model that does a great job at making predictions on the exact data it has been trained on, but it is much harder to make accurate predictions on data the model has never seen before. And of course, this is the data that will actually matter in practice. For instance, if you create a handwritten digit classifier (as we will very soon!) and use it to recognize numbers written on checks, then you are never going to see any of the numbers that the model was trained on; every check will have slightly different variations of writing to deal with. You will learn many methods to avoid overfitting in this book. However, you should only use those methods after you have confirmed that overfitting is actually occurring (i.e., you have actually observed the validation accuracy getting worse during training). We often see practitioners using overfitting avoidance techniques even when they have enough data that they don’t need to do so, ending up with a model that may be less accurate than what they could have achieved.

important: Validation Set: When you train a model, you must always have both a training set and a validation set, and must measure the accuracy of your model only on the validation set. If you train for too long, with not enough data, you will see the accuracy of your model start to get worse; this is called overfitting. fastai defaults valid_pct to 0.2, so even if you forget, fastai will create a validation set for you!

The fifth line of the code training our image recognizer tells fastai to create a convolutional neural network (CNN) and specifies what architecture to use (i.e. what kind of model to create), what data we want to train it on, and what metric to use:

    learn = cnn_learner(dls, resnet34, metrics=error_rate)

Why a CNN? It’s the current state-of-the-art approach to creating computer vision models. We’ll be learning all about how CNNs work in this book. Their structure is inspired by how the human vision system works.

There are many different architectures in fastai, which we will introduce in this book (as well as discussing how to create your own). Most of the time, however, picking an architecture isn’t a very important part of the deep learning process. It’s something that academics love to talk about, but in practice it is unlikely to be something you need to spend much time on. There are some standard architectures that work most of the time, and in this case we’re using one called ResNet that we’ll be talking a lot about during the book; it is both fast and accurate for many datasets and problems. The 34 in resnet34 refers to the number of layers in this variant of the architecture (other options are 18, 50, 101, and 152). Models using architectures with more layers take longer to train, and are more prone to overfitting (i.e. you can’t train them for as many epochs before the accuracy on the validation set starts getting worse). On the other hand, when using more data, they can be quite a bit more accurate.

What is a metric? A metric is a function that measures the quality of the model’s predictions using the validation set, and will be printed at the end of each epoch. In this case, we’re using error_rate, which is a function provided by fastai that does just what it says: tells you what percentage of images in the validation set are being classified incorrectly. Another common metric for classification is accuracy (which is just 1.0 - error_rate). fastai provides many more, which will be discussed throughout this book.
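The relationship between the two metrics is easy to check by hand. The following is a plain-Python stand-in rather than fastai’s own error_rate (which works on model outputs, not on label strings), and the labels and predictions are made up for the example.

```python
def error_rate(preds, targets):
    """Fraction of predictions that do not match their target label."""
    wrong = sum(p != t for p, t in zip(preds, targets))
    return wrong / len(targets)

def accuracy(preds, targets):
    """Accuracy is simply the complement of the error rate."""
    return 1.0 - error_rate(preds, targets)

# Made-up validation labels and predictions: one mistake out of four.
targets = ["cat", "dog", "dog", "cat"]
preds = ["cat", "dog", "cat", "cat"]
print(error_rate(preds, targets), accuracy(preds, targets))
```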

The concept of a metric may remind you of loss, but there is an important distinction. The entire purpose of loss is to define a “measure of performance” that the training system can use to update weights automatically. In other words, a good choice for loss is a choice that is easy for stochastic gradient descent to use. But a metric is defined for human consumption, so a good metric is one that is easy for you to understand, and that hews as closely as possible to what you want the model to do. At times, you might decide that the loss function is a suitable metric, but that is not necessarily the case.
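To make the distinction concrete, here is a small sketch that compares accuracy with cross-entropy, a commonly used classification loss. The choice of cross-entropy and the probabilities are our assumptions for illustration. Both sets of predictions get every item right, so the metric cannot tell them apart, but the loss still rewards the more confident one, which gives training a direction to move in.

```python
import math

def cross_entropy(probs, targets):
    """Mean negative log of the probability assigned to the correct class."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

def accuracy(probs, targets):
    """Fraction of items whose most probable class is the correct one."""
    hits = sum(max(range(len(p)), key=p.__getitem__) == t
               for p, t in zip(probs, targets))
    return hits / len(targets)

targets = [0, 1]                       # class indices: 0 = dog, 1 = cat
hesitant = [[0.6, 0.4], [0.4, 0.6]]    # barely right on both items
confident = [[0.9, 0.1], [0.1, 0.9]]   # clearly right on both items

for name, probs in (("hesitant", hesitant), ("confident", confident)):
    print(name, accuracy(probs, targets), round(cross_entropy(probs, targets), 3))
```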

cnn_learner also has a parameter pretrained, which defaults to True (so it’s used in this case, even though we haven’t specified it), which sets the weights in your model to values that have already been trained by experts to recognize a thousand different categories across 1.3 million photos (using the famous ImageNet dataset). A model that has weights that have already been trained on some other dataset is called a pretrained model. You should nearly always use a pretrained model, because it means that your model, before you’ve even shown it any of your data, is already very capable. And, as you’ll see, in a deep learning model many of these capabilities are things you’ll need, almost regardless of the details of your project. For instance, parts of pretrained models will handle edge, gradient, and color detection, which are needed for many tasks.

When using a pretrained model, cnn_learner will remove the last layer, since that is always specifically customized to the original training task (i.e. ImageNet dataset classification), and replace it with one or more new layers with randomized weights, of an appropriate size for the dataset you are working with. This last part of the model is known as the head.

Using pretrained models is the most important method we have to allow us to train more accurate models, more quickly, with less data, and less time and money. You might think that would mean that using pretrained models would be the most studied area in academic deep learning… but you’d be very, very wrong! The importance of pretrained models is generally not recognized or discussed in most courses, books, or software library features, and is rarely considered in academic papers. As we write this at the start of 2020, things are just starting to change, but it’s likely to take a while. So be careful: most people you speak to will probably greatly underestimate what you can do in deep learning with few resources, because they probably won’t deeply understand how to use pretrained models.

Using a pretrained model for a task different to what it was originally trained for is known as transfer learning. Unfortunately, because transfer learning is so under-studied, few domains have pretrained models available. For instance, there are currently few pretrained models available in medicine, making transfer learning challenging to use in that domain. In addition, it is not yet well understood how to use transfer learning for tasks such as time series analysis.

jargon: Transfer learning: Using a pretrained model for a task different to what it was originally trained for.

The sixth line of our code tells fastai how to fit the model:

    learn.fine_tune(1)

As we’ve discussed, the architecture only describes a template for a mathematical function; it doesn’t actually do anything until we provide values for the millions of parameters it contains.

This is the key to deep learning—determining how to fit the parameters of a model to get it to solve your problem. In order to fit a model, we have to provide at least one piece of information: how many times to look at each image (known as number of epochs). The number of epochs you select will largely depend on how much time you have available, and how long you find it takes in practice to fit your model. If you select a number that is too small, you can always train for more epochs later.

But why is the method called fine_tune, and not fit? fastai actually does have a method called fit, which does indeed fit a model (i.e. look at images in the training set multiple times, each time updating the parameters to make the predictions closer and closer to the target labels). But in this case, we’ve started with a pretrained model, and we don’t want to throw away all those capabilities that it already has. As you’ll learn in this book, there are some important tricks to adapt a pretrained model for a new dataset—a process called fine-tuning.

jargon: Fine-tuning: A transfer learning technique where the parameters of a pretrained model are updated by training for additional epochs using a different task to that used for pretraining.

When you use the fine_tune method, fastai will use these tricks for you. There are a few parameters you can set (which we’ll discuss later), but in the default form shown here, it does two steps:

  1. Use one epoch to fit just those parts of the model necessary to get the new random head to work correctly with your dataset.
  2. Use the number of epochs requested when calling the method to fit the entire model, updating the weights of the later layers (especially the head) faster than the earlier layers (which, as we’ll see, generally don’t require many changes from the pretrained weights).
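The two steps can be pictured as a per-epoch plan of learning rates for each group of layers. The sketch below is purely schematic: the function name, the number of layer groups, the base learning rate, and the ratio between groups are all hypothetical values chosen for illustration, not fastai’s actual defaults or implementation. A rate of 0.0 stands in for a frozen group.

```python
def fine_tune_plan(epochs, n_groups=3, base_lr=1e-3, lr_ratio=100, freeze_epochs=1):
    """Per-epoch learning rates for each layer group (earliest first, head last).

    Requires n_groups >= 2. All default values are illustrative assumptions.
    """
    plan = []
    # Step 1: only the last group (the new head) is trained.
    for _ in range(freeze_epochs):
        plan.append([0.0] * (n_groups - 1) + [base_lr])
    # Step 2: every group is trained, with rates rising geometrically toward the head.
    step = lr_ratio ** (1 / (n_groups - 1))
    lrs = [base_lr / lr_ratio * step ** i for i in range(n_groups)]
    for _ in range(epochs):
        plan.append(list(lrs))
    return plan

plan = fine_tune_plan(epochs=1)
for epoch, lrs in enumerate(plan):
    print(epoch, ["%.0e" % lr for lr in lrs])
```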

The head of a model is the part that is newly added to be specific to the new dataset. An epoch is one complete pass through the dataset. After calling fine_tune, the results after each epoch are printed, showing the epoch number, the training and validation set losses (the “measure of performance” used for training the model), and any metrics you’ve requested (error rate, in this case).

So, with all this code our model learned to recognize cats and dogs just from labeled examples. But how did it do it?

What Our Image Recognizer Learned

At this stage we have an image recognizer that is working very well, but we have no idea what it is actually doing! Although many people complain that deep learning results in impenetrable “black box” models (that is, something that gives predictions but that no one can understand), this really couldn’t be further from the truth. There is a vast body of research showing how to deeply inspect deep learning models, and get rich insights from them. Having said that, all kinds of machine learning models (including deep learning, and traditional statistical models) can be challenging to fully understand, especially when considering how they will behave when coming across data that is very different to the data used to train them. We’ll be discussing this issue throughout this book.

In 2013 a PhD student, Matt Zeiler, and his supervisor, Rob Fergus, published the paper “Visualizing and Understanding Convolutional Networks”, which showed how to visualize the neural network weights learned in each layer of a model. They carefully analyzed the model that won the 2012 ImageNet competition, and used this analysis to greatly improve the model, such that they were able to go on to win the 2013 competition! The following figure is the picture that they published of the first layer’s weights.

Activations of the first layer of a CNN

This picture requires some explanation. For each layer, the image part with the light gray background shows the reconstructed weights pictures, and the larger section at the bottom shows the parts of the training images that most strongly matched each set of weights. For layer 1, what we can see is that the model has discovered weights that represent diagonal, horizontal, and vertical edges, as well as various different gradients. (Note that for each layer only a subset of the features are shown; in practice there are thousands across all of the layers.) These are the basic building blocks that the model has learned for computer vision. They have been widely analyzed by neuroscientists and computer vision researchers, and it turns out that these learned building blocks are very similar to the basic visual machinery in the human eye, as well as the handcrafted computer vision features that were developed prior to the days of deep learning. The next layer is represented in the figure that follows.

Activations of the second layer of a CNN

For layer 2, there are nine examples of weight reconstructions for each of the features found by the model. We can see that the model has learned to create feature detectors that look for corners, repeating lines, circles, and other simple patterns. These are built from the basic building blocks developed in the first layer. For each of these, the right-hand side of the picture shows small patches from actual images which these features most closely match. For instance, the particular pattern in row 2, column 1 matches the gradients and textures associated with sunsets.

The next figure shows the image from the paper showing the results of reconstructing the features of layer 3.

Activations of the third layer of a CNN

As you can see by looking at the right-hand side of this picture, the features are now able to identify and match with higher-level semantic components, such as car wheels, text, and flower petals. Using these components, layers four and five can identify even higher-level concepts, as shown in the following figure.

Activations of layers 4 and 5 of a CNN

The paper studied an older model called AlexNet that only contained five layers. Networks developed since then can have hundreds of layers, so you can imagine how rich the features developed by these models can be!

When we fine-tuned our pretrained model earlier, we adapted what those last layers focus on (flowers, humans, animals) to specialize on the cats versus dogs problem. More generally, we could specialize such a pretrained model on many different tasks. Let’s have a look at some examples.

Image Recognizers Can Tackle Non-Image Tasks

An image recognizer can, as its name suggests, only recognize images. But a lot of things can be represented as images, which means that an image recognizer can learn to complete many tasks.

For instance, a sound can be converted to a spectrogram, which is a chart that shows the amount of each frequency at each time in an audio file. Fast.ai student Ethan Sutin used this approach to easily beat the published accuracy of a state-of-the-art environmental sound detection model using a dataset of 8,732 urban sounds. fastai’s show_batch clearly shows how each different sound has a quite distinctive spectrogram, as you can see in the following figure.

show_batch with spectrograms of sounds
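If you are curious how a sound becomes a grid of numbers that can be treated as an image, here is a deliberately simple sketch in pure Python. It splits a signal into non-overlapping frames and takes the magnitude of a discrete Fourier transform of each frame. Real audio tools typically add overlapping windows, window functions, and log scaling, none of which are shown here; the test tone and all parameter values are assumptions for the example, and this is not the code used in the student project.

```python
import cmath
import math

def spectrogram(signal, frame_size):
    """Magnitude of the DFT of consecutive, non-overlapping frames.

    Returns one row per frame (time), each holding the magnitudes of
    frequency bins 0 .. frame_size // 2.
    """
    rows = []
    for start in range(0, len(signal) - frame_size + 1, frame_size):
        frame = signal[start:start + frame_size]
        row = []
        for k in range(frame_size // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n, x in enumerate(frame))
            row.append(abs(coeff))
        rows.append(row)
    return rows

# Synthetic one-second tone: an 8 Hz sine sampled at 64 Hz.
sample_rate, freq, frame_size = 64, 8, 16
signal = [math.sin(2 * math.pi * freq * t / sample_rate) for t in range(sample_rate)]

spec = spectrogram(signal, frame_size)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
print(len(spec), len(spec[0]))  # number of frames, number of frequency bins
print(peak_bins)                # strongest frequency bin in each frame
```

Each row of `spec` is one column of the eventual picture; plotting the values as pixel intensities gives an image that a vision model can consume.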

A time series can easily be converted into an image by simply plotting the time series on a graph. However, it is often a good idea to try to represent your data in a way that makes it as easy as possible to pull out the most important components. In a time series, things like seasonality and anomalies are most likely to be of interest. There are various transformations available for time series data. For instance, fast.ai student Ignacio Oguiza created images from a time series dataset for olive oil classification, using a technique called Gramian Angular Difference Field (GADF); you can see the result in the following figure. He then fed those images to an image classification model just like the one you see in this chapter. His results, despite having only 30 training set images, were well over 90% accurate, and close to the state of the art.

Converting a time series into an image
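For readers who want to see what a GADF looks like in code, here is a minimal sketch of the usual textbook definition: rescale the series to the range -1 to 1, convert each value to an angle with arccos, and fill a square grid with the sine of the difference between every pair of angles. This is our own illustration with made-up input values, not the code used in the project described above, and it assumes the series is not constant.

```python
import math

def gadf(series):
    """Gramian Angular Difference Field of a 1-D series with at least two distinct values."""
    lo, hi = min(series), max(series)
    # Rescale to [-1, 1] so that arccos is defined for every value.
    scaled = [2 * (x - lo) / (hi - lo) - 1 for x in series]
    # Clamp to guard against tiny floating-point overshoot.
    phi = [math.acos(max(-1.0, min(1.0, s))) for s in scaled]
    # Entry (i, j) is sin(phi_i - phi_j): one pixel per pair of time steps.
    return [[math.sin(a - b) for b in phi] for a in phi]

series = [0.0, 1.0, 3.0, 2.0, 5.0]  # made-up values
image = gadf(series)
print(len(image), len(image[0]))    # a series of length n gives an n x n grid
```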

Another interesting fast.ai student project example comes from Gleb Esman. He was working on fraud detection at Splunk, using a dataset of users’ mouse movements and mouse clicks. He turned these into pictures by drawing an image where the position, speed, and acceleration of the mouse pointer were displayed using colored lines, and the clicks were displayed using small colored circles, as shown in the following figure. He then fed this into an image recognition model just like the one we’ve used in this chapter, and it worked so well that it led to a patent for this approach to fraud analytics!

Converting computer mouse behavior to an image

Another example comes from the paper “Malware Classification with Deep Convolutional Neural Networks” by Mahmoud Kalash et al., which explains that “the malware binary file is divided into 8-bit sequences which are then converted to equivalent decimal values. This decimal vector is reshaped and a gray-scale image is generated that represents the malware sample,” as illustrated in the following figure.

Malware classification process
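The process quoted from the paper is simple enough to sketch in a few lines of Python. The image width and the zero padding of the last row are our assumptions, since the quote does not specify them, and the sample bytes are a stand-in for the contents of a real binary file.

```python
def bytes_to_image(data, width):
    """Turn a bytes object into rows of 0-255 grayscale pixel values."""
    pixels = list(data)            # each 8-bit chunk becomes an integer
    pad = (-len(pixels)) % width   # zero-pad so that the last row is full
    pixels += [0] * pad
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]

sample = bytes([0, 17, 255, 128, 64, 32, 7])  # stand-in for a file's contents
image = bytes_to_image(sample, width=4)
for row in image:
    print(row)
```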

The authors then show “pictures” generated through this process of malware in different categories, as shown in the next figure.

Malware examples

As you can see, the different types of malware look very distinctive to the human eye. The model the researchers trained based on this image representation was more accurate at malware classification than any previous approach shown in the academic literature. This suggests a good rule of thumb for converting a dataset into an image representation: if the human eye can recognize categories from the images, then a deep learning model should be able to do so too.

In general, you’ll find that a small number of general approaches in deep learning can go a long way, if you’re a bit creative in how you represent your data! You shouldn’t think of approaches like the ones described here as “hacky workarounds,” because actually they often (as here) beat previously state-of-the-art results. These really are the right ways to think about these problem domains.

Jargon Recap

We just covered a lot of information, so let’s recap briefly. The following table provides a handy vocabulary.

[[dljargon]]
.Deep learning vocabulary
[options="header"]
|=====
| Term | Meaning
|Label | The data that we're trying to predict, such as "dog" or "cat"
|Architecture | The _template_ of the model that we're trying to fit; the actual mathematical function that we're passing the input data and parameters to
|Model | The combination of the architecture with a particular set of parameters
|Parameters | The values in the model that change what task it can do, and are updated through model training
|Fit | Update the parameters of the model such that the predictions of the model using the input data match the target labels
|Train | A synonym for _fit_
|Pretrained model | A model that has already been trained, generally using a large dataset, and will be fine-tuned
|Fine-tune | Update a pretrained model for a different task
|Epoch | One complete pass through the input data
|Loss | A measure of how good the model is, chosen to drive training via SGD
|Metric | A measurement of how good the model is, using the validation set, chosen for human consumption
|Validation set | A set of data held out from training, used only for measuring how good the model is
|Training set | The data used for fitting the model; does not include any data from the validation set
|Overfitting | Training a model in such a way that it _remembers_ specific features of the input data, rather than generalizing well to data not seen during training
|CNN | Convolutional neural network; a type of neural network that works particularly well for computer vision tasks
|=====

With this vocabulary in hand, we are now in a position to bring together all the key concepts introduced so far. Take a moment to review those definitions and read the following summary. If you can follow the explanation, then you’re well equipped to understand the discussions to come.

Machine learning is a discipline where we define a program not by writing it entirely ourselves, but by learning from data. Deep learning is a specialty within machine learning that uses neural networks with multiple layers. Image classification is a representative example (also known as image recognition). We start with labeled data; that is, a set of images where we have assigned a label to each image indicating what it represents. Our goal is to produce a program, called a model, which, given a new image, will make an accurate prediction regarding what that new image represents.

Every model starts with a choice of architecture, a general template for how that kind of model works internally. The process of training (or fitting) the model is the process of finding a set of parameter values (or weights) that specialize that general architecture into a model that works well for our particular kind of data. In order to define how well a model does on a single prediction, we need to define a loss function, which determines how we score a prediction as good or bad.

To make the training process go faster, we might start with a pretrained model—a model that has already been trained on someone else’s data. We can then adapt it to our data by training it a bit more on our data, a process called fine-tuning.

When we train a model, a key concern is to ensure that our model generalizes—that is, that it learns general lessons from our data which also apply to new items it will encounter, so that it can make good predictions on those items. The risk is that if we train our model badly, instead of learning general lessons it effectively memorizes what it has already seen, and then it will make poor predictions about new images. Such a failure is called overfitting. In order to avoid this, we always divide our data into two parts, the training set and the validation set. We train the model by showing it only the training set and then we evaluate how well the model is doing by seeing how well it performs on items from the validation set. In this way, we check if the lessons the model learns from the training set are lessons that generalize to the validation set. In order for a person to assess how well the model is doing on the validation set overall, we define a metric. During the training process, when the model has seen every item in the training set, we call that an epoch.

All these concepts apply to machine learning in general. That is, they apply to all sorts of schemes for defining a model by training it with data. What makes deep learning distinctive is a particular class of architectures: the architectures based on neural networks. In particular, tasks like image classification rely heavily on convolutional neural networks, which we will discuss shortly.