Regression

It’s easy to think of deep learning models as being classified into domains, like computer vision, NLP, and so forth. And indeed, that’s how fastai classifies its applications—largely because that’s how most people are used to thinking of things.

But really, that’s hiding a more interesting and deeper perspective. A model is defined by its independent and dependent variables, along with its loss function. That means that there’s really a far wider array of models than just the simple domain-based split. Perhaps we have an independent variable that’s an image, and a dependent that’s text (e.g., generating a caption from an image); or perhaps we have an independent variable that’s text and dependent that’s an image (e.g., generating an image from a caption—which is actually possible for deep learning to do!); or perhaps we’ve got images, texts, and tabular data as independent variables, and we’re trying to predict product purchases… the possibilities really are endless.

To be able to move beyond fixed applications, to crafting your own novel solutions to novel problems, it helps to really understand the data block API (and maybe also the mid-tier API, which we’ll see later in the book). As an example, let’s consider the problem of image regression. This refers to learning from a dataset where the independent variable is an image, and the dependent variable is one or more floats. Often we see people treat image regression as a whole separate application—but as you’ll see here, we can treat it as just another CNN on top of the data block API.

We’re going to jump straight to a somewhat tricky variant of image regression, because we know you’re ready for it! We’re going to do a key point model. A key point refers to a specific location represented in an image—in this case, we’ll use images of people and we’ll be looking for the center of the person’s face in each image. That means we’ll actually be predicting two values for each image: the row and column of the face center.

Assemble the Data

We will use the Biwi Kinect Head Pose dataset for this section. We’ll begin by downloading the dataset as usual:

In [ ]:

path = untar_data(URLs.BIWI_HEAD_POSE)

In [ ]:

#hide
Path.BASE_PATH = path

Let’s see what we’ve got!

In [ ]:

path.ls().sorted()

Out[ ]:

(#50) [Path('01'),Path('01.obj'),Path('02'),Path('02.obj'),Path('03'),Path('03.obj'),Path('04'),Path('04.obj'),Path('05'),Path('05.obj')...]

There are 24 directories numbered from 01 to 24 (they correspond to the different people photographed), and a corresponding .obj file for each (we won’t need them here). Let’s take a look inside one of these directories:

In [ ]:

(path/'01').ls().sorted()

Out[ ]:

(#1000) [Path('01/depth.cal'),Path('01/frame_00003_pose.txt'),Path('01/frame_00003_rgb.jpg'),Path('01/frame_00004_pose.txt'),Path('01/frame_00004_rgb.jpg'),Path('01/frame_00005_pose.txt'),Path('01/frame_00005_rgb.jpg'),Path('01/frame_00006_pose.txt'),Path('01/frame_00006_rgb.jpg'),Path('01/frame_00007_pose.txt')...]

Inside the subdirectories, we have different frames, each of which comes with an image (_rgb.jpg) and a pose file (_pose.txt). We can easily get all the image files recursively with get_image_files, and then write a function that converts an image filename to its associated pose file:

In [ ]:

img_files = get_image_files(path)
def img2pose(x): return Path(f'{str(x)[:-7]}pose.txt')
img2pose(img_files[0])

Out[ ]:

Path('13/frame_00349_pose.txt')

Let’s take a look at our first image:

In [ ]:

im = PILImage.create(img_files[0])
im.shape

Out[ ]:

(480, 640)

In [ ]:

im.to_thumb(160)

Out[ ]:

[Image output: thumbnail of the first image in the dataset, from im.to_thumb(160)]

The Biwi dataset website used to explain the format of the pose text file associated with each image, which shows the location of the center of the head. The details of this aren’t important for our purposes, so we’ll just show the function we use to extract the head center point:

In [ ]:

cal = np.genfromtxt(path/'01'/'rgb.cal', skip_footer=6)
def get_ctr(f):
    ctr = np.genfromtxt(img2pose(f), skip_header=3)
    c1 = ctr[0] * cal[0][0]/ctr[2] + cal[0][2]
    c2 = ctr[1] * cal[1][1]/ctr[2] + cal[1][2]
    return tensor([c1,c2])

This function returns the coordinates as a tensor of two items:

In [ ]:

get_ctr(img_files[0])

Out[ ]:

tensor([384.6370, 259.4787])

We can pass this function to DataBlock as get_y, since it is responsible for labeling each item. We’ll resize the images to half their input size (from 480×640 down to 240×320), just to speed up training a bit.

One important point to note is that we should not just use a random splitter. The reason for this is that the same people appear in multiple images in this dataset, but we want to ensure that our model can generalize to people that it hasn’t seen yet. Each folder in the dataset contains the images for one person. Therefore, we can create a splitter function that returns true for just one person, resulting in a validation set containing just that person’s images.

The only other difference from the previous data block examples is that the second block is a PointBlock. This is necessary so that fastai knows that the labels represent coordinates; that way, it knows that when doing data augmentation, it should do the same augmentation to these coordinates as it does to the images:

In [ ]:

biwi = DataBlock(
    blocks=(ImageBlock, PointBlock),
    get_items=get_image_files,
    get_y=get_ctr,
    splitter=FuncSplitter(lambda o: o.parent.name=='13'),
    batch_tfms=[*aug_transforms(size=(240,320)),
                Normalize.from_stats(*imagenet_stats)]
)

important: Points and Data Augmentation: We’re not aware of other libraries (except for fastai) that automatically and correctly apply data augmentation to coordinates. So, if you’re working with another library, you may need to disable data augmentation for these kinds of problems.
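To make concrete what “applying the augmentation to the coordinates” means, here is a minimal sketch in plain NumPy (not fastai, nor any augmentation library) of keeping a single key point in sync with a horizontal flip. flip_image_and_point is a hypothetical helper written purely for illustration:

import numpy as np

def flip_image_and_point(img, pt):
    "Hypothetical helper: img is an HxWxC array, pt is an (x, y) pixel coordinate."
    flipped = img[:, ::-1]              # mirror the image left to right
    x, y = pt
    mirrored_x = img.shape[1] - 1 - x   # the x coordinate must be mirrored the same way
    return flipped, (mirrored_x, y)

# Example: in a 6-pixel-wide dummy "image", a point at x=1 ends up at x=4 after the flip
img = np.zeros((4, 6, 3))
flipped, pt = flip_image_and_point(img, (1, 2))
print(pt)  # (4, 2)

fastai’s PointBlock does this kind of bookkeeping for you, for all of its augmentations, not just flips.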

Before doing any modeling, we should look at our data to confirm it seems okay:

In [ ]:

dls = biwi.dataloaders(path)
dls.show_batch(max_n=9, figsize=(8,6))

[Image output: a grid of nine sample images from the batch, each with its head-center point overlaid, from dls.show_batch]

That’s looking good! As well as looking at the batch visually, it’s a good idea to also look at the underlying tensors (especially as a student; it will help clarify your understanding of what your model is really seeing):

In [ ]:

xb,yb = dls.one_batch()
xb.shape,yb.shape

Out[ ]:

(torch.Size([64, 3, 240, 320]), torch.Size([64, 1, 2]))

Make sure that you understand why these are the shapes for our mini-batches.
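One way to check your reading of them (a quick hypothetical sanity check, reusing the xb and yb from the cell above):

bs, channels, height, width = xb.shape           # 64 images per batch, 3 colour channels, 240x320 pixels
n_points, n_coords = yb.shape[1], yb.shape[2]    # 1 key point per image, each with 2 coordinates
assert xb.shape == (64, 3, 240, 320) and yb.shape == (64, 1, 2)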

Here’s an example of one row from the dependent variable:

In [ ]:

yb[0]

Out[ ]:

TensorPoint([[-0.3375, 0.2193]], device='cuda:6')

As you can see, we haven’t had to use a separate image regression application; all we’ve had to do is label the data, and tell fastai what kinds of data the independent and dependent variables represent.

It’s the same for creating our Learner. We will use the same function as before, with one new parameter, and we will be ready to train our model.

Training a Model

As usual, we can use cnn_learner to create our Learner. Remember how, earlier in the book, we used y_range to tell fastai the range of our targets? We’ll do the same here (coordinates in fastai and PyTorch are always rescaled between -1 and +1):

In [ ]:

learn = cnn_learner(dls, resnet18, y_range=(-1,1))

y_range is implemented in fastai using sigmoid_range, which is defined as:

In [ ]:

def sigmoid_range(x, lo, hi): return torch.sigmoid(x) * (hi-lo) + lo

This is set as the final layer of the model, if y_range is defined. Take a moment to think about what this function does, and why it forces the model to output activations in the range (lo,hi).
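To see the effect concretely, here is a quick check (reusing the definition above and the imports from earlier in the chapter; the printed values are approximate): very negative inputs land just above lo, zero lands at the midpoint, and very positive inputs land just below hi.

x = torch.tensor([-10., 0., 10.])
sigmoid_range(x, -1, 1)   # approximately tensor([-1.,  0.,  1.])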

Here’s what it looks like:

In [ ]:

plot_function(partial(sigmoid_range,lo=-1,hi=1), min=-4, max=4)

[Plot output: sigmoid_range with lo=-1, hi=1 over inputs from -4 to 4; an S-shaped curve that flattens out near -1 and +1]

We didn’t specify a loss function, which means we’re getting whatever fastai chooses as the default. Let’s see what it picked for us:

In [ ]:

dls.loss_func

Out[ ]:

FlattenedLoss of MSELoss()

This makes sense: when coordinates are used as the dependent variable, most of the time we’re trying to predict values that are as close as possible to the targets, and that’s basically what MSELoss (mean squared error loss) penalizes. If you want to use a different loss function, you can pass it to cnn_learner using the loss_func parameter.

Note also that we didn’t specify any metrics. That’s because the MSE is already a useful metric for this task (although it’s probably more interpretable after we take the square root).
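If you did want either of those, both can be set explicitly when creating the Learner. This is only a hedged sketch (it assumes MSELossFlat and the rmse metric exported by your version of fastai, and it was not run as part of this chapter); it is otherwise the same call as above:

learn = cnn_learner(dls, resnet18, y_range=(-1,1),
                    loss_func=MSELossFlat(),   # explicit version of the flattened MSE fastai picked by default
                    metrics=rmse)              # reports the square root of the MSE each epoch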

We can pick a good learning rate with the learning rate finder:

In [ ]:

learn.lr_find()

Out[ ]:

SuggestedLRs(lr_min=0.005754399299621582, lr_steep=0.033113110810518265)

[Plot output: learning rate finder curve of loss versus learning rate, from learn.lr_find]

We’ll try an LR of 1e-2:

In [ ]:

lr = 1e-2
learn.fine_tune(3, lr)
epoch   train_loss   valid_loss   time
0       0.049630     0.007602     00:42

epoch   train_loss   valid_loss   time
0       0.008714     0.004291     00:53
1       0.003213     0.000715     00:53
2       0.001482     0.000036     00:53

Generally when we run this we get a loss of around 0.0001, which corresponds to an average coordinate prediction error of:

In [ ]:

math.sqrt(0.0001)

Out[ ]:

0.01

This sounds very accurate! But it’s important to take a look at our results with Learner.show_results. The left side shows the actual (ground truth) coordinates and the right side shows our model’s predictions:

In [ ]:

learn.show_results(ds_idx=1, nrows=3, figsize=(6,8))

[Image output: learn.show_results grid, with ground-truth head-center points on the left and predicted points on the right]

It’s quite amazing that with just a few minutes of computation we’ve created such an accurate key points model, and without any special domain-specific application. This is the power of building on flexible APIs, and using transfer learning! It’s particularly striking that we’ve been able to use transfer learning so effectively even between totally different tasks; our pretrained model was trained to do image classification, and we fine-tuned for image regression.