Color Images

A color picture is a rank-3 tensor:

In [ ]:

  1. im = image2tensor(Image.open(image_bear()))
  2. im.shape

Out[ ]:

  1. torch.Size([3, 1000, 846])

In [ ]:

  1. show_image(im);

Color Images - Figure 1

The first axis contains the channels, red, green, and blue:

In [ ]:

  1. _,axs = subplots(1,3)
  2. for bear,ax,color in zip(im,axs,('Reds','Greens','Blues')):
  3.     show_image(255-bear, ax=ax, cmap=color)

Color Images - Figure 2

We saw what the convolution operation was for one filter on one channel of the image (our examples were done on a square). A convolutional layer will take an image with a certain number of channels (three for the first layer for regular RGB color images) and output an image with a different number of channels. Like the hidden size that represented the number of neurons in a linear layer, we can decide to have as many filters as we want, and each of them will be able to specialize: some to detect horizontal edges, others to detect vertical edges, and so forth, giving something like what we studied in <>.
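For instance, here is a minimal sketch in plain PyTorch (rather than the fastai helpers used above; the choice of 16 output channels and a 3 by 3 kernel is arbitrary) showing that a convolutional layer maps a 3-channel input to as many output channels as it has filters:

  import torch
  import torch.nn as nn

  # 3 input channels (RGB) -> 16 output channels, one per filter
  conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

  x = torch.randn(1, 3, 1000, 846)   # a random batch of one image shaped like `im` above
  conv(x).shape                      # torch.Size([1, 16, 998, 844])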

In one sliding window, we have a certain number of channels and we need as many filters (we don’t use the same kernel for all the channels). So our kernel doesn’t have a size of 3 by 3, but ch_in (for channels in) by 3 by 3. On each channel, we multiply the elements of our window by the elements of the corresponding filter, then sum the results (as we saw before), and finally sum over all the filters. In the example given in <>, the result of our conv layer on that window is red + green + blue.

Convolution over an RGB image

So, in order to apply a convolution to a color picture we require a kernel tensor with a size that matches the first axis. At each location, the corresponding parts of the kernel and the image patch are multiplied together.

These are then all added together, to produce a single number, for each grid location, for each output feature, as shown in <>.

Adding the RGB filters
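To make the per-channel sum concrete, here is a small sketch in plain PyTorch (random values, one 3 by 3 window, one filter) that computes a single output value by hand and checks it against F.conv2d:

  import torch
  import torch.nn.functional as F

  window = torch.randn(3, 3, 3)   # one 3x3 patch of the image, with 3 channels
  kernel = torch.randn(3, 3, 3)   # one filter: a 3x3 kernel per input channel

  # Multiply each channel of the window by its slice of the kernel and sum,
  # then add the three per-channel results: y_R + y_G + y_B
  per_channel = (window * kernel).sum(dim=(1, 2))
  by_hand = per_channel.sum()

  # The same computation via F.conv2d: a batch of one image, one output filter
  via_conv = F.conv2d(window[None], kernel[None])   # shape [1, 1, 1, 1]
  torch.allclose(by_hand, via_conv.squeeze())       # True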

Then we have ch_out filters like this, so in the end, the result of our convolutional layer will be a batch of images with ch_out channels and a height and width given by the formula outlined earlier. This gives us ch_out tensors of size ch_in x ks x ks, which we represent in one big tensor of four dimensions. In PyTorch, the order of the dimensions for those weights is ch_out x ch_in x ks x ks.

Additionally, we may want to have a bias for each filter. In the preceding example, the final result for our convolutional layer would be $y_{R} + y_{G} + y_{B} + b$ in that case. As in a linear layer, there are as many biases as we have kernels, so the biases form a vector of size ch_out.
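We can check both of these shapes on a layer created with nn.Conv2d (continuing the illustrative 3-in, 16-out example; a sketch, not code from the notebook above):

  import torch.nn as nn

  conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
  conv.weight.shape   # torch.Size([16, 3, 3, 3]) -- ch_out x ch_in x ks x ks
  conv.bias.shape     # torch.Size([16])          -- one bias per output channel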

There are no special mechanisms required when setting up a CNN for training with color images. Just make sure your first layer has three inputs.
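For example, a minimal first block for RGB input just declares three input channels (a sketch; the other layers and channel counts are placeholders):

  import torch.nn as nn

  model = nn.Sequential(
      nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),   # 3 inputs: R, G, B
      nn.ReLU(),
      nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),
      nn.ReLU(),
  )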

There are lots of ways of processing color images. For instance, you can change them to black and white, change from RGB to HSV (hue, saturation, and value) color space, and so forth. In general, it turns out experimentally that changing the encoding of colors won’t make any difference to your model results, as long as you don’t lose information in the transformation. So, transforming to black and white is a bad idea, since it removes the color information entirely (and this can be critical; for instance, a pet breed may have a distinctive color); but converting to HSV generally won’t make any difference.
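For instance, both conversions are one line with PIL (a sketch reusing the bear image opened at the start of this section):

  from PIL import Image

  img = Image.open(image_bear())   # image_bear() as used earlier
  gray = img.convert('L')          # black and white: the color information is lost
  hsv  = img.convert('HSV')        # HSV: the same information, encoded differently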

Now you know what those pictures in <> of “what a neural net learns” from the Zeiler and Fergus paper mean! This is their picture of some of the layer 1 weights that we showed earlier:

Layer 1 kernels found by Zeiler and Fergus

This is taking the three slices of the convolutional kernel, for each output feature, and displaying them as images. We can see that even though the creators of the neural net never explicitly created kernels to find edges, for instance, the neural net automatically discovered these features using SGD.
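You can make a rough version of that kind of picture yourself. Here is a sketch (assuming a recent torchvision; it uses a pretrained ResNet-18 rather than the AlexNet-style model from the paper) that takes the first layer's weights, rescales each filter's three slices into the 0-1 range, and displays each filter as a tiny RGB image:

  import torch
  import matplotlib.pyplot as plt
  from torchvision.models import resnet18

  model = resnet18(weights='DEFAULT')   # any pretrained model will do
  w = model.conv1.weight.detach()       # shape [64, 3, 7, 7]: ch_out x ch_in x ks x ks

  # Rescale each filter to 0-1 so its three slices can be shown as one RGB image
  lo = w.amin(dim=(1, 2, 3), keepdim=True)
  hi = w.amax(dim=(1, 2, 3), keepdim=True)
  w = (w - lo) / (hi - lo)

  fig, axs = plt.subplots(8, 8, figsize=(6, 6))
  for filt, ax in zip(w, axs.flat):
      ax.imshow(filt.permute(1, 2, 0))  # channels last for imshow
      ax.axis('off')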

Now let’s see how we can train these CNNs, and show you all the techniques fastai uses under the hood for efficient training.