First Try: Pixel Similarity

So, here is a first idea: how about we find the average pixel value for every pixel of the 3s, then do the same for the 7s. This will give us two group averages, defining what we might call the “ideal” 3 and 7. Then, to classify an image as one digit or the other, we see which of these two ideal digits the image is most similar to. This certainly seems like it should be better than nothing, so it will make a good baseline.

jargon: Baseline: A simple model which you are confident should perform reasonably well. It should be very simple to implement, and very easy to test, so that you can then test each of your improved ideas, and make sure they are always better than your baseline. Without starting with a sensible baseline, it is very difficult to know whether your super-fancy models are actually any good. One good approach to creating a baseline is doing what we have done here: think of a simple, easy-to-implement model. Another good approach is to search around to find other people that have solved similar problems to yours, and download and run their code on your dataset. Ideally, try both of these!

Step one for our simple model is to get the average of pixel values for each of our two groups. In the process of doing this, we will learn a lot of neat Python numeric programming tricks!

Let’s create a tensor containing all of our 3s stacked together. We already know how to create a tensor containing a single image. To create a tensor containing all the images in a directory, we will first use a Python list comprehension to create a plain list of the single image tensors.

We will use Jupyter to do some little checks of our work along the way—in this case, making sure that the number of returned items seems reasonable:

In [ ]:

  seven_tensors = [tensor(Image.open(o)) for o in sevens]
  three_tensors = [tensor(Image.open(o)) for o in threes]
  len(three_tensors),len(seven_tensors)

Out[ ]:

  (6131, 6265)

note: List Comprehensions: List and dictionary comprehensions are a wonderful feature of Python. Many Python programmers use them every day, including the authors of this book—they are part of “idiomatic Python.” But programmers coming from other languages may have never seen them before. There are a lot of great tutorials just a web search away, so we won’t spend a long time discussing them now. Here is a quick explanation and example to get you started. A list comprehension looks like this: new_list = [f(o) for o in a_list if o>0]. This will return every element of a_list that is greater than 0, after passing it to the function f. There are three parts here: the collection you are iterating over (a_list), an optional filter (if o>0), and something to do to each element (f(o)). It’s not only shorter to write but way faster than the alternative ways of creating the same list with a loop.
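Here is the comprehension from that note next to the equivalent loop. (The sample list and the function f are stand-ins, made up just for this illustration.)

  a_list = [3, -1, 4, -1, 5]
  def f(o): return o * 2                       # a stand-in function for the example

  new_list = [f(o) for o in a_list if o > 0]   # [6, 8, 10]

  # the same result, written as a loop:
  new_list = []
  for o in a_list:
      if o > 0:
          new_list.append(f(o))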

We’ll also check that one of the images looks okay. Since we now have tensors (which Jupyter by default will print as values), rather than PIL images (which Jupyter by default will display as images), we need to use fastai’s show_image function to display it:

In [ ]:

  show_image(three_tensors[1]);

[Image: the second image in three_tensors, a handwritten 3]

For every pixel position, we want to compute the average over all the images of the intensity of that pixel. To do this we first combine all the images in this list into a single three-dimensional tensor. The most common way to describe such a tensor is to call it a rank-3 tensor. We often need to stack up individual tensors in a collection into a single tensor. Unsurprisingly, PyTorch comes with a function called stack that we can use for this purpose.

Some operations in PyTorch, such as taking a mean, require us to cast our integer types to float types. Since we’ll be needing this later, we’ll also cast our stacked tensor to float now. Casting in PyTorch is as simple as typing the name of the type you wish to cast to, and treating it as a method.

Generally when images are floats, the pixel values are expected to be between 0 and 1, so we will also divide by 255 here:

In [ ]:

  stacked_sevens = torch.stack(seven_tensors).float()/255
  stacked_threes = torch.stack(three_tensors).float()/255
  stacked_threes.shape

Out[ ]:

  torch.Size([6131, 28, 28])

Perhaps the most important attribute of a tensor is its shape. This tells you the length of each axis. In this case, we can see that we have 6,131 images, each of size 28×28 pixels. There is nothing specifically about this tensor that says that the first axis is the number of images, the second is the height, and the third is the width—the semantics of a tensor are entirely up to us, and how we construct it. As far as PyTorch is concerned, it is just a bunch of numbers in memory.
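To illustrate that the semantics are up to us, here is a little sketch (torch is available after the fastai import at the top of the notebook): the same six numbers in memory can be viewed as two rows of three or as three rows of two.

  t = torch.arange(6)   # tensor([0, 1, 2, 3, 4, 5])
  t.view(2, 3)          # tensor([[0, 1, 2], [3, 4, 5]])
  t.view(3, 2)          # same numbers, different shape, different semantics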

The length of a tensor’s shape is its rank:

In [ ]:

  len(stacked_threes.shape)

Out[ ]:

  3

It is really important for you to commit to memory and practice these bits of tensor jargon: rank is the number of axes or dimensions in a tensor; shape is the size of each axis of a tensor.

A: Watch out because the term “dimension” is sometimes used in two ways. Consider that we live in “three-dimensional space” where a physical position can be described by a 3-vector v. But according to PyTorch, the attribute v.ndim (which sure looks like the “number of dimensions” of v) equals one, not three! Why? Because v is a vector, which is a tensor of rank one, meaning that it has only one axis (even if that axis has a length of three). In other words, sometimes dimension is used for the size of an axis (“space is three-dimensional”); other times, it is used for the rank, or the number of axes (“a matrix has two dimensions”). When confused, I find it helpful to translate all statements into terms of rank, axis, and length, which are unambiguous terms.
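For example, here is the 3-vector from the note above, in code:

  v = tensor([1., 2., 3.])
  v.ndim, v.shape   # (1, torch.Size([3])): rank one, a single axis of length three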

We can also get a tensor’s rank directly with ndim:

In [ ]:

  stacked_threes.ndim

Out[ ]:

  3

Finally, we can compute what the ideal 3 looks like. We calculate the mean of all the image tensors by taking the mean along dimension 0 of our stacked, rank-3 tensor. This is the dimension that indexes over all the images.

In other words, for every pixel position, this will compute the average of that pixel over all images. The result will be one value for every pixel position, or a single image. Here it is:

In [ ]:

  mean3 = stacked_threes.mean(0)
  show_image(mean3);

[Image: the pixelwise average of all the 3s, our “ideal” 3]

According to this dataset, this is the ideal number 3! (You may not like it, but this is what peak number 3 performance looks like.) You can see how it’s very dark where all the images agree it should be dark, but it becomes wispy and blurry where the images disagree.

Let’s do the same thing for the 7s, but put all the steps together at once to save some time:

In [ ]:

  mean7 = stacked_sevens.mean(0)
  show_image(mean7);

[Image: the pixelwise average of all the 7s, our “ideal” 7]

Let’s now pick an arbitrary 3 and measure its distance from our “ideal digits.”

stop: Stop and Think!: How would you calculate how similar a particular image is to each of our ideal digits? Remember to step away from this book and jot down some ideas before you move on! Research shows that recall and understanding improve dramatically when you engage with the learning process by solving problems, experimenting, and trying new ideas yourself.

Here’s a sample 3:

In [ ]:

  a_3 = stacked_threes[1]
  show_image(a_3);

[Image: a_3, the sample 3 we will measure distances from]

How can we determine its distance from our ideal 3? We can’t just add up the differences between the pixels of this image and the ideal digit. Some differences will be positive while others will be negative, and these differences will cancel out, resulting in a situation where an image that is too dark in some places and too light in others might be shown as having zero total differences from the ideal. That would be misleading!
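To see the cancellation concretely, here is a tiny made-up example: one pixel is 0.5 too dark and another is 0.5 too light, so the plain mean of the differences is zero even though both pixels are wrong.

  diffs = tensor([-0.5, 0.5])
  diffs.mean()         # tensor(0.): the two errors cancel out
  diffs.abs().mean()   # tensor(0.5000): absolute values prevent the cancellation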

To avoid this, there are two main ways data scientists measure distance in this context:

  • Take the mean of the absolute value of differences (absolute value is the function that replaces negative values with positive values). This is called the mean absolute difference or L1 norm.
  • Take the mean of the square of differences (which makes everything positive) and then take the square root (which undoes the squaring). This is called the root mean squared error (RMSE) or L2 norm.

important: It’s Okay to Have Forgotten Your Math: In this book we generally assume that you have completed high school math, and remember at least some of it… But everybody forgets some things! It all depends on what you happen to have had reason to practice in the meantime. Perhaps you have forgotten what a square root is, or exactly how it works. No problem! Any time you come across a math concept that is not explained fully in this book, don’t just keep moving on; instead, stop and look it up. Make sure you understand the basic idea, how it works, and why we might be using it. One of the best places to refresh your understanding is Khan Academy. For instance, Khan Academy has a great introduction to square roots.

Let’s try both of these now:

In [ ]:

  dist_3_abs = (a_3 - mean3).abs().mean()
  dist_3_sqr = ((a_3 - mean3)**2).mean().sqrt()
  dist_3_abs,dist_3_sqr

Out[ ]:

  (tensor(0.1114), tensor(0.2021))

In [ ]:

  dist_7_abs = (a_3 - mean7).abs().mean()
  dist_7_sqr = ((a_3 - mean7)**2).mean().sqrt()
  dist_7_abs,dist_7_sqr

Out[ ]:

  (tensor(0.1586), tensor(0.3021))

In both cases, the distance between our 3 and the “ideal” 3 is less than the distance to the ideal 7. So our simple model will give the right prediction in this case.
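To make the decision rule explicit, here is a minimal sketch that turns the two L1 distances into a prediction (it reuses mean3 and mean7 from above; the function name is ours, not a fastai API):

  def is_3(x):
      dist_to_3 = (x - mean3).abs().mean()
      dist_to_7 = (x - mean7).abs().mean()
      return dist_to_3 < dist_to_7

  is_3(a_3)   # tensor(True): our sample 3 is classified correctly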

PyTorch already provides both of these as loss functions. You’ll find these inside torch.nn.functional, which the PyTorch team recommends importing as F (and is available by default under that name in fastai):

In [ ]:

  F.l1_loss(a_3.float(),mean7), F.mse_loss(a_3,mean7).sqrt()

Out[ ]:

  (tensor(0.1586), tensor(0.3021))

Here mse stands for mean squared error, and l1 refers to the standard mathematical jargon for mean absolute value (in math it’s called the L1 norm).

S: Intuitively, the difference between L1 norm and mean squared error (MSE) is that the latter will penalize bigger mistakes more heavily than the former (and be more lenient with small mistakes).

J: When I first came across this “L1” thingie, I looked it up to see what on earth it meant. I found on Google that it is a vector norm using absolute value, so looked up vector norm and started reading: Given a vector space V over a field F of the real or complex numbers, a norm on V is a nonnegative-valued function p: V → [0,+∞) with the following properties: For all a ∈ F and all u, v ∈ V, p(u + v) ≤ p(u) + p(v)… Then I stopped reading. “Ugh, I’ll never understand math!” I thought, for the thousandth time. Since then I’ve learned that every time these complex mathy bits of jargon come up in practice, it turns out I can replace them with a tiny bit of code! Like, the L1 loss is just equal to (a-b).abs().mean(), where a and b are tensors. I guess mathy folks just think differently than me… I’ll make sure in this book that every time some mathy jargon comes up, I’ll give you the little bit of code it’s equal to as well, and explain in common-sense terms what’s going on.
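This toy example (with made-up values) shows both points at once: the L1 loss really is just (a-b).abs().mean(), and RMSE punishes the single big error far more heavily than the two small ones:

  a = tensor([0., 0., 0.])
  b = tensor([0.1, 0.1, 0.9])    # two small errors and one big one

  (a - b).abs().mean()           # tensor(0.3667), identical to F.l1_loss(a, b)
  ((a - b)**2).mean().sqrt()     # tensor(0.5260): the big error dominates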

We just completed various mathematical operations on PyTorch tensors. If you’ve done some numeric programming in NumPy before, you may recognize these as being similar to NumPy arrays. Let’s have a look at those two very important data structures.

NumPy Arrays and PyTorch Tensors

NumPy is the most widely used library for scientific and numeric programming in Python. It provides very similar functionality and a very similar API to that provided by PyTorch; however, it does not support using the GPU or calculating gradients, which are both critical for deep learning. Therefore, in this book we will generally use PyTorch tensors instead of NumPy arrays, where possible.

(Note that fastai adds some features to NumPy and PyTorch to make them a bit more similar to each other. If any code in this book doesn’t work on your computer, it’s possible that you forgot to include a line like this at the start of your notebook: from fastai.vision.all import *.)

But what are arrays and tensors, and why should you care?

Python is slow compared to many languages. Anything fast in Python, NumPy, or PyTorch is likely to be a wrapper for a compiled object written (and optimized) in another language—specifically C. In fact, NumPy arrays and PyTorch tensors can finish computations many thousands of times faster than using pure Python.
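As a rough illustration (exact numbers will vary by machine), compare doubling and summing a million numbers with a pure-Python loop versus a single vectorized tensor operation:

  import time
  nums = list(range(1_000_000))

  t0 = time.perf_counter()
  total = sum(x * 2 for x in nums)   # pure Python: one element at a time
  print(time.perf_counter() - t0)

  t = tensor(nums)
  t0 = time.perf_counter()
  total = (t * 2).sum()              # a single call into optimized C code
  print(time.perf_counter() - t0)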

A NumPy array is a multidimensional table of data, with all items of the same type. Since that can be any type at all, they can even be arrays of arrays, with the innermost arrays potentially being different sizes—this is called a “jagged array.” By “multidimensional table” we mean, for instance, a list (dimension of one), a table or matrix (dimension of two), a “table of tables” or “cube” (dimension of three), and so forth. If the items are all of some simple type such as integer or float, then NumPy will store them as a compact C data structure in memory. This is where NumPy shines. NumPy has a wide variety of operators and methods that can run computations on these compact structures at the same speed as optimized C, because they are written in optimized C.

A PyTorch tensor is nearly the same thing as a NumPy array, but with an additional restriction that unlocks some additional capabilities. It’s the same in that it, too, is a multidimensional table of data, with all items of the same type. However, the restriction is that a tensor cannot use just any old type—it has to use a single basic numeric type for all components. For example, a PyTorch tensor cannot be jagged. It is always a regularly shaped multidimensional rectangular structure.
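A quick sketch of the difference: NumPy will store a jagged structure if you ask for an array of Python objects, while PyTorch refuses outright.

  import numpy as np
  np.array([[1, 2, 3], [4, 5]], dtype=object)   # works: an array holding two lists

  try:
      torch.tensor([[1, 2, 3], [4, 5]])         # jagged input
  except ValueError as e:
      print(e)                                  # tensors must be rectangular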

The vast majority of methods and operators supported by NumPy on these structures are also supported by PyTorch, but PyTorch tensors have additional capabilities. One major capability is that these structures can live on the GPU, in which case their computation will be optimized for the GPU and can run much faster (given lots of values to work on). In addition, PyTorch can automatically calculate derivatives of these operations, including combinations of operations. As you’ll see, it would be impossible to do deep learning in practice without this capability.
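Here is a minimal sketch of both capabilities (the last two lines only run if a GPU is present):

  x = torch.tensor([2., 3.], requires_grad=True)   # ask PyTorch to track operations on x
  y = (x ** 2).sum()
  y.backward()                                     # compute dy/dx automatically
  x.grad                                           # tensor([4., 6.]), i.e. 2*x

  if torch.cuda.is_available():
      x_gpu = x.detach().to('cuda')                # the same data, now living on the GPU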

S: If you don’t know what C is, don’t worry as you won’t need it at all. In a nutshell, it’s a low-level (low-level means more similar to the language that computers use internally) language that is very fast compared to Python. To take advantage of its speed while programming in Python, try to avoid as much as possible writing loops, and replace them by commands that work directly on arrays or tensors.

Perhaps the most important new coding skill for a Python programmer to learn is how to effectively use the array/tensor APIs. We will be showing lots more tricks later in this book, but here’s a summary of the key things you need to know for now.

To create an array or tensor, pass a list (or list of lists, or list of lists of lists, etc.) to array() or tensor():

In [ ]:

  data = [[1,2,3],[4,5,6]]
  arr = array(data)
  tns = tensor(data)

In [ ]:

  arr # numpy

Out[ ]:

  array([[1, 2, 3],
         [4, 5, 6]])

In [ ]:

  tns # pytorch

Out[ ]:

  tensor([[1, 2, 3],
          [4, 5, 6]])

All the operations that follow are shown on tensors, but the syntax and results for NumPy arrays are identical.

You can select a row (note that, like lists in Python, tensors are 0-indexed so 1 refers to the second row/column):

In [ ]:

  tns[1]

Out[ ]:

  tensor([4, 5, 6])

or a column, by using : to indicate all of the first axis (we sometimes refer to the dimensions of tensors/arrays as axes):

In [ ]:

  tns[:,1]

Out[ ]:

  tensor([2, 5])

You can combine these with Python slice syntax ([start:end] with end being excluded) to select part of a row or column:

In [ ]:

  tns[1,1:3]

Out[ ]:

  tensor([5, 6])

And you can use the standard operators such as +, -, *, /:

In [ ]:

  tns+1

Out[ ]:

  tensor([[2, 3, 4],
          [5, 6, 7]])

Tensors have a type:

In [ ]:

  tns.type()

Out[ ]:

  'torch.LongTensor'

And they will automatically change type as needed, for example from int to float:

In [ ]:

  tns*1.5

Out[ ]:

  tensor([[1.5000, 3.0000, 4.5000],
          [6.0000, 7.5000, 9.0000]])

So, is our baseline model any good? To quantify this, we must define a metric.