Building a Neural Net Layer from Scratch

Let’s start by refreshing our understanding of how matrix multiplication is used in a basic neural network. Since we’re building everything up from scratch, we’ll use nothing but plain Python initially (except for indexing into PyTorch tensors), and then replace the plain Python with PyTorch functionality once we’ve seen how to create it.

Modeling a Neuron

A neuron receives a given number of inputs and has an internal weight for each of them. To produce its output, it sums those weighted inputs and adds a bias. In math, this can be written as:

$$out = \sum_{i=1}^{n} x_{i} w_{i} + b$$

if we name our inputs $(x_{1},\dots,x_{n})$, our weights $(w_{1},\dots,w_{n})$, and our bias $b$. In code this translates into:

  1. output = sum([x*w for x,w in zip(inputs,weights)]) + bias

This output is then fed into a nonlinear function called an activation function before being sent to another neuron. In deep learning the most common of these is the rectified linear unit, or ReLU, which, as we’ve seen, is a fancy way of saying:

  1. def relu(x): return x if x >= 0 else 0

A deep learning model is then built by stacking a lot of those neurons in successive layers. We create a first layer with a certain number of neurons (known as the hidden size) and link all the inputs to each of those neurons. Such a layer is often called a fully connected layer, a dense layer (for densely connected), or a linear layer.
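
To make this concrete, here is a minimal plain-Python sketch of such a layer, before any activation is applied. The numbers are made up purely for illustration: two neurons, each with three weights and a bias.

    inputs  = [1.0, 2.0, 3.0]                      # one input sample (n_inputs = 3)
    weights = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # one weight list per neuron (hidden size = 2)
    biases  = [0.5, -0.5]                          # one bias per neuron
    # One output per neuron: the weighted sum of the inputs plus the bias
    outputs = [sum([x*w for x,w in zip(inputs, ws)]) + b
               for ws, b in zip(weights, biases)]
    # outputs is approximately [1.9, 2.7]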

Computing the output of such a layer requires, for each input in our batch and for each neuron with its given weights, the dot product:

  1. sum([x*w for x,w in zip(input,weight)])

If you have done a little bit of linear algebra, you may remember that having a lot of those dot products happens when you do a matrix multiplication. More precisely, if our inputs are in a matrix x with a size of batch_size by n_inputs, and if we have grouped the weights of our neurons in a matrix w of size n_neurons by n_inputs (each neuron must have the same number of weights as it has inputs) and all the biases in a vector b of size n_neurons, then the output of this fully connected layer is:

  1. y = x @ w.t() + b

where @ represents matrix multiplication and w.t() is the transpose of w. The output y is then of size batch_size by n_neurons, and in position (i,j) we have (for the mathy folks out there):

$$y_{i,j} = \sum_{k=1}^{n} x_{i,k} w_{j,k} + b_{j}$$

Or in code:

  1. y[i,j] = sum([xk*wk for xk,wk in zip(x[i,:],w[j,:])]) + b[j]

The transpose is necessary because in the mathematical definition of the matrix product m @ n, the coefficient (i,j) is:

  1. sum([a * b for a,b in zip(m[i,:],n[:,j])])

So the very basic operation we need is a matrix multiplication, as it’s what is hidden in the core of a neural net.
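
As a quick sanity check (purely illustrative; the sizes and values here are made up), we can verify on a tiny example that x @ w.t() + b matches the loop formula above:

    import torch
    x = torch.randn(4, 3)            # batch_size=4, n_inputs=3
    w = torch.randn(2, 3)            # n_neurons=2, each with 3 weights
    bias = torch.randn(2)            # one bias per neuron
    y = x @ w.t() + bias             # shape: 4 x 2
    y_manual = torch.zeros(4, 2)
    for i in range(4):
        for j in range(2):
            y_manual[i,j] = sum([x[i,k]*w[j,k] for k in range(3)]) + bias[j]
    assert torch.allclose(y, y_manual, atol=1e-5)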

Matrix Multiplication from Scratch

Let’s write a function that computes the matrix product of two tensors, before we allow ourselves to use the PyTorch version of it. We will use nothing but indexing into PyTorch tensors:

In [ ]:

  1. import torch
  2. from torch import tensor

We’ll need three nested for loops: one for the row indices, one for the column indices, and one for the inner sum. ac and ar stand for number of columns of a and number of rows of a, respectively (the same convention is followed for b), and we make sure calculating the matrix product is possible by checking that a has as many columns as b has rows:

In [ ]:

def matmul(a,b):
    ar,ac = a.shape # n_rows * n_cols
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            for k in range(ac): c[i,j] += a[i,k] * b[k,j]
    return c

To test this out, we’ll pretend (using random matrices) that we’re working with a small batch of 5 MNIST images, each flattened into a vector of 28*28 pixels, and a linear model to turn them into 10 activations:

In [ ]:

  1. m1 = torch.randn(5,28*28)
  2. m2 = torch.randn(784,10)

Let’s time our function, using the Jupyter “magic” command %time:

In [ ]:

  1. %time t1=matmul(m1, m2)
  1. CPU times: user 1.15 s, sys: 4.09 ms, total: 1.15 s
  2. Wall time: 1.15 s

And see how that compares to PyTorch’s built-in @:

In [ ]:

  1. %timeit -n 20 t2=m1@m2
  1. 14 µs ± 8.95 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
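
As a quick sanity check (an extra step, not part of the timing comparison), we can confirm that our from-scratch result matches what PyTorch computes:

    t2 = m1 @ m2
    # generous tolerance to allow for float32 accumulation differences
    assert torch.allclose(t1, t2, atol=1e-3)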

As we can see, in Python three nested loops is a very bad idea! Python is a slow language, and this isn’t going to be very efficient. We see here that PyTorch is around 80,000 times faster than pure Python, and that’s before we even start using the GPU!

Where does this difference come from? PyTorch didn’t write its matrix multiplication in Python, but rather in C++ to make it fast. In general, whenever we do computations on tensors we will need to vectorize them so that we can take advantage of the speed of PyTorch, usually by using two techniques: elementwise arithmetic and broadcasting.

Elementwise Arithmetic

All the basic operators (+, -, *, /, >, <, ==) can be applied elementwise. That means if we write a+b for two tensors a and b that have the same shape, we will get a tensor composed of the sums of the elements of a and b:

In [ ]:

  1. a = tensor([10., 6, -4])
  2. b = tensor([2., 8, 7])
  3. a + b

Out[ ]:

  1. tensor([12., 14., 3.])

The Boolean operators will return an array of Booleans:

In [ ]:

  1. a < b

Out[ ]:

  1. tensor([False, True, True])

If we want to know if every element of a is less than the corresponding element in b, or if two tensors are equal, we need to combine those elementwise operations with torch.all:

In [ ]:

  1. (a < b).all(), (a==b).all()

Out[ ]:

  1. (tensor(False), tensor(False))

Reduction operations like all(), sum() and mean() return tensors with only one element, called rank-0 tensors. If you want to convert this to a plain Python Boolean or number, you need to call .item():

In [ ]:

  1. (a + b).mean().item()

Out[ ]:

  1. 9.666666984558105

The elementwise operations work on tensors of any rank, as long as they have the same shape:

In [ ]:

  1. m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
  2. m*m

Out[ ]:

  1. tensor([[ 1., 4., 9.],
  2. [16., 25., 36.],
  3. [49., 64., 81.]])

However, you can’t perform elementwise operations on tensors that don’t have the same shape (unless they are broadcastable, as discussed in the next section):

In [ ]:

  1. n = tensor([[1., 2, 3], [4,5,6]])
  2. m*n
  1. ------------------------------------------------------------------------
  2. RuntimeError Traceback (most recent call last)
  3. <ipython-input-12-add73c4f74e0> in <module>
  4. 1 n = tensor([[1., 2, 3], [4,5,6]])
  5. ----> 2 m*n
  6. RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 0

With elementwise arithmetic, we can remove one of our three nested loops: we can multiply the tensors that correspond to the i-th row of a and the j-th column of b before summing all the elements, which will speed things up because the inner loop will now be executed by PyTorch at C speed.

To access one column or row, we can simply write a[i,:] or b[:,j]. The : means take everything in that dimension. We could restrict this and take only a slice of that particular dimension by passing a range, like 1:5, instead of just :. In that case, we would take the elements in columns or rows 1 to 4 (the second number is noninclusive).
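
For example, with a small hypothetical tensor, this is what those indexing forms return:

    x = torch.arange(12).reshape(3, 4)
    x[1, :]      # row 1:    tensor([4, 5, 6, 7])
    x[:, 2]      # column 2: tensor([ 2,  6, 10])
    x[0, 1:3]    # a slice of row 0: tensor([1, 2])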

One simplification is that we can always omit a trailing colon, so a[i,:] can be abbreviated to a[i]. With all of that in mind, we can write a new version of our matrix multiplication:

In [ ]:

def matmul(a,b):
    ar,ac = a.shape
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc): c[i,j] = (a[i] * b[:,j]).sum()
    return c

In [ ]:

  1. %timeit -n 20 t3 = matmul(m1,m2)
  1. 1.7 ms ± 88.1 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)

We’re already ~700 times faster, just by removing that inner for loop! And that’s just the beginning: with broadcasting we can remove another loop and get an even bigger speedup.

Broadcasting

As we discussed in <>, broadcasting is a term introduced by the NumPy library that describes how tensors of different ranks are treated during arithmetic operations. For instance, it’s obvious there is no way to add a 3×3 matrix with a 4×5 matrix, but what if we want to add one scalar (which can be represented as a 1×1 tensor) with a matrix? Or a vector of size 3 with a 3×4 matrix? In both cases, we can find a way to make sense of this operation.

Broadcasting gives specific rules to codify when shapes are compatible when trying to do an elementwise operation, and how the tensor of the smaller shape is expanded to match the tensor of the bigger shape. It’s essential to master those rules if you want to be able to write code that executes quickly. In this section, we’ll expand our previous treatment of broadcasting to understand these rules.

Broadcasting with a scalar

Broadcasting with a scalar is the easiest type of broadcasting. When we have a tensor a and a scalar, we just imagine a tensor of the same shape as a filled with that scalar and perform the operation:

In [ ]:

  1. a = tensor([10., 6, -4])
  2. a > 0

Out[ ]:

  1. tensor([ True, True, False])

How are we able to do this comparison? 0 is being broadcast to have the same dimensions as a. Note that this is done without creating a tensor full of zeros in memory (that would be very inefficient).

This is very useful if you want to normalize your dataset by subtracting the mean (a scalar) from the entire data set (a matrix) and dividing by the standard deviation (another scalar):

In [ ]:

  1. m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
  2. (m - 5) / 2.73

Out[ ]:

  1. tensor([[-1.4652, -1.0989, -0.7326],
  2. [-0.3663, 0.0000, 0.3663],
  3. [ 0.7326, 1.0989, 1.4652]])

What if we have different means for each row of the matrix? In that case you will need to broadcast a vector to a matrix.

Broadcasting a vector to a matrix

We can broadcast a vector to a matrix as follows:

In [ ]:

  1. c = tensor([10.,20,30])
  2. m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
  3. m.shape,c.shape

Out[ ]:

  1. (torch.Size([3, 3]), torch.Size([3]))

In [ ]:

  1. m + c

Out[ ]:

  1. tensor([[11., 22., 33.],
  2. [14., 25., 36.],
  3. [17., 28., 39.]])

Here the elements of c are expanded to make three rows that match, making the operation possible. Again, PyTorch doesn’t actually create three copies of c in memory. This is done by the expand_as method behind the scenes:

In [ ]:

  1. c.expand_as(m)

Out[ ]:

  1. tensor([[10., 20., 30.],
  2. [10., 20., 30.],
  3. [10., 20., 30.]])

If we look at the corresponding tensor, we can ask for its storage property (which shows the actual contents of the memory used for the tensor) to check there is no useless data stored:

In [ ]:

  1. t = c.expand_as(m)
  2. t.storage()

Out[ ]:

  1. 10.0
  2. 20.0
  3. 30.0
  4. [torch.FloatStorage of size 3]

Even though the tensor officially has nine elements, only three scalars are stored in memory. This is possible thanks to the clever trick of giving that dimension a stride of 0 (which means that when PyTorch looks for the next row by adding the stride, it doesn’t move):

In [ ]:

  1. t.stride(), t.shape

Out[ ]:

  1. ((0, 1), torch.Size([3, 3]))

Since m is of size 3×3, there are two ways to do broadcasting. The fact it was done on the last dimension is a convention that comes from the rules of broadcasting and has nothing to do with the way we ordered our tensors. If instead we do this, we get the same result:

In [ ]:

  1. c + m

Out[ ]:

  1. tensor([[11., 22., 33.],
  2. [14., 25., 36.],
  3. [17., 28., 39.]])

In fact, it’s only possible to broadcast a vector of size n with a matrix of size m by n:

In [ ]:

  1. c = tensor([10.,20,30])
  2. m = tensor([[1., 2, 3], [4,5,6]])
  3. c+m

Out[ ]:

  1. tensor([[11., 22., 33.],
  2. [14., 25., 36.]])

This won’t work:

In [ ]:

  1. c = tensor([10.,20])
  2. m = tensor([[1., 2, 3], [4,5,6]])
  3. c+m
  1. ------------------------------------------------------------------------
  2. RuntimeError Traceback (most recent call last)
  3. <ipython-input-25-64bbbad4d99c> in <module>
  4. 1 c = tensor([10.,20])
  5. 2 m = tensor([[1., 2, 3], [4,5,6]])
  6. ----> 3 c+m
  7. RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1

If we want to broadcast in the other dimension, we have to change the shape of our vector to make it a 3×1 matrix. This is done with the unsqueeze method in PyTorch:

In [ ]:

  1. c = tensor([10.,20,30])
  2. m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
  3. c = c.unsqueeze(1)
  4. m.shape,c.shape

Out[ ]:

  1. (torch.Size([3, 3]), torch.Size([3, 1]))

This time, c is expanded on the column side:

In [ ]:

  1. c+m

Out[ ]:

  1. tensor([[11., 12., 13.],
  2. [24., 25., 26.],
  3. [37., 38., 39.]])

Like before, only three scalars are stored in memory:

In [ ]:

  1. t = c.expand_as(m)
  2. t.storage()

Out[ ]:

  1. 10.0
  2. 20.0
  3. 30.0
  4. [torch.FloatStorage of size 3]

And the expanded tensor has the right shape because the column dimension has a stride of 0:

In [ ]:

  1. t.stride(), t.shape

Out[ ]:

  1. ((1, 0), torch.Size([3, 3]))

With broadcasting, if we need to add dimensions, by default they are added at the beginning. When we were broadcasting before, PyTorch was doing c.unsqueeze(0) behind the scenes:

In [ ]:

  1. c = tensor([10.,20,30])
  2. c.shape, c.unsqueeze(0).shape,c.unsqueeze(1).shape

Out[ ]:

  1. (torch.Size([3]), torch.Size([1, 3]), torch.Size([3, 1]))

The unsqueeze command can be replaced by None indexing:

In [ ]:

  1. c.shape, c[None,:].shape,c[:,None].shape

Out[ ]:

  1. (torch.Size([3]), torch.Size([1, 3]), torch.Size([3, 1]))

You can always omit trailing colons, and ... means all preceding dimensions:

In [ ]:

  1. c[None].shape,c[...,None].shape

Out[ ]:

  1. (torch.Size([1, 3]), torch.Size([3, 1]))
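
The ... shortcut becomes more convenient with higher-rank tensors. For instance, with a hypothetical rank-3 tensor:

    x = torch.randn(2, 3, 4)
    x[..., 0].shape, x[None, ...].shape   # (torch.Size([2, 3]), torch.Size([1, 2, 3, 4]))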

With this, we can remove another for loop in our matrix multiplication function. Now, instead of multiplying a[i] with b[:,j], we can multiply a[i] with the whole matrix b using broadcasting, then sum the results:

In [ ]:

def matmul(a,b):
    ar,ac = a.shape
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        # c[i,j] = (a[i,:] * b[:,j]).sum()  # previous version
        c[i] = (a[i].unsqueeze(-1) * b).sum(dim=0)  # a[i] becomes ac x 1 and is broadcast against b (ac x bc)
    return c

In [ ]:

  1. %timeit -n 20 t4 = matmul(m1,m2)
  1. 357 µs ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)

We’re now around 3,200 times faster than our first implementation! Before we move on, let’s discuss the rules of broadcasting in a little more detail.

Broadcasting rules

When operating on two tensors, PyTorch compares their shapes elementwise. It starts with the trailing dimensions and works its way backward, adding 1 when it meets empty dimensions. Two dimensions are compatible when one of the following is true:

  • They are equal.
  • One of them is 1, in which case that dimension is broadcast to make it the same as the other.

Arrays do not need to have the same number of dimensions. For example, if you have a 256×256×3 array of RGB values, and you want to scale each color in the image by a different value, you can multiply the image by a one-dimensional array with three values. Lining up the sizes of the trailing axes of these arrays according to the broadcast rules shows that they are compatible:

  1. Image (3d tensor): 256 x 256 x 3
  2. Scale (1d tensor): (1) (1) 3
  3. Result (3d tensor): 256 x 256 x 3
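
In code, that scaling example looks like this (with made-up data, just to show the shapes):

    image = torch.randn(256, 256, 3)    # height x width x channels
    scale = tensor([0.5, 1.0, 2.0])     # one factor per color channel
    scaled = image * scale              # scale is broadcast to 256 x 256 x 3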

However, a 2D tensor of size 256×256 isn’t compatible with our image:

  1. Image (3d tensor): 256 x 256 x 3
  2. Scale (2d tensor): (1) 256 x 256
  3. Error

In our earlier example with a 3×3 matrix and a vector of size 3, broadcasting was done on the rows:

  1. Matrix (2d tensor): 3 x 3
  2. Vector (1d tensor): (1) 3
  3. Result (2d tensor): 3 x 3

As an exercise, try to determine what dimensions to add (and where) when you need to normalize a batch of images of size 64 x 3 x 256 x 256 with vectors of three elements (one for the mean and one for the standard deviation).
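
(If you want to check your answer afterward, here is a sketch of one possible solution, with made-up statistics. The three-element vectors need to line up with the channel dimension, which is dimension 1, so we add unit dimensions after it; the leading unit dimension is optional, since missing dimensions are added at the front anyway.)

    batch = torch.randn(64, 3, 256, 256)    # batch x channels x height x width
    mean  = tensor([0.4, 0.5, 0.6])          # hypothetical per-channel mean
    std   = tensor([0.2, 0.2, 0.3])          # hypothetical per-channel std
    normed = (batch - mean[None, :, None, None]) / std[None, :, None, None]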

Another useful way of simplifying tensor manipulations is the use of the Einstein summation convention.

Einstein Summation

Before using the PyTorch operation @ or torch.matmul, there is one last way we can implement matrix multiplication: Einstein summation (einsum). This is a compact representation for combining products and sums in a general way. We write an equation like this:

  1. ik,kj -> ij

The lefthand side represents the operands’ dimensions, separated by commas. Here we have two tensors that each have two dimensions (i,k and k,j). The righthand side represents the result dimensions, so here we have a tensor with two dimensions, i,j.

The rules of Einstein summation notation are as follows:

  1. Repeated indices on the left side are implicitly summed over if they are not on the right side.
  2. Each index can appear at most twice on the left side.
  3. The unrepeated indices on the left side must appear on the right side.

So in our example, since k is repeated, we sum over that index. In the end the formula represents the matrix obtained when we put in (i,j) the sum of all the coefficients (i,k) in the first tensor multiplied by the coefficients (k,j) in the second tensor… which is the matrix product! Here is how we can code this in PyTorch:

In [ ]:

  1. def matmul(a,b): return torch.einsum('ik,kj->ij', a, b)
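
A quick way to convince yourself this is the same operation (a sanity check, reusing m1 and m2 from earlier):

    # generous tolerance to allow for float32 accumulation differences
    assert torch.allclose(matmul(m1, m2), m1 @ m2, atol=1e-3)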

Einstein summation is a very practical way of expressing operations involving indexing and sum of products. Note that you can have just one member on the lefthand side. For instance, this:

  1. torch.einsum('ij->ji', a)

returns the transpose of the matrix a. You can also have three or more members. This:

  1. torch.einsum('bi,ij,bj->b', a, b, c)

will return a vector of size b where the k-th coordinate is the sum over i and j of a[k,i] * b[i,j] * c[k,j]. This notation is particularly convenient when you have more dimensions because of batches. For example, if you have two batches of matrices and want to compute the matrix product per batch, you could do it like this:

  1. torch.einsum('bik,bkj->bij', a, b)
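
For instance, with hypothetical batches of random matrices, this should agree with PyTorch’s built-in batched matrix multiply, torch.bmm:

    a = torch.randn(8, 2, 3)
    b = torch.randn(8, 3, 4)
    r = torch.einsum('bik,bkj->bij', a, b)    # per-batch matrix product, shape 8 x 2 x 4
    assert torch.allclose(r, torch.bmm(a, b), atol=1e-5)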

Let’s go back to our new matmul implementation using einsum and look at its speed:

In [ ]:

  1. %timeit -n 20 t5 = matmul(m1,m2)
  1. 68.7 µs ± 4.06 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)

As you can see, not only is it practical, but it’s very fast. einsum is often the fastest way to do custom operations in PyTorch, without diving into C++ and CUDA. (But it’s generally not as fast as carefully optimized CUDA code, as you see from the results in “Matrix Multiplication from Scratch”.)

Now that we know how to implement a matrix multiplication from scratch, we are ready to build our neural net—specifically its forward and backward passes—using just matrix multiplications.