Interpreting Embeddings and Biases

Our model is already useful, in that it can provide us with movie recommendations for our users—but it is also interesting to see what parameters it has discovered. The easiest to interpret are the biases. Here are the movies with the lowest values in the bias vector:

In [ ]:

    movie_bias = learn.model.movie_bias.squeeze()
    idxs = movie_bias.argsort()[:5]
    [dls.classes['title'][i] for i in idxs]

Out[ ]:

    ['Children of the Corn: The Gathering (1996)',
     'Lawnmower Man 2: Beyond Cyberspace (1996)',
     'Beautician and the Beast, The (1997)',
     'Crow: City of Angels, The (1996)',
     'Home Alone 3 (1997)']

Think about what this means. What it’s saying is that for each of these movies, even when a user is very well matched to its latent factors (which, as we will see in a moment, tend to represent things like level of action, age of movie, and so forth), they still generally don’t like it. We could have simply sorted the movies directly by their average rating, but looking at the learned bias tells us something much more interesting. It tells us not just whether a movie is of a kind that people tend not to enjoy watching, but that people tend not to like watching it even if it is of a kind that they would otherwise enjoy! By the same token, here are the movies with the highest bias:

In [ ]:

    idxs = movie_bias.argsort(descending=True)[:5]
    [dls.classes['title'][i] for i in idxs]

Out[ ]:

    ['L.A. Confidential (1997)',
     'Titanic (1997)',
     'Silence of the Lambs, The (1991)',
     'Shawshank Redemption, The (1994)',
     'Star Wars (1977)']

So, for instance, even if you don’t normally enjoy detective movies, you might enjoy LA Confidential!
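
To see how this differs from simply ranking by popularity, you can compute a plain average-rating ranking directly from the ratings table. Here is a minimal sketch, assuming the ratings DataFrame used earlier in the chapter (with title and rating columns):

    # Naive ranking: mean rating per movie, lowest first. Unlike the learned
    # bias, this ignores which kinds of users happened to rate each movie.
    mean_ratings = ratings.groupby('title')['rating'].mean()
    mean_ratings.sort_values().head(5)

Where this list and the bias-sorted one disagree, the difference is the part of a movie's appeal that the latent factors, rather than the bias, are accounting for.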

It is not quite so easy to directly interpret the embedding matrices. There are just too many factors for a human to look at. But there is a technique that can pull out the most important underlying directions in such a matrix, called principal component analysis (PCA). We will not be going into this in detail in this book, because it is not particularly important for you to understand to be a deep learning practitioner, but if you are interested then we suggest you check out the fast.ai course Computational Linear Algebra for Coders. The chart below shows what our movies look like based on two of the strongest PCA components.

In [ ]:

    #hide_input
    #id img_pca_movie
    #caption Representation of movies based on two strongest PCA components
    #alt Representation of movies based on two strongest PCA components
    # Take the 1,000 most-rated movies and look up their learned latent factors
    g = ratings.groupby('title')['rating'].count()
    top_movies = g.sort_values(ascending=False).index.values[:1000]
    top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])
    movie_w = learn.model.movie_factors[top_idxs].cpu().detach()
    # Project the 50 latent factors down to their 3 strongest principal components
    movie_pca = movie_w.pca(3)
    fac0,fac1,fac2 = movie_pca.t()
    # Plot the first 50 of those movies along two of the components
    idxs = list(range(50))
    X = fac0[idxs]
    Y = fac2[idxs]
    plt.figure(figsize=(12,12))
    plt.scatter(X, Y)
    for i, x, y in zip(top_movies[idxs], X, Y):
        plt.text(x,y,i, color=np.random.rand(3)*0.7, fontsize=11)
    plt.show()

Figure: Representation of movies based on two strongest PCA components

We can see here that the model seems to have discovered a concept of classic versus pop culture movies, or perhaps it is critical acclaim that is represented here.
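
If you are curious how that projection could be reproduced without fastai's Tensor.pca helper, plain PyTorch offers torch.pca_lowrank. Here is a rough sketch, assuming movie_w from the cell above (the signs and scaling of the components may differ slightly from the plot):

    import torch

    # Project the 50-dimensional movie factors onto their first three
    # principal components, roughly what movie_w.pca(3) computes.
    U, S, V = torch.pca_lowrank(movie_w, q=3)
    movie_pca = (movie_w - movie_w.mean(0)) @ V[:, :3]
    fac0, fac1, fac2 = movie_pca.t()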

J: No matter how many models I train, I never stop getting moved and surprised by how these randomly initialized bunches of numbers, trained with such simple mechanics, manage to discover things about my data all by themselves. It almost seems like cheating, that I can create code that does useful things without ever actually telling it how to do those things!

We defined our model from scratch to teach you what is inside, but you can directly use the fastai library to build it. We’ll look at how to do that next.

Using fastai.collab

We can create and train a collaborative filtering model using the exact structure shown earlier by using fastai’s collab_learner:

In [ ]:

    learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))

In [ ]:

    learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch   train_loss   valid_loss   time
0       0.931751     0.953806     00:13
1       0.851826     0.878119     00:13
2       0.715254     0.834711     00:13
3       0.583173     0.821470     00:13
4       0.496625     0.821688     00:13

The names of the layers can be seen by printing the model:

In [ ]:

    learn.model

Out[ ]:

    EmbeddingDotBias(
      (u_weight): Embedding(944, 50)
      (i_weight): Embedding(1635, 50)
      (u_bias): Embedding(944, 1)
      (i_bias): Embedding(1635, 1)
    )

We can use these to replicate any of the analyses we did in the previous section—for instance:

In [ ]:

    movie_bias = learn.model.i_bias.weight.squeeze()
    idxs = movie_bias.argsort(descending=True)[:5]
    [dls.classes['title'][i] for i in idxs]

Out[ ]:

    ['Titanic (1997)',
     "Schindler's List (1993)",
     'Shawshank Redemption, The (1994)',
     'L.A. Confidential (1997)',
     'Silence of the Lambs, The (1991)']

Another interesting thing we can do with these learned embeddings is to look at distance.

Embedding Distance

On a two-dimensional map we can calculate the distance between two coordinates using the formula of Pythagoras: $\sqrt{x^{2}+y^{2}}$ (assuming that x and y are the distances between the coordinates on each axis). For a 50-dimensional embedding we can do exactly the same thing, except that we add up the squares of all 50 of the coordinate distances.
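
As a quick illustration, here is the same calculation in code, using two hypothetical 50-dimensional embedding vectors:

    import torch

    # Two made-up 50-dimensional vectors standing in for movie embeddings
    a, b = torch.randn(50), torch.randn(50)
    # Sum the squared differences across all 50 coordinates, then take the root
    dist = ((a - b)**2).sum().sqrt()   # equivalent to (a - b).norm()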

If there were two movies that were nearly identical, then their embedding vectors would also have to be nearly identical, because the users that would like them would be nearly exactly the same. There is a more general idea here: movie similarity can be defined by the similarity of users that like those movies. And that directly means that the distance between two movies’ embedding vectors can define that similarity. We can use this to find the most similar movie to Silence of the Lambs:

In [ ]:

    movie_factors = learn.model.i_weight.weight
    idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
    distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
    idx = distances.argsort(descending=True)[1]
    dls.classes['title'][idx]

Out[ ]:

    'Dial M for Murder (1954)'
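
Note that the cell above measures similarity with cosine similarity, which compares the direction of the embedding vectors rather than the straight-line distance described earlier. If you would like to try the Euclidean version instead, here is a minimal sketch reusing movie_factors (the result may or may not match the cosine-based answer):

    # Euclidean distance from every movie to Silence of the Lambs; smaller means
    # more similar, so skip position 0 (the movie itself) and take the next one.
    idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
    dists = (movie_factors - movie_factors[idx][None]).pow(2).sum(dim=1).sqrt()
    dls.classes['title'][dists.argsort()[1]]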

Now that we have successfully trained a model, let’s see how to deal with the situation where we have no data for a user. How can we make recommendations to new users?