Movie Recommendation with Collaborative Filtering

Author: HUANGCHENGAI
Date: 2021.05
Abstract: This case study implements a collaborative filtering algorithm for movie recommendation with the PaddlePaddle framework.

1. Introduction

This example demonstrates collaborative filtering for movie recommendation on the MovieLens dataset, based on PaddlePaddle 2.1. The MovieLens ratings dataset records the scores that a set of users gave to a set of movies. The goal is to predict the ratings a user would give to movies they have not yet watched; the movies with the highest predicted ratings can then be recommended to that user.

The steps in the model are as follows (a minimal sketch of the scoring step follows the list):

  1. Map each user ID to a "user vector" via an embedding matrix.
  2. Map each movie ID to a "movie vector" via an embedding matrix.
  3. Compute the dot product between the user vector and the movie vector to obtain a match score (the predicted rating) between the user and the movie.
  4. Train the embeddings by gradient descent on all known user-movie pairs.
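As a quick illustration of step 3, the sketch below computes a match score as a plain dot product in NumPy. It is not the Paddle model built later in this notebook; the user_vectors and movie_vectors arrays are random stand-ins for the learned embedding matrices.

    import numpy as np

    rng = np.random.default_rng(0)
    embedding_size = 50                                   # same dimensionality as the model below
    user_vectors = rng.normal(size=(4, embedding_size))   # stand-in for the user embedding matrix
    movie_vectors = rng.normal(size=(6, embedding_size))  # stand-in for the movie embedding matrix

    # Step 3: the match score for (user 2, movie 5) is the dot product of their vectors
    match_score = user_vectors[2] @ movie_vectors[5]
    print(match_score)

In the real model, the two embedding matrices are trained (step 4) so that these dot products reproduce the known ratings.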


2. Environment Setup

This tutorial is written for Paddle 2.1. If your environment does not have this version, please refer to the official website and install Paddle 2.1 first.

    import pandas as pd
    import numpy as np
    import paddle
    import paddle.nn as nn
    from paddle.io import Dataset

    print(paddle.__version__)

    2.1.0

3. Dataset

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens. It contains 100,836 ratings and 3,683 tag applications across 9,742 movies. The data were created by 610 users between March 29, 1996 and September 24, 2018.

The dataset was generated on September 26, 2018, with users selected at random. Every selected user rated at least 20 movies. No demographic information is included; each user is represented only by an ID, with no other information provided. The data are contained in the files links.csv, movies.csv, ratings.csv, and tags.csv.

User IDs

MovieLens users were selected at random.

Movie IDs

Only movies with at least one rating or tag are included in the dataset. These movie IDs are consistent with those used on the MovieLens website.

Ratings data file structure (ratings.csv)

All ratings are contained in the file ratings.csv. Each line after the header row represents one rating of one movie by one user, in the following format: userId,movieId,rating,timestamp

Tags data file structure (tags.csv)

All tags are contained in the file tags.csv. Each line after the header row represents one tag applied to one movie by one user, in the following format: userId,movieId,tag,timestamp

Movies data file structure (movies.csv)

Format: movieId,title,genres

Links data file structure (links.csv)

Format: movieId,imdbId,tmdbId

    !unzip data/data71839/ml-latest-small.zip

    Archive: data/data71839/ml-latest-small.zip
    creating: ml-latest-small/
    inflating: ml-latest-small/links.csv
    inflating: ml-latest-small/tags.csv
    inflating: ml-latest-small/ratings.csv
    inflating: ml-latest-small/README.txt
    inflating: ml-latest-small/movies.csv
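
As an optional sanity check (not part of the original notebook), the extracted files can be loaded with pandas and compared against the schemas described above:

    import pandas as pd

    # Print each file's shape and column names; they should match the formats listed above,
    # e.g. ratings.csv has columns userId, movieId, rating, timestamp and 100,836 rows.
    for name in ["ratings", "tags", "movies", "links"]:
        frame = pd.read_csv("ml-latest-small/{}.csv".format(name))
        print(name, frame.shape, list(frame.columns))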

3.1 Data Processing

Perform some preprocessing to encode users and movies as integer indices.

    df = pd.read_csv('ml-latest-small/ratings.csv')
    user_ids = df["userId"].unique().tolist()
    user2user_encoded = {x: i for i, x in enumerate(user_ids)}
    userencoded2user = {i: x for i, x in enumerate(user_ids)}
    movie_ids = df["movieId"].unique().tolist()
    movie2movie_encoded = {x: i for i, x in enumerate(movie_ids)}
    movie_encoded2movie = {i: x for i, x in enumerate(movie_ids)}
    df["user"] = df["userId"].map(user2user_encoded)
    df["movie"] = df["movieId"].map(movie2movie_encoded)
    num_users = len(user2user_encoded)
    num_movies = len(movie_encoded2movie)
    df["rating"] = df["rating"].values.astype(np.float32)

    # The minimum and maximum ratings will be used later to normalize the ratings
    min_rating = min(df["rating"])
    max_rating = max(df["rating"])

    print(
        "Number of users: {}, Number of Movies: {}, Min rating: {}, Max rating: {}".format(
            num_users, num_movies, min_rating, max_rating
        )
    )

    Number of users: 610, Number of Movies: 9724, Min rating: 0.5, Max rating: 5.0

3.2 Prepare Training and Validation Data

    df = df.sample(frac=1, random_state=42)
    x = df[["user", "movie"]].values
    # Normalize the targets to the range 0-1. Makes training easier.
    y = df["rating"].apply(lambda x: (x - min_rating) / (max_rating - min_rating)).values
    # Train on 90% of the data, validate on the remaining 10%.
    train_indices = int(0.9 * df.shape[0])
    x_train, x_val, y_train, y_val = (
        x[:train_indices],
        x[train_indices:],
        y[:train_indices],
        y[train_indices:],
    )
    y_train = y_train[:, np.newaxis]
    y_val = y_val[:, np.newaxis]
    y_train = y_train.astype(np.float32)
    y_val = y_val.astype(np.float32)

    # Custom dataset
    # A map-style dataset must inherit from paddle.io.Dataset
    class SelfDefinedDataset(Dataset):
        def __init__(self, data_x, data_y, mode='train'):
            super(SelfDefinedDataset, self).__init__()
            self.data_x = data_x
            self.data_y = data_y
            self.mode = mode

        def __getitem__(self, idx):
            if self.mode == 'predict':
                return self.data_x[idx]
            else:
                return self.data_x[idx], self.data_y[idx]

        def __len__(self):
            return len(self.data_x)

    traindataset = SelfDefinedDataset(x_train, y_train)
    for data, label in traindataset:
        print(data.shape, label.shape)
        print(data, label)
        break

    train_loader = paddle.io.DataLoader(traindataset, batch_size=128, shuffle=True)
    for batch_id, data in enumerate(train_loader()):
        x_data = data[0]
        y_data = data[1]
        print(x_data.shape)
        print(y_data.shape)
        break

    testdataset = SelfDefinedDataset(x_val, y_val)
    test_loader = paddle.io.DataLoader(testdataset, batch_size=128, shuffle=True)
    for batch_id, data in enumerate(test_loader()):
        x_data = data[0]
        y_data = data[1]
        print(x_data.shape)
        print(y_data.shape)
        break

    (2,) (1,)
    [ 431 4730] [0.8888889]
    [128, 2]
    [128, 1]
    [128, 2]
    [128, 1]

4. Model Architecture

Embed both users and movies in 50-dimensional vectors.

The model computes a match score between the user and movie embeddings and adds a per-user and per-movie bias. The match score is scaled to the interval [0, 1] via a sigmoid.

    EMBEDDING_SIZE = 50

    class RecommenderNet(nn.Layer):
        def __init__(self, num_users, num_movies, embedding_size):
            super(RecommenderNet, self).__init__()
            self.num_users = num_users
            self.num_movies = num_movies
            self.embedding_size = embedding_size
            # User embedding and per-user bias
            weight_attr_user = paddle.ParamAttr(
                regularizer=paddle.regularizer.L2Decay(1e-6),
                initializer=nn.initializer.KaimingNormal()
            )
            self.user_embedding = nn.Embedding(
                num_users,
                embedding_size,
                weight_attr=weight_attr_user
            )
            self.user_bias = nn.Embedding(num_users, 1)
            # Movie embedding and per-movie bias
            weight_attr_movie = paddle.ParamAttr(
                regularizer=paddle.regularizer.L2Decay(1e-6),
                initializer=nn.initializer.KaimingNormal()
            )
            self.movie_embedding = nn.Embedding(
                num_movies,
                embedding_size,
                weight_attr=weight_attr_movie
            )
            self.movie_bias = nn.Embedding(num_movies, 1)

        def forward(self, inputs):
            # inputs[:, 0] are user indices, inputs[:, 1] are movie indices
            user_vector = self.user_embedding(inputs[:, 0])
            user_bias = self.user_bias(inputs[:, 0])
            movie_vector = self.movie_embedding(inputs[:, 1])
            movie_bias = self.movie_bias(inputs[:, 1])
            # Dot product of the two vectors plus both biases, squashed to [0, 1]
            dot_user_movie = paddle.dot(user_vector, movie_vector)
            x = dot_user_movie + user_bias + movie_bias
            x = nn.functional.sigmoid(x)
            return x

5. Model Training

The loss curve can be monitored with VisualDL while training runs in the background (a launch command is noted after the training log below).

    model = RecommenderNet(num_users, num_movies, EMBEDDING_SIZE)

    model = paddle.Model(model)
    optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=0.0003)
    loss = nn.BCELoss()
    metric = paddle.metric.Accuracy()

    # Set the VisualDL log path
    log_dir = './visualdl'
    callback = paddle.callbacks.VisualDL(log_dir=log_dir)

    model.prepare(optimizer, loss, metric)
    model.fit(train_loader, epochs=5, save_dir='./checkpoints', verbose=1, callbacks=callback)
    The loss value printed in the log is the current step, and the metric is the average value of previous steps.
    Epoch 1/5
    step 709/709 [==============================] - loss: 0.6729 - acc: 0.8687 - 3ms/step
    save checkpoint at /home/aistudio/checkpoints/0
    Epoch 2/5
    step 709/709 [==============================] - loss: 0.6535 - acc: 0.8687 - 3ms/step
    save checkpoint at /home/aistudio/checkpoints/1
    ...
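
To view the loss curves mentioned above, the VisualDL service can be launched against the log directory configured in the training code; the port below is an arbitrary choice:

    # Run in a terminal (or as a notebook shell command), then open http://localhost:8040 in a browser
    !visualdl --logdir ./visualdl --port 8040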

6. Model Evaluation

    model.evaluate(test_loader, batch_size=64, verbose=1)

    Eval begin...
    The loss value printed in the log is the current batch, and the metric is the average value of previous step.
    step 79/79 [==============================] - loss: 0.5982 - acc: 0.8713 - 3ms/step
    Eval samples: 10084
    {'loss': [0.5982282], 'acc': 0.8712812376041253}

7. Model Prediction

    movie_df = pd.read_csv('ml-latest-small/movies.csv')

    # Pick a user and look at their recommended movies
    user_id = df.userId.sample(1).iloc[0]
    movies_watched_by_user = df[df.userId == user_id]
    movies_not_watched = movie_df[
        ~movie_df["movieId"].isin(movies_watched_by_user.movieId.values)
    ]["movieId"]
    movies_not_watched = list(
        set(movies_not_watched).intersection(set(movie2movie_encoded.keys()))
    )
    movies_not_watched = [[movie2movie_encoded.get(x)] for x in movies_not_watched]
    user_encoder = user2user_encoded.get(user_id)
    user_movie_array = np.hstack(
        ([[user_encoder]] * len(movies_not_watched), movies_not_watched)
    )

    testdataset = SelfDefinedDataset(user_movie_array, user_movie_array, mode='predict')
    test_loader = paddle.io.DataLoader(testdataset, batch_size=9703, shuffle=False, return_list=True)

    ratings = model.predict(test_loader)
    ratings = np.array(ratings)
    ratings = np.squeeze(ratings, 0)
    ratings = np.squeeze(ratings, 2)
    ratings = np.squeeze(ratings, 0)
    top_ratings_indices = ratings.argsort()[::-1][0:10]
    print(top_ratings_indices)
    recommended_movie_ids = [
        movie_encoded2movie.get(movies_not_watched[x][0]) for x in top_ratings_indices
    ]

    print("User ID: {}".format(user_id))
    print("====" * 8)
    print("Movies with high ratings from the user:")
    print("----" * 8)
    top_movies_user = (
        movies_watched_by_user.sort_values(by="rating", ascending=False)
        .head(5)
        .movieId.values
    )
    movie_df_rows = movie_df[movie_df["movieId"].isin(top_movies_user)]
    for row in movie_df_rows.itertuples():
        print(row.title, ":", row.genres)
    print("----" * 8)
    print("Top 10 movie recommendations for the user:")
    print("----" * 8)
    recommended_movies = movie_df[movie_df["movieId"].isin(recommended_movie_ids)]
    for row in recommended_movies.itertuples():
        print(row.title, ":", row.genres)
    Predict begin...
    step 1/1 [==============================] - 17ms/step
    Predict samples: 9492
    [ 280 261 318 43 230 472 2393 8253 964 1874]
    User ID: 594
    ================================
    Movies with high ratings from the user:
    --------------------------------
    Demolition Man (1993) : Action|Adventure|Sci-Fi
    Executive Decision (1996) : Action|Adventure|Thriller
    Matrix, The (1999) : Action|Sci-Fi|Thriller
    Bruce Almighty (2003) : Comedy|Drama|Fantasy|Romance
    Chasing Liberty (2004) : Comedy|Romance
    --------------------------------
    Top 10 movie recommendations for the user:
    --------------------------------
    Usual Suspects, The (1995) : Crime|Mystery|Thriller
    Star Wars: Episode IV - A New Hope (1977) : Action|Adventure|Sci-Fi
    Pulp Fiction (1994) : Comedy|Crime|Drama|Thriller
    Shawshank Redemption, The (1994) : Crime|Drama
    Forrest Gump (1994) : Comedy|Drama|Romance|War
    Schindler's List (1993) : Drama|War
    Star Wars: Episode V - The Empire Strikes Back (1980) : Action|Adventure|Sci-Fi
    American History X (1998) : Crime|Drama
    Fight Club (1999) : Action|Crime|Drama|Thriller
    Dark Knight, The (2008) : Action|Crime|Drama|IMAX
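
The scores predicted by the model are in the normalized [0, 1] range used for the training targets. If actual star ratings are wanted, the normalization from section 3.2 can be inverted. A minimal sketch (not part of the original notebook), reusing ratings, top_ratings_indices, min_rating, and max_rating from the cells above:

    # Map normalized predictions back onto the original 0.5-5.0 star scale
    predicted_stars = ratings * (max_rating - min_rating) + min_rating
    # Predicted star ratings for the 10 recommended movies
    print(predicted_stars[top_ratings_indices])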