Training Word Embeddings with an N-Gram Model on the Shakespeare Corpus

Author: PaddlePaddle
Date: 2021.05
Abstract: An N-gram, a concept from computational linguistics and probability theory, is a contiguous sequence of N items from a given piece of text. For N=1 the N-gram is called a unigram, for N=2 a bigram, for N=3 a trigram, and so on. In practice, bigrams and trigrams are the most common choices. This example implements a trigram model on the Shakespeare corpus.
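To make the definition concrete, here is a small sketch (the sample sentence is purely illustrative) that extracts unigrams, bigrams, and trigrams from a line of text:

```python
# Build N-grams from a token list: every window of N consecutive tokens
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'to be or not to be'.split()
print(ngrams(tokens, 1))  # unigrams: ('to',), ('be',), ...
print(ngrams(tokens, 2))  # bigrams:  ('to', 'be'), ('be', 'or'), ...
print(ngrams(tokens, 3))  # trigrams: ('to', 'be', 'or'), ...
```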

1. Environment Setup

This tutorial is written for Paddle 2.1. If your environment has a different version, please install Paddle 2.1 first by following the official installation guide.

```python
import paddle
paddle.__version__
```

```
'2.1.0'
```

2. Dataset and Parameters

2.1 Downloading the Dataset

The training data is the Shakespeare corpus, downloaded below and saved as a .txt file.
context_size is set to 2, which makes this a trigram model; embedding_dim is set to 256.

```python
!wget https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
```

```
--2021-05-18 16:44:36--  https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
Resolving ocw.mit.edu (ocw.mit.edu)... 151.101.110.133
Connecting to ocw.mit.edu (ocw.mit.edu)|151.101.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5458199 (5.2M) [text/plain]
Saving to: t8.shakespeare.txt

t8.shakespeare.txt  100%[===================>]   5.21M  47.6KB/s    in 1m 50s

2021-05-18 16:46:27 (48.5 KB/s) - t8.shakespeare.txt saved [5458199/5458199]
```
```python
# Path to the downloaded file
path_to_file = './t8.shakespeare.txt'
test_sentence = open(path_to_file, 'rb').read().decode(encoding='utf-8')

# The text length is the number of characters in the text
print('Length of text: {} characters'.format(len(test_sentence)))
```

```
Length of text: 5458199 characters
```

2.2 Data Preprocessing

Because punctuation carries no real meaning here, use punctuation from the string library to strip the English punctuation marks.

```python
from string import punctuation

# Map every punctuation character to the empty string
process_dicts = {i: '' for i in punctuation}
print(process_dicts)

punc_table = str.maketrans(process_dicts)
test_sentence = test_sentence.translate(punc_table)
```

```
{'!': '', '"': '', '#': '', '$': '', '%': '', '&': '', "'": '', '(': '', ')': '', '*': '', '+': '', ',': '', '-': '', '.': '', '/': '', ':': '', ';': '', '<': '', '=': '', '>': '', '?': '', '@': '', '[': '', '\\': '', ']': '', '^': '', '_': '', '`': '', '{': '', '|': '', '}': '', '~': ''}
```

The long tail of the vocabulary slows down training and hurts accuracy, so only the 2,500 most frequent words are kept as the vocabulary; any word outside it is later mapped to the <pad> token (index 0).

```python
test_sentence_list = test_sentence.lower().split()

# Count the frequency of every word
word_dict_count = {}
for word in test_sentence_list:
    word_dict_count[word] = word_dict_count.get(word, 0) + 1

# Sort words by frequency and keep the top 2500 as the vocabulary
word_list = []
sorted_word_list = sorted(word_dict_count.items(), key=lambda x: x[1], reverse=True)
for key in sorted_word_list:
    word_list.append(key[0])
word_list = word_list[:2500]
print(len(word_list))
```

```
2500
```
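As a quick sanity check (purely illustrative), you can peek at the head of the vocabulary:

```python
# The most frequent entries should be common function words ("the", "and", ...)
print(word_list[:10])
```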

2.3 Model Parameters

Set the parameters commonly used for model training.

```python
# Hyperparameters
hidden_size = 1024               # width of the hidden Linear layer
embedding_dim = 256              # embedding dimension
batch_size = 256                 # batch size
context_size = 2                 # context length: 2 preceding words (trigram)
vocab_size = len(word_list) + 1  # vocabulary size (+1 for the <pad> token)
epochs = 2                       # number of training epochs
```

3. Data Loading

3.1 Data Format

The text is split into tuples of the form (('first word', 'second word'), 'third word'), where the third word is the prediction target.

```python
# Build (context, target) samples: every two consecutive words predict the third
trigram = [[[test_sentence_list[i], test_sentence_list[i + 1]], test_sentence_list[i + 2]]
           for i in range(len(test_sentence_list) - 2)]

# Word <-> index mappings; index 0 is reserved for <pad> / out-of-vocabulary words
word_to_idx = {word: i + 1 for i, word in enumerate(word_list)}
word_to_idx['<pad>'] = 0
idx_to_word = {word_to_idx[word]: word for word in word_to_idx}

# Take a look at the dataset
print(trigram[:3])
```

```
[[['this', 'is'], 'the'], [['is', 'the'], '100th'], [['the', '100th'], 'etext']]
```
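For example (illustrative), this is how a single sample maps to the indices the model will actually see:

```python
# Convert one (context, target) sample to indices; unseen words fall back to 0 (<pad>)
context, target = trigram[0]
print([word_to_idx.get(w, 0) for w in context], '->', word_to_idx.get(target, 0))
```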

3.2 Building a Dataset Class to Load the Data

Build the dataset with paddle.io.Dataset, then pass it to paddle.io.DataLoader to complete data loading.

```python
import numpy as np

class TrainDataset(paddle.io.Dataset):
    def __init__(self, tuple_data):
        self.tuple_data = tuple_data

    def __getitem__(self, idx):
        data = self.tuple_data[idx][0]
        label = self.tuple_data[idx][1]
        # Map words to indices; out-of-vocabulary words fall back to 0 (<pad>)
        data = np.array(list(map(lambda word: word_to_idx.get(word, 0), data)))
        label = np.array(word_to_idx.get(label, 0))
        return data, label

    def __len__(self):
        return len(self.tuple_data)

train_dataset = TrainDataset(trigram)

# Load the data
train_loader = paddle.io.DataLoader(train_dataset, return_list=True, shuffle=True,
                                    batch_size=batch_size, drop_last=True)
```
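As a quick shape check (a minimal sketch), you can pull one batch from the loader and confirm that the contexts come out as [batch_size, context_size]:

```python
# Fetch a single batch and inspect its shapes (illustrative check)
x_batch, y_batch = next(iter(train_loader()))
print(x_batch.shape, y_batch.shape)  # contexts: [256, 2]; labels: one index per sample
```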

4. Building the Model

The network is built with Paddle's dynamic graph API. The trigram model consists of one Embedding layer and two Linear layers: the Embedding layer embeds the two input words, and the result is fed through the two Linear layers for feature extraction.

```python
import paddle.nn.functional as F

class NGramModel(paddle.nn.Layer):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramModel, self).__init__()
        self.embedding = paddle.nn.Embedding(num_embeddings=vocab_size,
                                             embedding_dim=embedding_dim)
        self.linear1 = paddle.nn.Linear(context_size * embedding_dim, hidden_size)
        self.linear2 = paddle.nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        # x: [batch_size, context_size] word indices
        x = self.embedding(x)  # -> [batch_size, context_size, embedding_dim]
        # Concatenate the context embeddings into one feature vector
        x = paddle.reshape(x, [-1, context_size * embedding_dim])
        x = self.linear1(x)
        x = F.relu(x)
        x = self.linear2(x)    # -> [batch_size, vocab_size] logits
        return x
```
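As a quick sanity check (a minimal sketch; the random batch is illustrative), a dummy batch of word indices should produce logits of shape [batch_size, vocab_size]:

```python
# Run a dummy batch of 4 contexts through an untrained model (illustrative)
dummy = paddle.randint(low=0, high=vocab_size, shape=[4, context_size])
print(NGramModel(vocab_size, embedding_dim, context_size)(dummy).shape)  # [4, vocab_size]
```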

5. Approach 1: Training and Prediction with the High-Level API

5.1 Custom Callback

During training, you sometimes want to plot the loss curve to help tune hyperparameters. To save the loss of every batch, define a custom Callback that records the loss information during training, as follows:

```python
# A custom Callback must inherit from the base class paddle.callbacks.Callback
class LossCallback(paddle.callbacks.Callback):
    def __init__(self):
        self.losses = []

    def on_train_begin(self, logs={}):
        # Reset losses before fit() starts; it stores the loss of every batch
        self.losses = []

    def on_train_batch_end(self, step, logs={}):
        # Called after each batch finishes training; append the current loss
        self.losses.append(logs.get('loss'))

loss_log = LossCallback()
```

5.2 Model Training

With the network and the custom Callback in place, wrap the model with Model; training can then be started via Model.prepare() and Model.fit().

```python
# Wrap NGramModel with the high-level Model API
n_gram_model = paddle.Model(NGramModel(vocab_size, embedding_dim, context_size))

# Configure the model
n_gram_model.prepare(optimizer=paddle.optimizer.Adam(learning_rate=0.01,
                                                     parameters=n_gram_model.parameters()),
                     loss=paddle.nn.CrossEntropyLoss())

# Train the model
n_gram_model.fit(train_loader,
                 epochs=epochs,
                 batch_size=batch_size,
                 callbacks=[loss_log],
                 verbose=1)
```

```
The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/2
step 3519/3519 [==============================] - loss: 5.0316 - 79ms/step
Epoch 2/2
step 3519/3519 [==============================] - loss: 5.1612 - 79ms/step
```
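The wrapped model can also make predictions directly. Below is a minimal sketch (the chosen context is illustrative) using Model.predict_batch on a single context:

```python
import numpy as np

# Predict the next word for the last context in the corpus (illustrative)
context = trigram[-1][0]
x = np.array([[word_to_idx.get(w, 0) for w in context]], dtype='int64')
logits = n_gram_model.predict_batch(x)[0]  # one numpy output per model head
print(context, '->', idx_to_word[int(np.argmax(logits))])
```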

5.3 Visualizing the Loss

Use matplotlib to visualize the loss.

```python
# Visualize the loss, sampling every 500th batch
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline

log_loss = [loss_log.losses[i] for i in range(0, len(loss_log.losses), 500)]
plt.figure()
plt.plot(log_loss)
```

```
[<matplotlib.lines.Line2D at 0x7f2bd8598050>]
```

(Figure: loss curve of the high-level API training, sampled every 500 batches)

6. Approach 2: Training and Prediction with the Basic API

6.1 Custom train Function

Define a custom train function with the basic API to train the model.

```python
import paddle.nn.functional as F

losses = []
def train(model):
    model.train()
    optim = paddle.optimizer.Adam(learning_rate=0.01, parameters=model.parameters())
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = data[1]
            # Forward pass and loss
            predicts = model(x_data)
            loss = F.cross_entropy(predicts, y_data)
            loss.backward()
            # Record and print the loss every 500 batches
            if batch_id % 500 == 0:
                losses.append(loss.numpy())
                print("epoch: {}, batch_id: {}, loss is: {}".format(epoch, batch_id, loss.numpy()))
            optim.step()
            optim.clear_grad()

model = NGramModel(vocab_size, embedding_dim, context_size)
train(model)
```

```
epoch: 0, batch_id: 0, loss is: [7.825837]
epoch: 0, batch_id: 500, loss is: [5.1986523]
epoch: 0, batch_id: 1000, loss is: [5.179163]
epoch: 0, batch_id: 1500, loss is: [5.160289]
epoch: 0, batch_id: 2000, loss is: [5.082153]
epoch: 0, batch_id: 2500, loss is: [5.36201]
epoch: 0, batch_id: 3000, loss is: [5.469225]
epoch: 0, batch_id: 3500, loss is: [5.142579]
epoch: 1, batch_id: 0, loss is: [5.016885]
epoch: 1, batch_id: 500, loss is: [5.217623]
epoch: 1, batch_id: 1000, loss is: [5.1326265]
epoch: 1, batch_id: 1500, loss is: [5.1721525]
epoch: 1, batch_id: 2000, loss is: [5.0461006]
epoch: 1, batch_id: 2500, loss is: [5.3661375]
epoch: 1, batch_id: 3000, loss is: [5.2548814]
epoch: 1, batch_id: 3500, loss is: [5.223076]
```

6.2 Visualizing the Loss

Plotting the loss curve shows how the training went.

```python
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline

plt.figure()
plt.plot(losses)
```

```
[<matplotlib.lines.Line2D at 0x7f2bd11d9a10>]
```

(Figure: loss curve of the basic API training, recorded every 500 batches)

6.3 Prediction

Use the trained model to make a prediction.

```python
import random

def test(model):
    model.eval()
    # Randomly pick one sample from the last 10 trigrams
    idx = random.randint(len(trigram) - 10, len(trigram) - 1)
    print('the input words is: ' + trigram[idx][0][0] + ', ' + trigram[idx][0][1])
    x_data = list(map(lambda word: word_to_idx.get(word, 0), trigram[idx][0]))
    x_data = paddle.to_tensor(np.array(x_data))
    # Forward pass, then take the index of the highest-scoring word
    predicts = model(x_data)
    predicts = predicts.numpy().tolist()[0]
    predicts = predicts.index(max(predicts))
    print('the predict words is: ' + idx_to_word[predicts])
    y_data = trigram[idx][1]
    print('the true words is: ' + y_data)

test(model)
```

```
the input words is: works, of
the predict words is: william
the true words is: william
```
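Since the goal of this tutorial is the word embeddings themselves, they can be read off the trained model. Below is a minimal sketch (the cosine_sim helper and the word pair are illustrative, not part of the original):

```python
# Extract the learned embedding matrix: shape [vocab_size, embedding_dim]
embedding_matrix = model.embedding.weight.numpy()

# Illustrative helper: cosine similarity between the embeddings of two words
def cosine_sim(w1, w2):
    v1 = embedding_matrix[word_to_idx.get(w1, 0)]
    v2 = embedding_matrix[word_to_idx.get(w2, 0)]
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8))

print(cosine_sim('king', 'lord'))
```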