The Problem with Vanilla RNNs (or Elman RNNs)


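The first problem with Elman RNNs is that they struggle to retain information over long ranges. With the RNN in Chapter 6, for example, we simply update the hidden state vector at every time step, regardless of whether the update makes sense. As a consequence, the RNN has no control over which values are retained and which are discarded in the hidden state; that is determined entirely by the input. Intuitively, this doesn't make sense. What we want is some way for the RNN to decide whether an update is optional, whether it happens at all, by how much, and to which parts of the state vector.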

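The second problem with Elman RNNs is their tendency to cause gradients to spiral out of control, toward zero or toward infinity. Such unstable gradients are called vanishing gradients or exploding gradients, depending on whether the absolute values of the gradients are shrinking or growing. A gradient whose absolute value is very large or very small (much less than 1) makes the optimization procedure unstable (Hochreiter et al., 2001; Pascanu et al., 2013).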
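The effect of repeated multiplication can be seen with plain arithmetic. The following is a minimal sketch, not the actual backpropagation computation: it assumes the gradient is scaled by one constant factor per time step, and the function name and the factors 0.9 and 1.1 are made up for illustration.

```python
def repeated_scale(factor, steps):
    """Return the magnitude of a unit gradient after `steps` multiplications."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad


# Illustrative factors only: slightly below and slightly above 1
vanishing = repeated_scale(0.9, 100)   # shrinks toward zero
exploding = repeated_scale(1.1, 100)   # grows without bound
print(f"factor 0.9 -> {vanishing:.2e}, factor 1.1 -> {exploding:.2e}")
```

Even factors close to 1 move the gradient by several orders of magnitude over 100 steps, which is why long sequences are hard for ungated RNNs.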


Gating as a Solution to a Vanilla RNN’s Challenges

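To understand gating intuitively, suppose that you are adding two quantities, a and b, but you want to control how much of b gets into the sum. Mathematically, you can rewrite the sum a + b as a + λb, where λ is a value between 0 and 1. If λ = 0, there is no contribution from b, and if λ = 1, b contributes fully. Seen this way, λ acts as a "switch" or a "gate" that controls how much of b gets into the sum. This is the intuition behind the gating mechanism. Now let's revisit the Elman RNN and see how gating can be incorporated into vanilla RNNs to make conditional updates. If the previous hidden state is $h_{t-1}$ and the current input is $x_t$, the recurrent update in an Elman RNN looks something like

$h_t = h_{t-1} + F(h_{t-1}, x_t)$
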
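where F is the recurrent computation of the RNN. Obviously, this is an unconditioned sum and has the drawbacks described in "The Problem with Vanilla RNNs (or Elman RNNs)". Now imagine that, instead of a constant, the λ in the previous example is a function of the previous hidden state vector $h_{t-1}$ and the current input $x_t$, and that it still produces the desired gating behavior; that is, a value between 0 and 1. With this gating function, our RNN update equation becomes:

$h_t = h_{t-1} + \lambda(h_{t-1}, x_t)\, F(h_{t-1}, x_t)$
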
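In the case of long short-term memory (LSTM), this basic intuition is carefully extended to incorporate not only conditional updates but also intentional forgetting of the values in the previous hidden state $h_{t-1}$. This "forgetting" happens by multiplying the previous hidden state $h_{t-1}$ by another function, μ, that also produces values between 0 and 1 and depends on the current input:

$h_t = \mu(h_{t-1}, x_t)\, h_{t-1} + \lambda(h_{t-1}, x_t)\, F(h_{t-1}, x_t)$
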
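As you might have guessed, μ is another gating function. In an actual LSTM description this becomes complicated, because the gating functions are parameterized, leading to a sequence of operations that can seem complex to the uninitiated. Armed with the intuition in this section, however, you are now ready to dive into the update mechanics of the LSTM if you want to. We recommend the classic article by Christopher Olah. We will not cover those details in this book, because they are not essential for applying and using LSTMs in NLP applications.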
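The gated update described in this section can be sketched in a few lines of PyTorch. This is an illustration of the intuition only, not the actual LSTM or GRU equations: it assumes, hypothetically, that F, λ, and μ are each a linear layer over the concatenation of the previous hidden state and the current input, and all names and sizes are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size, input_size = 4, 3

# Hypothetical parameterization: one linear layer each for F, lambda, and mu.
# Real LSTM/GRU cells use a different, more involved set of gates.
candidate = nn.Linear(hidden_size + input_size, hidden_size)    # F
update_gate = nn.Linear(hidden_size + input_size, hidden_size)  # lambda
forget_gate = nn.Linear(hidden_size + input_size, hidden_size)  # mu


def gated_step(h_prev, x_t):
    """h_t = mu * h_prev + lambda * F(h_prev, x_t), with gates in (0, 1)."""
    hx = torch.cat([h_prev, x_t], dim=1)
    lam = torch.sigmoid(update_gate(hx))   # how much of the update to let in
    mu = torch.sigmoid(forget_gate(hx))    # how much of the old state to keep
    return mu * h_prev + lam * torch.tanh(candidate(hx))


h = torch.zeros(2, hidden_size)            # batch of 2
for x_t in torch.randn(5, 2, input_size):  # 5 time steps
    h = gated_step(h, x_t)
print(h.shape)
```

The sigmoid is what keeps each gate between 0 and 1, so every element of the hidden state gets its own learned "switch".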

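The LSTM is only one of many gated variants of the RNN. Another gated variant that is becoming increasingly popular is the gated recurrent unit (GRU; Chung et al., 2015). Fortunately, in PyTorch you can simply replace nn.RNN or nn.RNNCell with nn.LSTM or nn.LSTMCell, with no other code changes, to switch to an LSTM (mutatis mutandis for the GRU)!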
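A quick way to see how interchangeable these modules are is to run the same input through each of them. The sketch below uses arbitrary toy sizes. One difference worth knowing: nn.LSTM returns its final state as a (hidden, cell) tuple, so code that uses the final state, rather than only the per-step outputs, needs a small adjustment.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 7, 8)  # (batch, seq_len, input_size); sizes are arbitrary

for rnn_class in (nn.RNN, nn.GRU, nn.LSTM):
    rnn = rnn_class(input_size=8, hidden_size=16, batch_first=True)
    outputs, final_state = rnn(x)
    # nn.LSTM returns a (hidden, cell) tuple as its final state;
    # nn.RNN and nn.GRU return a single tensor.
    state_kind = "tuple (h, c)" if isinstance(final_state, tuple) else "tensor"
    print(rnn_class.__name__, tuple(outputs.shape), state_kind)
```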

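The gating mechanism is an effective solution to the problems enumerated in "The Problem with Vanilla RNNs (or Elman RNNs)". It not only makes the updates controllable but also keeps the gradient problems in check, which makes training relatively easier. Without further ado, we will show these gated architectures in action using two examples.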

Example: A Character-RNN for Generating Surnames




The SurnamesDataset


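The SurnameDataset class is largely the same as in previous chapters: we use a pandas DataFrame to load the dataset, and we construct a vectorizer that encapsulates the token-to-integer mappings required for the model and the task at hand. To accommodate the difference in tasks, the SurnameDataset.__getitem__() method is modified to output the sequences of integers for the prediction targets, as shown in Example 7-1. The method calls the vectorizer to compute the sequence of integers that serves as the input (from_vector) and the sequence of integers that serves as the output (to_vector). The implementation of vectorize() is described in the next subsection.

Example 7-1. The SurnameDataset.__getitem__() method for a sequence prediction task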

    import pandas as pd
    from torch.utils.data import Dataset

    # SurnameVectorizer is defined in Example 7-2.


    class SurnameDataset(Dataset):
        # __init__ and the remaining methods are omitted here.

        @classmethod
        def load_dataset_and_make_vectorizer(cls, surname_csv):
            """Load dataset and make a new vectorizer from scratch

            Args:
                surname_csv (str): location of the dataset
            Returns:
                an instance of SurnameDataset
            """
            surname_df = pd.read_csv(surname_csv)
            return cls(surname_df, SurnameVectorizer.from_dataframe(surname_df))

        def __getitem__(self, index):
            """the primary entry point method for PyTorch datasets

            Args:
                index (int): the index to the data point
            Returns:
                a dictionary holding the data point: (x_data, y_target, class_index)
            """
            row = self._target_df.iloc[index]

            # from_vector is the input sequence; to_vector holds the targets
            from_vector, to_vector = \
                self._vectorizer.vectorize(row.surname, self._max_seq_length)

            nationality_index = \
                self._vectorizer.nationality_vocab.lookup_token(row.nationality)

            return {'x_data': from_vector,
                    'y_target': to_vector,
                    'class_index': nationality_index}

The Vectorization Data Structures






Example 7-2. The code for SurnameVectorizer.vectorize in a sequence prediction task

    import numpy as np

    # Vocabulary and SequenceVocabulary are the classes from earlier chapters.


    class SurnameVectorizer(object):
        """ The Vectorizer which coordinates the Vocabularies and puts them to use"""

        def __init__(self, char_vocab, nationality_vocab):
            # required by from_dataframe(), which calls cls(char_vocab, nationality_vocab)
            self.char_vocab = char_vocab
            self.nationality_vocab = nationality_vocab

        def vectorize(self, surname, vector_length=-1):
            """Vectorize a surname into a vector of observations and targets

            Args:
                surname (str): the surname to be vectorized
                vector_length (int): an argument for forcing the length of index vector
            Returns:
                a tuple: (from_vector, to_vector)
                    from_vector (numpy.ndarray): the observation vector
                    to_vector (numpy.ndarray): the target prediction vector
            """
            indices = [self.char_vocab.begin_seq_index]
            indices.extend(self.char_vocab.lookup_token(token) for token in surname)
            indices.append(self.char_vocab.end_seq_index)

            if vector_length < 0:
                vector_length = len(indices) - 1

            # observations: every index except the last, padded with the mask
            from_vector = np.zeros(vector_length, dtype=np.int64)
            from_indices = indices[:-1]
            from_vector[:len(from_indices)] = from_indices
            from_vector[len(from_indices):] = self.char_vocab.mask_index

            # targets: every index except the first, padded with the mask
            to_vector = np.empty(vector_length, dtype=np.int64)
            to_indices = indices[1:]
            to_vector[:len(to_indices)] = to_indices
            to_vector[len(to_indices):] = self.char_vocab.mask_index

            return from_vector, to_vector

        @classmethod
        def from_dataframe(cls, surname_df):
            """Instantiate the vectorizer from the dataset dataframe

            Args:
                surname_df (pandas.DataFrame): the surname dataset
            Returns:
                an instance of the SurnameVectorizer
            """
            char_vocab = SequenceVocabulary()
            nationality_vocab = Vocabulary()

            for index, row in surname_df.iterrows():
                for char in row.surname:
                    char_vocab.add_token(char)
                nationality_vocab.add_token(row.nationality)

            return cls(char_vocab, nationality_vocab)
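To see what vectorize() in Example 7-2 produces, it helps to trace one surname by hand. The sketch below mirrors the slicing logic with a made-up toy vocabulary in place of SequenceVocabulary; the index values, the function name, and the choice of 0 as the mask index are assumptions for illustration.

```python
# Hypothetical toy vocabulary; index 0 is reserved for the mask (padding).
MASK, BEGIN, END = 0, 1, 2
char_to_index = {"a": 3, "d": 4, "m": 5, "s": 6}


def toy_vectorize(surname, vector_length):
    """Return (from_vector, to_vector) as plain lists of integers."""
    indices = [BEGIN] + [char_to_index[c] for c in surname] + [END]
    from_indices, to_indices = indices[:-1], indices[1:]
    pad = [MASK] * (vector_length - len(from_indices))
    return from_indices + pad, to_indices + pad


from_vector, to_vector = toy_vectorize("adams", vector_length=8)
print(from_vector)  # [1, 3, 4, 3, 5, 6, 0, 0]
print(to_vector)    # [3, 4, 3, 5, 6, 2, 0, 0]
```

At every position, to_vector holds the token that follows the one in from_vector, so the model is trained to predict the next character at each time step.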

From the ElmanRNN to the GRU


Model 1: Unconditioned Surname Generation Model

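The first of the two models is unconditioned: it does not observe the nationality before generating a surname. In practice, being unconditioned means that the GRU does not bias its computations toward any nationality. In the next example (Example 7-4), the computational bias is introduced through the initial hidden vector. In this example, we use a vector of all 0s so that the initial hidden state vector does not contribute to the computations.

In general, SurnameGenerationModel embeds character indices, computes their sequential state using a GRU, and computes the probability of token predictions using a Linear layer. More explicitly, the unconditioned SurnameGenerationModel starts by initializing an Embedding layer, a GRU, and a Linear layer.

As with the sequence models of Chapter 6, the model takes a matrix of integers as input. We use a PyTorch Embedding instance, char_emb, to convert the integers into a three-dimensional tensor (a sequence of vectors for each batch item). This tensor is passed to the GRU, which computes a state vector for each position in each sequence.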


Example 7-3. The unconditioned surname generation model

    import torch.nn as nn
    import torch.nn.functional as F


    class SurnameGenerationModel(nn.Module):
        def __init__(self, char_embedding_size, char_vocab_size, rnn_hidden_size,
                     batch_first=True, padding_idx=0, dropout_p=0.5):
            """
            Args:
                char_embedding_size (int): The size of the character embeddings
                char_vocab_size (int): The number of characters to embed
                rnn_hidden_size (int): The size of the RNN's hidden state
                batch_first (bool): Informs whether the input tensors will
                    have batch or the sequence on the 0th dimension
                padding_idx (int): The index for the tensor padding;
                    see torch.nn.Embedding
                dropout_p (float): the probability of zeroing activations using
                    the dropout method.
            """
            super(SurnameGenerationModel, self).__init__()

            self.char_emb = nn.Embedding(num_embeddings=char_vocab_size,
                                         embedding_dim=char_embedding_size,
                                         padding_idx=padding_idx)
            self.rnn = nn.GRU(input_size=char_embedding_size,
                              hidden_size=rnn_hidden_size,
                              batch_first=batch_first)
            self.fc = nn.Linear(in_features=rnn_hidden_size,
                                out_features=char_vocab_size)
            self._dropout_p = dropout_p

        def forward(self, x_in, apply_softmax=False):
            """The forward pass of the model

            Args:
                x_in (torch.Tensor): an input data tensor.
                    x_in.shape should be (batch, seq_size)
                apply_softmax (bool): a flag for the softmax activation
                    should be False during training
            Returns:
                the resulting tensor.
                    tensor.shape should be (batch, seq_size, char_vocab_size)
            """
            x_embedded = self.char_emb(x_in)

            y_out, _ = self.rnn(x_embedded)

            # flatten to a matrix so the Linear layer sees one row per time step
            batch_size, seq_size, feat_size = y_out.shape
            y_out = y_out.contiguous().view(batch_size * seq_size, feat_size)

            # training=self.training disables dropout in eval mode
            y_out = self.fc(F.dropout(y_out, p=self._dropout_p,
                                      training=self.training))

            if apply_softmax:
                y_out = F.softmax(y_out, dim=1)

            # restore the (batch, seq, vocab) shape
            new_feat_size = y_out.shape[-1]
            y_out = y_out.view(batch_size, seq_size, new_feat_size)

            return y_out
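The tensor shapes at each stage of the forward pass in Example 7-3 can be checked with stand-alone layers. This is a sketch with arbitrary toy sizes, not the model class itself; the variable names mirror those in the example.

```python
import torch
import torch.nn as nn

batch_size, seq_len = 2, 5
vocab_size, embedding_size, hidden_size = 30, 8, 16   # arbitrary toy sizes

char_emb = nn.Embedding(vocab_size, embedding_size, padding_idx=0)
rnn = nn.GRU(embedding_size, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)

x_in = torch.randint(1, vocab_size, (batch_size, seq_len))  # integer matrix
x_embedded = char_emb(x_in)          # (batch, seq, embedding)
y_out, _ = rnn(x_embedded)           # (batch, seq, hidden)

# Flatten to (batch * seq, hidden) for the Linear layer, then restore
flat = y_out.reshape(batch_size * seq_len, hidden_size)
logits = fc(flat).view(batch_size, seq_len, vocab_size)
print(x_in.shape, x_embedded.shape, y_out.shape, logits.shape)
```

The output has one score per vocabulary entry at every position of every sequence, which is what a per-time-step prediction task requires.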

Model 2: Conditioned Surname Generation Model


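Example 7-4 shows the differences for the conditioned model. Specifically, an extra Embedding is introduced to map the nationality indices to vectors of the same size as the RNN's hidden layer. Then, in the forward function, the nationality indices are embedded and simply passed in as the initial hidden layer of the RNN. Although this is a very simple modification of the first model, it has a profound effect in letting the RNN change its behavior based on the nationality of the surname being generated.

Example 7-4. The conditioned surname generation model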

    class SurnameGenerationModel(nn.Module):
        def __init__(self, char_embedding_size, char_vocab_size, num_nationalities,
                     rnn_hidden_size, batch_first=True, padding_idx=0, dropout_p=0.5):
            # ...
            # attribute names match Example 7-3 and the sampling code (Example 7-10)
            self.nation_emb = nn.Embedding(embedding_dim=rnn_hidden_size,
                                           num_embeddings=num_nationalities)

        def forward(self, x_in, nationality_index, apply_softmax=False):
            # ...
            x_embedded = self.char_emb(x_in)
            # hidden_size: (num_layers * num_directions, batch_size, rnn_hidden_size)
            nationality_embedded = self.nation_emb(nationality_index).unsqueeze(0)
            y_out, _ = self.rnn(x_embedded, nationality_embedded)
            # ...
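The conditioning step in Example 7-4 comes down to handing the GRU a non-zero initial hidden state. The following sketch isolates that step with toy sizes and random stand-in data (both assumptions) and compares the result with the default all-zeros initial state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_nationalities, hidden_size, embedding_size = 3, 16, 8  # toy sizes

nation_emb = nn.Embedding(num_nationalities, hidden_size)
rnn = nn.GRU(embedding_size, hidden_size, batch_first=True)

x_embedded = torch.randn(2, 5, embedding_size)   # stand-in for embedded characters
nationality_index = torch.tensor([0, 2])         # one nationality per batch item

# nn.GRU expects h_0 with shape (num_layers * num_directions, batch, hidden),
# even when batch_first=True, hence the unsqueeze(0)
h_0 = nation_emb(nationality_index).unsqueeze(0)
y_conditioned, _ = rnn(x_embedded, h_0)
y_unconditioned, _ = rnn(x_embedded)             # h_0 defaults to zeros
print(h_0.shape, torch.allclose(y_conditioned, y_unconditioned))
```

The same character input yields different state vectors once the initial hidden state carries the nationality embedding, which is how the nationality biases every later prediction.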

Training Routine and Results

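In this example, we introduced the task of predicting sequences of characters in order to generate surnames. Although many of the implementation details and the training routine are similar to the sequence classification examples in Chapter 6, there are a few major differences. In this section, we focus on the differences, the hyperparameters used, and the results.

Compared with the previous examples, computing the loss in this example requires two changes, because we make a prediction at every time step in the sequence. First, we reshape three-dimensional tensors into two-dimensional tensors (matrices) to satisfy computational constraints. Second, we coordinate the masking index, which allows for variable-length sequences, with the loss function so that the loss does not use the masked positions in its computation.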


Example 7-5. Handling three-dimensional tensors and sequence-wide loss computations

    import torch.nn.functional as F


    def normalize_sizes(y_pred, y_true):
        """Normalize tensor sizes

        Args:
            y_pred (torch.Tensor): the output of the model
                If a 3-dimensional tensor, reshapes to a matrix
            y_true (torch.Tensor): the target predictions
                If a matrix, reshapes to be a vector
        """
        if len(y_pred.size()) == 3:
            y_pred = y_pred.contiguous().view(-1, y_pred.size(2))
        if len(y_true.size()) == 2:
            y_true = y_true.contiguous().view(-1)
        return y_pred, y_true


    def sequence_loss(y_pred, y_true, mask_index):
        y_pred, y_true = normalize_sizes(y_pred, y_true)
        # positions whose target equals mask_index are excluded from the loss
        return F.cross_entropy(y_pred, y_true, ignore_index=mask_index)
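To confirm that masked positions really drop out of the loss, the value returned with ignore_index can be compared against a loss computed only over the unmasked positions. This is a small check on random logits; the tensor sizes and the mask index of 0 are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
mask_index = 0
y_pred = torch.randn(2, 4, 6)                 # (batch, seq, vocab) logits
y_true = torch.tensor([[3, 5, 2, 0],          # trailing 0s are padding
                       [4, 2, 0, 0]])

flat_pred = y_pred.view(-1, y_pred.size(2))   # (batch * seq, vocab)
flat_true = y_true.view(-1)                   # (batch * seq,)
masked_loss = F.cross_entropy(flat_pred, flat_true, ignore_index=mask_index)

# The same quantity computed by hand over the non-masked positions only
keep = flat_true != mask_index
manual_loss = F.cross_entropy(flat_pred[keep], flat_true[keep])
print(masked_loss.item(), manual_loss.item())
```

The two values agree because, with the default mean reduction, cross_entropy averages over the non-ignored targets only.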



Example 7-6. Hyperparameters for surname generation

    from argparse import Namespace

    args = Namespace(
        # Data and Path information
        surname_csv="data/surnames/surnames_with_splits.csv",
        vectorizer_file="vectorizer.json",
        model_state_file="model.pth",
        save_dir="model_storage/ch7/model1_unconditioned_surname_generation",
        # or: save_dir="model_storage/ch7/model2_conditioned_surname_generation",
        # Model hyper parameters
        char_embedding_size=32,
        rnn_hidden_size=32,
        # Training hyper parameters
        seed=1337,
        learning_rate=0.001,
        batch_size=128,
        num_epochs=100,
        early_stopping_criteria=5,
        # Runtime options omitted for space
    )

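Although the per-character accuracy of the predictions is a measure of model performance, in this example it is better to evaluate qualitatively by inspecting the kinds of surnames the model generates. To do this, we write a new loop over a modified version of the steps in the forward() method that computes predictions at each time step and uses those predictions as the input to the following time step. The code is shown in Example 7-7. The output of the model at each time step is a prediction vector, which is turned into a probability distribution using the softmax function. Given the probability distribution, we use the torch.multinomial() sampling function, which selects an index at a rate proportional to the probability of that index. Sampling is a stochastic procedure that produces different outputs each time.

Example 7-7. Sampling from the unconditioned generation model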

    import torch
    import torch.nn.functional as F


    def sample_from_model(model, vectorizer, num_samples=1, sample_size=20,
                          temperature=1.0):
        """Sample a sequence of indices from the model

        Args:
            model (SurnameGenerationModel): the trained model
            vectorizer (SurnameVectorizer): the corresponding vectorizer
            num_samples (int): the number of samples
            sample_size (int): the max length of the samples
            temperature (float): accentuates or flattens
                the distribution.
                0.0 < temperature < 1.0 will make it peakier.
                temperature > 1.0 will make it more uniform
        Returns:
            indices (torch.Tensor): the matrix of indices;
                shape = (num_samples, sample_size + 1); the first column
                holds the begin-of-sequence index
        """
        begin_seq_index = [vectorizer.char_vocab.begin_seq_index
                           for _ in range(num_samples)]
        begin_seq_index = torch.tensor(begin_seq_index,
                                       dtype=torch.int64).unsqueeze(dim=1)
        indices = [begin_seq_index]
        h_t = None  # None makes the GRU start from an all-zeros hidden state

        for time_step in range(sample_size):
            x_t = indices[time_step]
            x_emb_t = model.char_emb(x_t)
            rnn_out_t, h_t = model.rnn(x_emb_t, h_t)
            prediction_vector = model.fc(rnn_out_t.squeeze(dim=1))
            probability_vector = F.softmax(prediction_vector / temperature, dim=1)
            indices.append(torch.multinomial(probability_vector, num_samples=1))

        # each entry is (num_samples, 1); concatenating along dim=1 also
        # works when num_samples == 1, where squeeze().permute(1, 0) fails
        indices = torch.cat(indices, dim=1)
        return indices
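The temperature argument in Example 7-7 is easiest to understand by applying it to a fixed prediction vector. The sketch below uses made-up scores for a three-character vocabulary and shows how dividing by the temperature sharpens or flattens the softmax output before torch.multinomial draws from it.

```python
import torch
import torch.nn.functional as F

# Made-up scores for a three-character vocabulary
logits = torch.tensor([[2.0, 1.0, 0.0]])

sharp = F.softmax(logits / 0.5, dim=1)   # temperature < 1: peakier
plain = F.softmax(logits / 1.0, dim=1)   # temperature = 1: unchanged
flat = F.softmax(logits / 2.0, dim=1)    # temperature > 1: more uniform
print(sharp, plain, flat, sep="\n")

# torch.multinomial draws indices in proportion to their probability
torch.manual_seed(0)
samples = torch.multinomial(plain, num_samples=5, replacement=True)
print(samples)
```

Lower temperatures make the most likely character dominate, so samples become more conservative; higher temperatures give less likely characters more of a chance.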


Example 7-8. Mapping sampled indices to surname strings

    def decode_samples(sampled_indices, vectorizer):
        """Transform indices into the string form of a surname

        Args:
            sampled_indices (torch.Tensor): the indices from `sample_from_model`
            vectorizer (SurnameVectorizer): the corresponding vectorizer
        """
        decoded_surnames = []
        vocab = vectorizer.char_vocab

        for sample_index in range(sampled_indices.shape[0]):
            surname = ""
            for time_step in range(sampled_indices.shape[1]):
                sample_item = sampled_indices[sample_index, time_step].item()
                if sample_item == vocab.begin_seq_index:
                    continue
                elif sample_item == vocab.end_seq_index:
                    # everything after the end-of-sequence token is discarded
                    break
                else:
                    surname += vocab.lookup_index(sample_item)
            decoded_surnames.append(surname)
        return decoded_surnames

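Using these functions, you can inspect the output of the model, as shown in Example 7-9, to get a sense of whether the model is learning to generate sensible surnames. What can we learn from inspecting the output? We can see that although the surnames appear to follow several morphological patterns, they do not clearly come from one nationality or another. One possibility is that learning a general model of surnames confuses the character distributions of the different nationalities. The conditioned SurnameGenerationModel is meant to handle this kind of situation.

Example 7-9. Sampling from the unconditioned model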

    Input[0]
    samples = sample_from_model(unconditioned_model, vectorizer,
                                num_samples=10)
    decode_samples(samples, vectorizer)

    Output[0]
    ['Aqtaliby',
     'Yomaghev',
     'Mauasheev',
     'Unander',
     'Virrovo',
     'NInev',
     'Bukhumohe',
     'Burken',
     'Rati',
     'Jzirmar']

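For the conditioned SurnameGenerationModel, we modify the sample_from_model() function to accept a list of nationality indices rather than a specified number of samples. In Example 7-10, the modified function passes the nationality indices through the nationality embedding to construct the initial hidden state of the GRU. After that, the sampling procedure is exactly the same as for the unconditioned model.

Example 7-10. Sampling from a sequence model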

    import torch
    import torch.nn.functional as F


    def sample_from_model(model, vectorizer, nationalities, sample_size=20,
                          temperature=1.0):
        """Sample a sequence of indices from the model

        Args:
            model (SurnameGenerationModel): the trained model
            vectorizer (SurnameVectorizer): the corresponding vectorizer
            nationalities (list): a list of integers representing nationalities
            sample_size (int): the max length of the samples
            temperature (float): accentuates or flattens
                the distribution.
                0.0 < temperature < 1.0 will make it peakier.
                temperature > 1.0 will make it more uniform
        Returns:
            indices (torch.Tensor): the matrix of indices;
                shape = (num_samples, sample_size + 1); the first column
                holds the begin-of-sequence index
        """
        num_samples = len(nationalities)
        begin_seq_index = [vectorizer.char_vocab.begin_seq_index
                           for _ in range(num_samples)]
        begin_seq_index = torch.tensor(begin_seq_index,
                                       dtype=torch.int64).unsqueeze(dim=1)
        indices = [begin_seq_index]

        # initial hidden state: (1, num_samples, rnn_hidden_size)
        nationality_indices = torch.tensor(nationalities,
                                           dtype=torch.int64).unsqueeze(dim=0)
        h_t = model.nation_emb(nationality_indices)

        for time_step in range(sample_size):
            x_t = indices[time_step]
            x_emb_t = model.char_emb(x_t)
            rnn_out_t, h_t = model.rnn(x_emb_t, h_t)
            prediction_vector = model.fc(rnn_out_t.squeeze(dim=1))
            probability_vector = F.softmax(prediction_vector / temperature, dim=1)
            indices.append(torch.multinomial(probability_vector, num_samples=1))

        # each entry is (num_samples, 1); concatenating along dim=1 also
        # works for a single nationality, where squeeze().permute(1, 0) fails
        indices = torch.cat(indices, dim=1)
        return indices

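The usefulness of sampling with a conditioning vector is that it gives us influence over the generated output. In Example 7-11, we iterate over the nationality indices and sample from each of them. To save space, we show only some of the outputs. From these outputs, we can see that the model is indeed picking up on some patterns in how surnames are spelled.

Example 7-11. Sampling from a conditioned SurnameGenerationModel (not all outputs are shown)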

    Input[0]
    for index in range(len(vectorizer.nationality_vocab)):
        nationality = vectorizer.nationality_vocab.lookup_index(index)
        print("Sampled for {}: ".format(nationality))
        sampled_indices = sample_from_model(model=conditioned_model,
                                            vectorizer=vectorizer,
                                            nationalities=[index] * 3,
                                            temperature=0.7)
        for sampled_surname in decode_samples(sampled_indices,
                                              vectorizer):
            print("- " + sampled_surname)

    Output[0]
    Sampled for Arabic:
    - Khatso
    - Salbwa
    - Gadi
    Sampled for Chinese:
    - Lie
    - Puh
    - Pian
    Sampled for German:
    - Lenger
    - Schanger
    - Schumper
    Sampled for Irish:
    - Mcochin
    - Corran
    - O'Baintin
    Sampled for Russian:
    - Mahghatsunkov
    - Juhin
    - Karkovin
    Sampled for Vietnamese:
    - Lo
    - Tham
    - Tou

Tips and Tricks for Training Sequence Models

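Sequence models can be challenging to train, and many issues crop up in the process. Here, we summarize a few tips and tricks that we have found useful in our own work and that others have also reported in the literature.

1. When possible, use the gated variants
Gated architectures simplify training by addressing many of the numerical stability issues of the nongated variants.

2. When possible, prefer GRUs over LSTMs
GRUs provide almost the same performance as LSTMs while using fewer parameters and less computation. Fortunately, from PyTorch's point of view, using a GRU rather than an LSTM requires nothing more than using a different Module class.

3. Use Adam as your optimizer
Throughout Chapters 6, 7, and 8, we use only Adam as the optimizer, and for good reason: it is reliable and typically converges faster than the alternatives. This is especially true for sequence models. If for some reason your models are not converging with Adam, switching to stochastic gradient descent might help.

4. Gradient clipping
If you notice numerical errors when applying the concepts learned in these chapters, instrument your code to plot the values of the gradients during training. Once you know their range, clip any outliers. This will ensure smoother training. In PyTorch there is a helpful utility, clip_grad_norm_(), that does this for you, as shown in Example 7-12. In general, you should develop a habit of clipping gradients.

Example 7-12. Applying gradient clipping in PyTorch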

    import torch

    # define your sequence model
    model = ...
    # define loss function
    loss_function = ...

    # training loop
    for _ in ...:
        ...
        model.zero_grad()
        output, hidden = model(data, hidden)
        loss = loss_function(output, targets)
        loss.backward()
        # clip after backward() and before the optimizer step;
        # clip_grad_norm_ replaces the deprecated clip_grad_norm
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)
        ...
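Example 7-12 is a skeleton, so here is a version of a single training step that runs end to end. It is a sketch under stated assumptions: the toy GRU, the random data, the MSE loss, and the factor of 100 used to inflate the gradients are all made up for illustration; only the placement of clip_grad_norm_ between backward() and the optimizer step is the point.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.GRU(input_size=4, hidden_size=8, batch_first=True)   # toy model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
max_norm = 0.25

data = torch.randn(3, 6, 4)       # (batch, seq, input); random stand-in data
targets = torch.randn(3, 6, 8)

optimizer.zero_grad()
output, _ = model(data)
loss = nn.functional.mse_loss(output * 100, targets)  # scaled to inflate gradients
loss.backward()

# clip_grad_norm_ rescales the gradients in place and returns the total
# norm measured *before* clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
optimizer.step()
print(float(norm_before), float(norm_after))
```

Logging the returned pre-clipping norm during training is a cheap way to learn the typical gradient range before settling on a threshold.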

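5. Early stopping
With sequence models, it is easy to overfit. We recommend that you stop the training procedure early, when the evaluation error, as measured on a development set, starts going up.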
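One common way to implement early stopping is a patience counter, which is presumably what the early_stopping_criteria=5 hyperparameter in Example 7-6 controls. The function below is a hypothetical sketch of that idea, not the training routine used in this chapter, and the validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop.

    Stops once the validation loss has failed to improve for
    `patience` consecutive epochs; returns the last index otherwise.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Made-up validation losses: improvement stalls after epoch 2
losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.75, 0.69]
stop_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch)  # 5
```

In practice you would also keep a checkpoint of the model from the best epoch (epoch 2 here), because the parameters at the stopping epoch are already past the point of best generalization.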
