Chapter 6. Sequence Modeling for Natural Language Processing


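Author: Yif Du

Published: December 24, 2018 - 12:12

Last updated: December 28, 2018 - 11:12


License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please keep the original link and author when reposting.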

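A sequence is an ordered collection of items. Traditional machine learning assumes that data points are independent and identically distributed (IID), but in many situations, such as language, speech, and time-series data, one data item depends on the items that precede or follow it. Such data is called sequence data. Sequential information is everywhere in human language. Speech, for example, can be viewed as a sequence of basic units called phonemes. In a language like English, the words in a sentence are not haphazard; they may be constrained by the words that come before or after them. For example, the preposition "of" is likely to be followed by the article "the," as in "The lion is the king of the jungle." As another example, in many languages, including English, the number of the verb must agree with the number of the subject of the sentence:

    The book is on the table.
    The books are on the table.

Sometimes these dependencies or constraints can be arbitrarily long:

    The book that I got yesterday is on the table.
    The books read by the second-grade children are shelved in the lower rack.

In short, understanding sequences is essential to understanding human language. In the previous chapters, we introduced feed-forward neural networks, such as multilayer perceptrons (MLPs) and convolutional neural networks (CNNs), and the power of vector representations. Although a wide range of natural language processing (NLP) tasks can be approached with these techniques, as we will learn in this chapter and in Chapters 7 and 8, they do not adequately model sequences.

Traditional approaches to modeling sequences in NLP use hidden Markov models, conditional random fields, and other kinds of probabilistic graphical models. Although they are not discussed in this book, they remain relevant; for these, see Koller and Friedman (2009).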



Introduction to Recurrent Neural Networks

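Recurrent neural networks (RNNs) are designed to model sequences of tensors. Like feed-forward networks, RNNs are a family of models. The family has several members, but in this chapter we work only with the most basic form, sometimes called the Elman RNN. The goal of recurrent networks (both the basic Elman form and the more complicated forms outlined in Chapter 7) is to learn a representation of a sequence. This is done by maintaining a hidden state vector that captures the current state of the sequence. The hidden state vector is computed from the current input vector and the previous hidden state vector. These relationships are shown in Figure 6-1, which gives both a functional view (left) and an "unrolled" view (right) of the computational dependencies.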
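The paragraph above describes the hidden-state update in words only. Written out, the Elman recurrence is commonly given in the following form. The symbol names (W_xh, W_hh, b_h) are notation chosen here, not taken from the original text, and the tanh nonlinearity is the default used by PyTorch's RNN and RNNCell classes.

```latex
h_t = \tanh\left( W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h \right), \qquad t = 1, \dots, T
```

Here x_t is the input vector at time step t and h_{t-1} is the previous hidden state (h_0 is typically a vector of zeros). The same weights are reused at every time step, which is what lets one network process sequences of different lengths.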

Implementing an Elman RNN

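To explore the details of RNNs, let us step through a simple implementation of the Elman RNN. PyTorch offers many useful classes and helper functions for building RNNs. The PyTorch RNN class implements the Elman RNN. In this chapter, instead of using PyTorch's RNN class directly, we use RNNCell, an abstraction for a single time step of the RNN, and construct an RNN from it. Our intention in doing so is to show the RNN computations explicitly. The ElmanRNN class shown in Example 6-1 makes use of RNNCell. RNNCell creates the input-to-hidden and hidden-to-hidden weight matrices described in "Introduction to Recurrent Neural Networks." Each call to RNNCell accepts a matrix of input vectors and a matrix of hidden vectors, and returns the matrix of hidden vectors that results from one step.

In addition to the input size and hidden size hyperparameters, there is a Boolean argument, batch_first, that specifies whether the batch dimension is on the 0th dimension. This flag is also present in all of PyTorch's RNN implementations. When it is set to True, the RNN swaps the 0th and 1st dimensions of the input tensor.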
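Before looking at the PyTorch version, it may help to see the same computation without any library. The following is a minimal pure-Python sketch, not the chapter's implementation: the function names (elman_step, swap_first_two_dims, run_elman), the single merged bias vector, and the toy weights are all assumptions made for illustration. It applies one Elman step per time step and handles a batch_first flag by swapping the first two dimensions on the way in and on the way out.

```python
import math


def elman_step(x_t, h_prev, W_xh, W_hh, b):
    """One step for one example: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        total = b[i]
        total += sum(w * x for w, x in zip(W_xh[i], x_t))
        total += sum(w * h for w, h in zip(W_hh[i], h_prev))
        h_t.append(math.tanh(total))
    return h_t


def swap_first_two_dims(nested):
    """Turn a nested list of shape (a, b, ...) into shape (b, a, ...)."""
    return [list(group) for group in zip(*nested)]


def run_elman(x_in, W_xh, W_hh, b, batch_first=False):
    """Run the recurrence over a whole batch of equal-length sequences."""
    if batch_first:
        x_in = swap_first_two_dims(x_in)   # -> (seq, batch, feat)
    batch_size = len(x_in[0])
    hidden = [[0.0] * len(b) for _ in range(batch_size)]  # zero initial state
    hiddens = []
    for x_t in x_in:                        # x_t has shape (batch, feat)
        hidden = [elman_step(x, h, W_xh, W_hh, b) for x, h in zip(x_t, hidden)]
        hiddens.append(hidden)
    if batch_first:
        hiddens = swap_first_two_dims(hiddens)  # -> (batch, seq, hidden)
    return hiddens


# Toy sizes (hypothetical values): batch=2, seq=3, feat=2, hidden=2
W_xh = [[0.5, -0.5], [0.1, 0.2]]
W_hh = [[0.0, 0.3], [-0.3, 0.0]]
b = [0.0, 0.1]
x_batch_first = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [0.0, 0.0], [-1.0, 1.0]],
]
out_bf = run_elman(x_batch_first, W_xh, W_hh, b, batch_first=True)
out_tm = run_elman(swap_first_two_dims(x_batch_first), W_xh, W_hh, b)
print(len(out_bf), len(out_bf[0]), len(out_bf[0][0]))  # 2 3 2 (batch, seq, hidden)
print(len(out_tm), len(out_tm[0]), len(out_tm[0][0]))  # 3 2 2 (seq, batch, hidden)
```

The two calls compute the same hidden states; only the order of the first two dimensions of the result differs.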



Example 6-1. An implementation of the Elman RNN using PyTorch’s RNNCell

    import torch
    import torch.nn as nn


    class ElmanRNN(nn.Module):
        """ an Elman RNN built using the RNNCell """
        def __init__(self, input_size, hidden_size, batch_first=False):
            """
            Args:
                input_size (int): size of the input vectors
                hidden_size (int): size of the hidden state vectors
                batch_first (bool): whether the 0th dimension is batch
            """
            super(ElmanRNN, self).__init__()
            self.rnn_cell = nn.RNNCell(input_size, hidden_size)
            self.batch_first = batch_first
            self.hidden_size = hidden_size

        def _initialize_hidden(self, batch_size):
            return torch.zeros((batch_size, self.hidden_size))

        def forward(self, x_in, initial_hidden=None):
            """The forward pass of the ElmanRNN

            Args:
                x_in (torch.Tensor): an input data tensor.
                    If self.batch_first: x_in.shape = (batch_size, seq_size, feat_size)
                    Else: x_in.shape = (seq_size, batch_size, feat_size)
                initial_hidden (torch.Tensor): the initial hidden state for the RNN
            Returns:
                hiddens (torch.Tensor): The outputs of the RNN at each time step.
                    If self.batch_first:
                        hiddens.shape = (batch_size, seq_size, hidden_size)
                    Else: hiddens.shape = (seq_size, batch_size, hidden_size)
            """
            if self.batch_first:
                batch_size, seq_size, feat_size = x_in.size()
                # the loop below indexes time steps on dimension 0
                x_in = x_in.permute(1, 0, 2)
            else:
                seq_size, batch_size, feat_size = x_in.size()

            hiddens = []

            if initial_hidden is None:
                initial_hidden = self._initialize_hidden(batch_size)
                # the source line is truncated after "="; moving the zeros to
                # the input's device is the assumed completion
                initial_hidden = initial_hidden.to(x_in.device)

            hidden_t = initial_hidden

            for t in range(seq_size):
                hidden_t = self.rnn_cell(x_in[t], hidden_t)
                hiddens.append(hidden_t)

            hiddens = torch.stack(hiddens)

            if self.batch_first:
                hiddens = hiddens.permute(1, 0, 2)

            return hiddens

Example: Classifying Surname Nationality using a Character RNN


The Surnames Dataset



Example 6-2. Implementing the Dataset for the Surname data

    import pandas as pd
    from torch.utils.data import Dataset


    class SurnameDataset(Dataset):
        # Excerpt: __init__ and the other methods of the class are not shown.
        @classmethod
        def load_dataset_and_make_vectorizer(cls, surname_csv):
            """Load dataset and make a new vectorizer from scratch

            Args:
                surname_csv (str): location of the dataset
            Returns:
                an instance of SurnameDataset
            """
            surname_df = pd.read_csv(surname_csv)
            # build the vocabularies from the training split only
            train_surname_df = surname_df[surname_df.split == 'train']
            return cls(surname_df, SurnameVectorizer.from_dataframe(train_surname_df))

        def __getitem__(self, index):
            """the primary entry point method for PyTorch datasets

            Args:
                index (int): the index to the data point
            Returns:
                a dictionary holding the data point's:
                    features (x_data)
                    label (y_target)
                    feature length (x_length)
            """
            row = self._target_df.iloc[index]

            surname_vector, vec_length = \
                self._vectorizer.vectorize(row.surname, self._max_seq_length)

            nationality_index = \
                self._vectorizer.nationality_vocab.lookup_token(row.nationality)

            return {'x_data': surname_vector,
                    'y_target': nationality_index,
                    'x_length': vec_length}

The Vectorization Data Structures



Example 6-3. A vectorizer for surnames

    import numpy as np


    class SurnameVectorizer(object):
        """ The Vectorizer which coordinates the Vocabularies and puts them to use"""
        # Excerpt: __init__ (which stores char_vocab and nationality_vocab) is not shown.
        def vectorize(self, surname, vector_length=-1):
            """
            Args:
                surname (str): the string of characters
                vector_length (int): an argument for forcing the length of index vector
            Returns:
                a tuple of the padded index vector and the unpadded length
            """
            indices = [self.char_vocab.begin_seq_index]
            indices.extend(self.char_vocab.lookup_token(token)
                           for token in surname)
            indices.append(self.char_vocab.end_seq_index)

            if vector_length < 0:
                vector_length = len(indices)

            out_vector = np.zeros(vector_length, dtype=np.int64)
            out_vector[:len(indices)] = indices
            # pad the remaining positions with the mask index
            out_vector[len(indices):] = self.char_vocab.mask_index

            return out_vector, len(indices)

        @classmethod
        def from_dataframe(cls, surname_df):
            """Instantiate the vectorizer from the dataset dataframe

            Args:
                surname_df (pandas.DataFrame): the surnames dataset
            Returns:
                an instance of the SurnameVectorizer
            """
            char_vocab = SequenceVocabulary()
            nationality_vocab = Vocabulary()

            for index, row in surname_df.iterrows():
                for char in row.surname:
                    char_vocab.add_token(char)
                nationality_vocab.add_token(row.nationality)

            return cls(char_vocab, nationality_vocab)
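To make the output of vectorize concrete, here is a small stand-alone sketch that mimics it with plain lists. The vocabulary below is a hypothetical toy mapping (the special-token indices 0 to 3 and the lowercase alphabet are assumptions; the chapter's SequenceVocabulary assigns its own indices), and the sketch assumes vector_length is at least the surname length plus two.

```python
# Hypothetical toy vocabulary; not the chapter's SequenceVocabulary.
MASK, UNK, BEGIN, END = 0, 1, 2, 3
char_to_index = {"<MASK>": MASK, "<UNK>": UNK, "<BEGIN>": BEGIN, "<END>": END}
for ch in "abcdefghijklmnopqrstuvwxyz":
    char_to_index[ch] = len(char_to_index)   # a -> 4, b -> 5, ...


def vectorize(surname, vector_length=-1):
    """Wrap the character indices in BEGIN/END and pad with MASK."""
    indices = [BEGIN]
    indices.extend(char_to_index.get(ch, UNK) for ch in surname)
    indices.append(END)
    if vector_length < 0:
        vector_length = len(indices)
    out = [MASK] * vector_length
    out[:len(indices)] = indices
    return out, len(indices)


print(vectorize("du", 6))  # ([2, 7, 24, 3, 0, 0], 4)
```

The second element of the result is the unpadded length, which is what the dataset returns as x_length and what the classifier later uses to find the last real position in each sequence.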

The SurnameClassifier Model



Example 6-4. Implementing the Surname Classifier Model

    import torch.nn as nn
    import torch.nn.functional as F


    class SurnameClassifier(nn.Module):
        """ An RNN to extract features & a MLP to classify """
        def __init__(self, embedding_size, num_embeddings, num_classes,
                     rnn_hidden_size, batch_first=True, padding_idx=0):
            """
            Args:
                embedding_size (int): The size of the character embeddings
                num_embeddings (int): The number of characters to embed
                num_classes (int): The size of the prediction vector
                    Note: the number of nationalities
                rnn_hidden_size (int): The size of the RNN's hidden state
                batch_first (bool): Informs whether the input tensors will
                    have batch or the sequence on the 0th dimension
                padding_idx (int): The index for the tensor padding;
                    see torch.nn.Embedding
            """
            super(SurnameClassifier, self).__init__()

            self.emb = nn.Embedding(num_embeddings=num_embeddings,
                                    embedding_dim=embedding_size,
                                    padding_idx=padding_idx)
            self.rnn = ElmanRNN(input_size=embedding_size,
                                hidden_size=rnn_hidden_size,
                                batch_first=batch_first)
            self.fc1 = nn.Linear(in_features=rnn_hidden_size,
                                 out_features=rnn_hidden_size)
            self.fc2 = nn.Linear(in_features=rnn_hidden_size,
                                 out_features=num_classes)

        def forward(self, x_in, x_lengths=None, apply_softmax=False):
            """The forward pass of the classifier

            Args:
                x_in (torch.Tensor): an input data tensor of character indices.
                    x_in.shape should be (batch, seq_size)
                x_lengths (torch.Tensor): the lengths of each sequence in the batch.
                    They are used to find the final vector of each sequence
                apply_softmax (bool): a flag for the softmax activation
                    should be false if used with the Cross Entropy losses
            Returns:
                out (torch.Tensor); `out.shape = (batch, num_classes)`
            """
            x_embedded = self.emb(x_in)
            y_out = self.rnn(x_embedded)

            if x_lengths is not None:
                y_out = column_gather(y_out, x_lengths)
            else:
                # assumes batch_first=True: take the last time step
                y_out = y_out[:, -1, :]

            # F.dropout defaults to training=True, so pass self.training
            # to make sure dropout is switched off in eval() mode
            y_out = F.dropout(y_out, 0.5, training=self.training)
            y_out = F.relu(self.fc1(y_out))
            y_out = F.dropout(y_out, 0.5, training=self.training)
            y_out = self.fc2(y_out)

            if apply_softmax:
                y_out = F.softmax(y_out, dim=1)

            return y_out
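The docstring says apply_softmax should be false when the model is used with a cross-entropy loss. The reason is that PyTorch's nn.CrossEntropyLoss expects raw scores (logits) and applies a log-softmax internally, so a softmax in the model would be applied twice. The sketch below shows this in plain Python; the helper names log_softmax and cross_entropy and the example scores are illustrative assumptions, not code from the chapter.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy(logits, target_index):
    """Negative log-probability of the target class, computed from raw logits."""
    return -log_softmax(logits)[target_index]


logits = [2.0, 0.5, -1.0]                      # hypothetical model outputs
probs = [math.exp(v) for v in log_softmax(logits)]

loss_from_logits = cross_entropy(logits, target_index=0)
loss_from_probs = cross_entropy(probs, target_index=0)  # softmax applied twice
print(loss_from_logits, loss_from_probs)
```

The two losses differ: feeding probabilities instead of logits flattens the distribution and gives a misleading loss value, which is why the softmax is reserved for inference, when actual probabilities are wanted.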

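You will notice that the forward function takes the lengths of the sequences as an argument. The lengths are used to retrieve the final vector of each sequence from the tensor returned by the RNN, using a function named column_gather, shown in Example 6-5. The function iterates over the batch row indices and retrieves the vector at the position indicated by the corresponding sequence length.

Example 6-5. Retrieving the final vector in each sequence using column_gather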

    import torch


    def column_gather(y_out, x_lengths):
        '''Get a specific vector from each batch datapoint in `y_out`.

        Args:
            y_out (torch.FloatTensor, torch.cuda.FloatTensor)
                shape: (batch, sequence, feature)
            x_lengths (torch.LongTensor, torch.cuda.LongTensor)
                shape: (batch,)
        Returns:
            y_out (torch.FloatTensor, torch.cuda.FloatTensor)
                shape: (batch, feature)
        '''
        # convert lengths to 0-based positions of the last real time step
        x_lengths = x_lengths.long().detach().cpu().numpy() - 1

        out = []
        for batch_index, column_index in enumerate(x_lengths):
            out.append(y_out[batch_index, column_index])

        return torch.stack(out)
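The selection that column_gather performs can be illustrated on nested lists. This is a sketch only; column_gather_lists and the toy values are hypothetical and stand in for the tensor version above.

```python
def column_gather_lists(y_out, x_lengths):
    """y_out: nested list of shape (batch, sequence, feature); x_lengths: list of ints.

    For each sequence, pick the vector at position length - 1,
    that is, the last time step that is not padding.
    """
    return [sequence[length - 1] for sequence, length in zip(y_out, x_lengths)]


y_out = [
    [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]],  # true length 2; third step is padding
    [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]],  # true length 3
]
print(column_gather_lists(y_out, [2, 3]))  # [[0.3, 0.4], [0.9, 1.0]]
```

With tensors, the same selection can also be written without a Python loop by using integer-array indexing, for example y_out[torch.arange(batch_size), x_lengths - 1]; the loop form is kept in the chapter because it makes each step explicit.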

The Training Routine and Results

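The training routine follows the standard formula. For a single batch of data, apply the model and compute the prediction vectors. Use the cross-entropy loss and the ground truth to compute a loss value. Using the loss value and an optimizer, compute the gradients and update the weights of the model with those gradients. Repeat this for each batch in the training data. Proceed similarly with the validation data, but with the model set to eval() mode, which disables dropout, and with no backpropagation or weight updates on the validation data. Instead, the validation data is used only to give a less biased sense of how the model is performing. This routine is repeated for a specified number of epochs. The code is available in the supplementary material.

We encourage you to play with the hyperparameters to get a sense of what affects performance and by how much, and to tabulate the results. We also leave writing a suitable baseline model for this task as an exercise for you to complete. The model implemented in "The SurnameClassifier Model" is general and is not restricted to characters. The embedding layer in the model can map any discrete item in a sequence of discrete items; a sentence, for example, is a sequence of words. We encourage you to use the code in Example 6-6 on other sequence classification tasks, such as sentence classification.

Example 6-6. Arguments to the RNN-based Surname Classifier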

    from argparse import Namespace

    args = Namespace(
        # Data and path information
        surname_csv="data/surnames/surnames_with_splits.csv",
        vectorizer_file="vectorizer.json",
        model_state_file="model.pth",
        save_dir="model_storage/ch6/surname_classification",
        # Model hyper parameter
        char_embedding_size=100,
        rnn_hidden_size=64,
        # Training hyper parameter
        num_epochs=100,
        learning_rate=1e-3,
        batch_size=64,
        seed=1337,
        early_stopping_criteria=5,
        # ... Runtime options not shown for space
    )
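The arguments num_epochs and early_stopping_criteria control how long the training routine runs. The sketch below shows one plausible way such an outer loop could be organized. It is an assumption, not the chapter's training code: train_one_epoch and evaluate are hypothetical callables supplied by the caller, and early_stopping_criteria is interpreted here as the number of consecutive epochs without validation improvement that are tolerated before stopping.

```python
def run_training(train_one_epoch, evaluate, num_epochs, early_stopping_criteria):
    """Alternate training and validation; stop early if validation loss stalls.

    train_one_epoch: hypothetical callable that updates the model on all
        training batches.
    evaluate: hypothetical callable that returns the validation loss of the
        current model (no weight updates).
    """
    best_val_loss = float("inf")
    epochs_without_improvement = 0
    history = []
    for epoch in range(num_epochs):
        train_one_epoch()
        val_loss = evaluate()
        history.append(val_loss)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= early_stopping_criteria:
            break
    return best_val_loss, history


# Toy run: scripted validation losses stand in for a real model.
scripted_losses = iter([0.9, 0.7, 0.72, 0.71, 0.8, 0.6])
best, history = run_training(
    train_one_epoch=lambda: None,
    evaluate=lambda: next(scripted_losses),
    num_epochs=6,
    early_stopping_criteria=3,
)
print(best, len(history))  # 0.7 5
```

In the toy run the loop stops after the fifth epoch, because the validation loss has failed to improve on 0.7 for three epochs in a row, so the final scripted value is never reached.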


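In this chapter, you learned about recurrent neural networks for modeling sequence data, and about the simplest kind of recurrent network, the Elman RNN. We established that the goal of sequence modeling is to learn a representation (that is, a vector) of a sequence. This learned representation can be used in different ways depending on the task. We considered an example task that involves classifying this hidden-state representation into one of many classes. The surname classification task showed an example of using RNNs to capture information at the subword level.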