3.3 Training Classifier-Based Chunkers

Both the regular-expression-based chunkers and the n-gram chunkers decide what chunks to create entirely on the basis of part-of-speech tags. Sometimes, however, part-of-speech tags are insufficient to determine how a sentence should be chunked. For example, consider the following two statements:

a. Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.
b. Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.

These two sentences have the same part-of-speech tags, yet they are chunked differently. In the first sentence, *the farmer* and *rice* are separate chunks, while the corresponding material in the second sentence, *the computer monitor*, is a single chunk. Clearly, we need to make use of information about the content of the words, in addition to their part-of-speech tags, if we wish to maximize chunking performance.

One way to incorporate information about word content is to use a classifier-based tagger to chunk the sentence. Like the n-gram chunker, it works by assigning IOB tags to the words of a sentence and then converting those tags to chunks. The basic code for the classifier-based NP chunker is shown below. It consists of two classes: a tagger that assigns IOB tags, and a wrapper that turns that tagger into a chunker by converting between chunk trees and tag sequences.

```python
import nltk

class ConsecutiveNPChunkTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                # Extract features for the current position, given the
                # IOB tags predicted so far.
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        # Requires the external megam package; algorithm='iis' or 'gis'
        # can be used if megam is not installed.
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI):

    def __init__(self, train_sents):
        # Map each chunk tree to a list of ((word, tag), iob-tag) pairs.
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        # Tag the sentence, then convert the IOB tags back into a tree.
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)
```
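The two list comprehensions in `ConsecutiveNPChunker` shuttle between CoNLL-style `(word, tag, chunk)` triples and the `((word, tag), chunk)` pairs that the internal tagger expects. A minimal sketch of that reshaping, using a hand-made sentence (the triples here are illustrative, not drawn from the corpus):

```python
# One sentence as CoNLL-style (word, pos, iob) triples, the format
# produced by nltk.chunk.tree2conlltags().
conll = [("the", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("sat", "VBD", "O")]

# __init__ regroups each triple into ((word, pos), iob), so the IOB tag
# becomes the label that the internal tagger learns to predict.
tagged = [((w, t), c) for (w, t, c) in conll]
print(tagged[0])   # (('the', 'DT'), 'B-NP')

# parse() applies the inverse reshaping before calling conlltags2tree().
back = [(w, t, c) for ((w, t), c) in tagged]
print(back == conll)   # True
```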

The only piece left to fill in is the feature extractor. First, we define a simple feature extractor which just provides the part-of-speech tag of the current token. Using this feature extractor, our classifier-based chunker performs very similarly to the unigram chunker:

```python
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     return {"pos": pos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.7%
    F-Measure:     83.2%
```
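Since the featureset contains nothing but the POS tag, any two sentences with the same tag sequence receive identical featuresets at every position, so this extractor can never chunk them differently. A small sketch with two invented tagged sentences:

```python
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos}

# Two different sentences that share one POS-tag sequence.
s1 = [("Joey", "NN"), ("sold", "VBD"), ("the", "DT"),
      ("farmer", "NN"), ("rice", "NN"), (".", ".")]
s2 = [("Nick", "NN"), ("broke", "VBD"), ("my", "DT"),
      ("computer", "NN"), ("monitor", "NN"), (".", ".")]

# Identical features at every position, so the classifier sees no
# difference between the two sentences.
print(all(npchunk_features(s1, i, []) == npchunk_features(s2, i, [])
          for i in range(len(s1))))   # True
```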

We can also add a feature for the previous part-of-speech tag. Adding this feature allows the chunker to model interactions between adjacent tags, and results in a chunker that is closely related to the bigram chunker.

```python
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  93.6%
    Precision:     81.9%
    Recall:        87.2%
    F-Measure:     84.5%
```
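A quick sanity check of the boundary handling, using an invented two-word sentence: at index 0 the extractor falls back to the `<START>` placeholder, and after that it reports the actual previous tag:

```python
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "prevpos": prevpos}

sent = [("the", "DT"), ("cat", "NN")]
print(npchunk_features(sent, 0, []))  # {'pos': 'DT', 'prevpos': '<START>'}
print(npchunk_features(sent, 1, []))  # {'pos': 'NN', 'prevpos': 'DT'}
```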

Next, we will try adding a feature for the current word, since we hypothesized that word content should be useful for chunking. We find that this feature does indeed improve the chunker's performance, by about 1.5 percentage points (which corresponds to about a 10% reduction in the error rate).

```python
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "word": word, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  94.5%
    Precision:     84.2%
    Recall:        89.4%
    F-Measure:     86.7%
```
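With the word itself included, positions that share the same POS context can now be distinguished by their lexical content. A toy example (the sentence is invented):

```python
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "word": word, "prevpos": prevpos}

sent = [("the", "DT"), ("computer", "NN"), ("monitor", "NN")]
# The history argument carries the IOB tags predicted so far; this
# extractor ignores it, but the tagger still passes it in.
print(npchunk_features(sent, 2, ["B-NP", "I-NP"]))
# {'pos': 'NN', 'word': 'monitor', 'prevpos': 'NN'}
```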

Finally, we can try extending the feature extractor with a variety of additional features, such as lookahead features [1], paired features [2], and complex contextual features [3]. This last feature, called tags-since-dt, creates a string describing the set of all part-of-speech tags that have been encountered since the most recent determiner, or since the beginning of the sentence if there is no determiner before index i.

```python
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     if i == len(sentence)-1:
...         nextword, nextpos = "<END>", "<END>"
...     else:
...         nextword, nextpos = sentence[i+1]
...     return {"pos": pos,
...             "word": word,
...             "prevpos": prevpos,
...             "nextpos": nextpos,                           # [1]
...             "prevpos+pos": "%s+%s" % (prevpos, pos),      # [2]
...             "pos+nextpos": "%s+%s" % (pos, nextpos),
...             "tags-since-dt": tags_since_dt(sentence, i)}  # [3]
```
```python
>>> def tags_since_dt(sentence, i):
...     tags = set()
...     for word, pos in sentence[:i]:
...         if pos == 'DT':
...             tags = set()
...         else:
...             tags.add(pos)
...     return '+'.join(sorted(tags))
```
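tags_since_dt can be exercised on its own. In this invented sentence, the accumulated tag set is reset at every determiner, so the result at index 3 covers only the tags seen after "the", and at index 5, immediately after the determiner "a", it is empty:

```python
def tags_since_dt(sentence, i):
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()   # reset at each determiner
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

sent = [("the", "DT"), ("old", "JJ"), ("cat", "NN"),
        ("on", "IN"), ("a", "DT"), ("mat", "NN")]
print(tags_since_dt(sent, 3))  # 'JJ+NN'
print(tags_since_dt(sent, 5))  # '' (reset by the determiner at index 4)
```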
```python
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  96.0%
    Precision:     88.6%
    Recall:        91.0%
    F-Measure:     89.8%
```

Note

Your Turn: Try adding different features to the feature extractor function npchunk_features, and see whether you can further improve the performance of the NP chunker.