3.3 Training Classifier-Based Chunkers
Both the regular-expression based chunkers and the n-gram chunkers decide what chunks to create entirely on the basis of part-of-speech tags. However, sometimes part-of-speech tags are insufficient to determine how a sentence should be chunked. For example, consider the following two statements:

(3a) Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.

(3b) Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.

These two sentences have identical part-of-speech tags, yet they are chunked differently: in the first, the farmer and rice are separate chunks, while in the second the corresponding material, the computer monitor, is a single chunk. Clearly, we need to use information about the content of the words, in addition to their part-of-speech tags, if we want to maximize chunking performance. One way to incorporate word content is to use a classifier-based tagger to chunk the sentence. The chunker below consists of two classes: a tagger that assigns IOB chunk tags to (word, tag) pairs, and a wrapper that converts between chunk trees and the tagged representation.
import nltk

class ConsecutiveNPChunkTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI):

    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)
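To make the data flow inside `ConsecutiveNPChunkTagger.__init__` concrete, here is a minimal stand-alone sketch (no NLTK required) of how the training pairs are assembled, using a hypothetical three-token sentence and the simple pos-only feature extractor defined below. Each token's feature dict is paired with its IOB chunk tag, and the tag is appended to `history` so later tokens could condition on it:

```python
# Hypothetical tagged_sent: ((word, pos), chunk_tag) pairs, the same shape
# that nltk.chunk.tree2conlltags produces after the wrapper regroups it.
tagged_sent = [(("the", "DT"), "B-NP"),
               (("cat", "NN"), "I-NP"),
               (("sat", "VBD"), "O")]

def npchunk_features(sentence, i, history):
    # Simplest extractor: just the current token's part-of-speech tag.
    word, pos = sentence[i]
    return {"pos": pos}

# nltk.tag.untag would strip the chunk tags; here we do it by hand.
untagged_sent = [wt for (wt, c) in tagged_sent]

history, train_set = [], []
for i, (wt, chunk_tag) in enumerate(tagged_sent):
    featureset = npchunk_features(untagged_sent, i, history)
    train_set.append((featureset, chunk_tag))
    history.append(chunk_tag)

print(train_set)
# [({'pos': 'DT'}, 'B-NP'), ({'pos': 'NN'}, 'I-NP'), ({'pos': 'VBD'}, 'O')]
```

The `history` list is what distinguishes this design from an ordinary per-token classifier: at tagging time the same loop runs greedily, feeding each predicted tag back in as context for the next decision.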
The only piece left to fill in is the feature extractor. We begin with a simple one that just provides the part-of-speech tag of the current token. With this feature extractor, our classifier-based chunker performs very similarly to the unigram chunker:
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     return {"pos": pos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
IOB Accuracy: 92.9%
Precision: 79.9%
Recall: 86.7%
F-Measure: 83.2%
We can also add a feature for the previous word's part-of-speech tag. Adding this feature allows the chunker to model interactions between adjacent tags, and results in a chunker closely related to the bigram chunker.
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
IOB Accuracy: 93.6%
Precision: 81.9%
Recall: 87.2%
F-Measure: 84.5%
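As a quick check of how the `<START>` padding behaves, the feature extractor can be run by hand on a hypothetical two-token sentence; at position 0 there is no previous token, so the sentinel value stands in for it:

```python
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        # No previous token: pad with sentinel values.
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "prevpos": prevpos}

# Hand-built (word, tag) sentence; the history argument is unused here.
sent = [("the", "DT"), ("cat", "NN")]
print(npchunk_features(sent, 0, []))        # {'pos': 'DT', 'prevpos': '<START>'}
print(npchunk_features(sent, 1, ["B-NP"]))  # {'pos': 'NN', 'prevpos': 'DT'}
```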
Next, we'll try adding a feature for the current word itself, since we hypothesized that word content should be useful for chunking. We find that this feature does indeed improve the chunker's performance, by about 1.5 percentage points (corresponding to roughly a 10% reduction in the error rate).
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "word": word, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
IOB Accuracy: 94.5%
Precision: 84.2%
Recall: 89.4%
F-Measure: 86.7%
Finally, we try extending the feature extractor with a variety of additional features, such as lookahead features, paired features, and complex contextual features. This last feature, called `tags-since-dt`, creates a string describing the set of all part-of-speech tags encountered since the most recent determiner, or since the beginning of the sentence if there is no determiner before index `i`.
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     if i == len(sentence)-1:
...         nextword, nextpos = "<END>", "<END>"
...     else:
...         nextword, nextpos = sentence[i+1]
...     return {"pos": pos,
...             "word": word,
...             "prevpos": prevpos,
...             "nextpos": nextpos,
...             "prevpos+pos": "%s+%s" % (prevpos, pos),
...             "pos+nextpos": "%s+%s" % (pos, nextpos),
...             "tags-since-dt": tags_since_dt(sentence, i)}
>>> def tags_since_dt(sentence, i):
...     tags = set()
...     for word, pos in sentence[:i]:
...         if pos == 'DT':
...             tags = set()
...         else:
...             tags.add(pos)
...     return '+'.join(sorted(tags))
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
IOB Accuracy: 96.0%
Precision: 88.6%
Recall: 91.0%
F-Measure: 89.8%
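To see what the `tags-since-dt` feature looks like in practice, the helper can be exercised on a hypothetical tagged sentence. Each determiner resets the accumulated tag set, so only the tags seen after the most recent `DT` survive in the feature string:

```python
def tags_since_dt(sentence, i):
    # Collect the set of POS tags seen since the most recent determiner
    # (or since the start of the sentence if there is none before index i).
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

sent = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
        ("dog", "NN"), ("barked", "VBD")]
print(tags_since_dt(sent, 4))  # 'JJ+NN'
print(tags_since_dt(sent, 1))  # '' -- the DT at position 0 reset the set
```

Because the tags are sorted and joined into a single string, the classifier treats each distinct tag combination as one categorical feature value rather than as independent indicators.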
Note

Your Turn: Try adding different features to the feature extractor function `npchunk_features`, and see whether you can further improve the performance of the NP chunker.