3.3 Training Classifier-Based Chunkers

Both the regular-expression-based chunkers and the n-gram chunkers decide what chunks to create entirely on the basis of part-of-speech tags. Sometimes, however, part-of-speech tags are insufficient to determine how a sentence should be chunked. For example, consider the following two statements:

a. Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.
b. Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.

These two sentences have the same part-of-speech tags, yet they are chunked differently. In the first sentence, *the farmer* and *rice* are separate chunks, while the corresponding material in the second sentence, *the computer monitor*, is a single chunk. Clearly, we need to make use of information about the content of the words, in addition to their part-of-speech tags, if we wish to maximize chunking performance.

One way to incorporate information about word content is to use a classifier-based tagger to chunk the sentence. Like the n-gram chunker, it works by assigning IOB tags to the words of a sentence and then converting those tags to chunks. The basic code for the classifier-based NP chunker is shown below. It consists of two classes: a tagger that assigns IOB tags, and a wrapper that turns that tagger into a chunker by converting between chunk trees and tag sequences.

```python
import nltk

class ConsecutiveNPChunkTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                # Extract features for the current position, given the
                # IOB tags predicted so far.
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        # Requires the external megam package; algorithm='iis' or 'gis'
        # can be used if megam is not installed.
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI):

    def __init__(self, train_sents):
        # Map each chunk tree to a list of ((word, tag), iob-tag) pairs.
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        # Tag the sentence, then convert the IOB tags back into a tree.
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)
```
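The two list comprehensions in `ConsecutiveNPChunker` shuttle between CoNLL-style `(word, tag, chunk)` triples and the `((word, tag), chunk)` pairs that the internal tagger expects. A minimal sketch of that reshaping, using a hand-made sentence (the triples here are illustrative, not drawn from the corpus):

```python
# One sentence as CoNLL-style (word, pos, iob) triples, the format
# produced by nltk.chunk.tree2conlltags().
conll = [("the", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("sat", "VBD", "O")]

# __init__ regroups each triple into ((word, pos), iob), so the IOB tag
# becomes the label that the internal tagger learns to predict.
tagged = [((w, t), c) for (w, t, c) in conll]
print(tagged[0])   # (('the', 'DT'), 'B-NP')

# parse() applies the inverse reshaping before calling conlltags2tree().
back = [(w, t, c) for ((w, t), c) in tagged]
print(back == conll)   # True
```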

The only piece left to fill in is the feature extractor. First, we define a simple feature extractor which just provides the part-of-speech tag of the current token. Using this feature extractor, our classifier-based chunker performs very similarly to the unigram chunker:

```python
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     return {"pos": pos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.7%
    F-Measure:     83.2%
```
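Since the featureset contains nothing but the POS tag, any two sentences with the same tag sequence receive identical featuresets at every position, so this extractor can never chunk them differently. A small sketch with two invented tagged sentences:

```python
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos}

# Two different sentences that share one POS-tag sequence.
s1 = [("Joey", "NN"), ("sold", "VBD"), ("the", "DT"),
      ("farmer", "NN"), ("rice", "NN"), (".", ".")]
s2 = [("Nick", "NN"), ("broke", "VBD"), ("my", "DT"),
      ("computer", "NN"), ("monitor", "NN"), (".", ".")]

# Identical features at every position, so the classifier sees no
# difference between the two sentences.
print(all(npchunk_features(s1, i, []) == npchunk_features(s2, i, [])
          for i in range(len(s1))))   # True
```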

We can also add a feature for the previous part-of-speech tag. Adding this feature allows the chunker to model interactions between adjacent tags, and results in a chunker that is closely related to the bigram chunker.

```python
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  93.6%
    Precision:     81.9%
    Recall:        87.2%
    F-Measure:     84.5%
```
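A quick sanity check of the boundary handling, using an invented two-word sentence: at index 0 the extractor falls back to the `<START>` placeholder, and after that it reports the actual previous tag:

```python
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "prevpos": prevpos}

sent = [("the", "DT"), ("cat", "NN")]
print(npchunk_features(sent, 0, []))  # {'pos': 'DT', 'prevpos': '<START>'}
print(npchunk_features(sent, 1, []))  # {'pos': 'NN', 'prevpos': 'DT'}
```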

Next, we will try adding a feature for the current word, since we hypothesized that word content should be useful for chunking. We find that this feature does indeed improve the chunker's performance, by about 1.5 percentage points (which corresponds to about a 10% reduction in the error rate).

```python
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "word": word, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  94.5%
    Precision:     84.2%
    Recall:        89.4%
    F-Measure:     86.7%
```
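With the word itself included, positions that share the same POS context can now be distinguished by their lexical content. A toy example (the sentence is invented):

```python
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "word": word, "prevpos": prevpos}

sent = [("the", "DT"), ("computer", "NN"), ("monitor", "NN")]
# The history argument carries the IOB tags predicted so far; this
# extractor ignores it, but the tagger still passes it in.
print(npchunk_features(sent, 2, ["B-NP", "I-NP"]))
# {'pos': 'NN', 'word': 'monitor', 'prevpos': 'NN'}
```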

Finally, we can try extending the feature extractor with a variety of additional features, such as lookahead features [1], paired features [2], and complex contextual features [3]. This last feature, called tags-since-dt, creates a string describing the set of all part-of-speech tags that have been encountered since the most recent determiner, or since the beginning of the sentence if there is no determiner before index i.

```python
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     if i == len(sentence)-1:
...         nextword, nextpos = "<END>", "<END>"
...     else:
...         nextword, nextpos = sentence[i+1]
...     return {"pos": pos,
...             "word": word,
...             "prevpos": prevpos,
...             "nextpos": nextpos,                           # [1]
...             "prevpos+pos": "%s+%s" % (prevpos, pos),      # [2]
...             "pos+nextpos": "%s+%s" % (pos, nextpos),
...             "tags-since-dt": tags_since_dt(sentence, i)}  # [3]
```
```python
>>> def tags_since_dt(sentence, i):
...     tags = set()
...     for word, pos in sentence[:i]:
...         if pos == 'DT':
...             tags = set()
...         else:
...             tags.add(pos)
...     return '+'.join(sorted(tags))
```
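tags_since_dt can be exercised on its own. In this invented sentence, the accumulated tag set is reset at every determiner, so the result at index 3 covers only the tags seen after "the", and at index 5, immediately after the determiner "a", it is empty:

```python
def tags_since_dt(sentence, i):
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()   # reset at each determiner
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

sent = [("the", "DT"), ("old", "JJ"), ("cat", "NN"),
        ("on", "IN"), ("a", "DT"), ("mat", "NN")]
print(tags_since_dt(sent, 3))  # 'JJ+NN'
print(tags_since_dt(sent, 5))  # '' (reset by the determiner at index 4)
```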
```python
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  96.0%
    Precision:     88.6%
    Recall:        91.0%
    F-Measure:     89.8%
```

Note

Your Turn: Try adding different features to the feature extractor function npchunk_features, and see whether you can further improve the performance of the NP chunker.