2.4 Exploring Text Corpora

In 2, we saw how we could interrogate a tagged corpus to extract phrases matching a particular sequence of part-of-speech tags. We can do the same work more easily with a chunker, as follows:

  >>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
  >>> brown = nltk.corpus.brown
  >>> for sent in brown.tagged_sents():
  ...     tree = cp.parse(sent)
  ...     for subtree in tree.subtrees():
  ...         if subtree.label() == 'CHUNK': print(subtree)
  ...
  (CHUNK combined/VBN to/TO achieve/VB)
  (CHUNK continue/VB to/TO place/VB)
  (CHUNK serve/VB to/TO protect/VB)
  (CHUNK wanted/VBD to/TO wait/VB)
  (CHUNK allowed/VBN to/TO place/VB)
  (CHUNK expected/VBN to/TO become/VB)
  ...
  (CHUNK seems/VBZ to/TO overtake/VB)
  (CHUNK want/VB to/TO buy/VB)

Note

Your Turn: Encapsulate the above example inside a function find_chunks() that takes a chunk string like "CHUNK: {<V.*> <TO> <V.*>}" as an argument. Use it to search the corpus for several other patterns, such as four or more nouns in a row, e.g. "NOUNS: {<N.*>{4,}}"
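One possible sketch of such a function follows. It differs slightly from the exercise as stated: besides the chunk string, it takes the tagged sentences as a second parameter so it can be tried on any corpus, and it recovers the chunk label (e.g. CHUNK or NOUNS) by splitting the grammar string at the first colon; both of those design choices are assumptions, not part of the exercise.

```python
import nltk

def find_chunks(pattern, tagged_sents):
    """Print every chunk matching the grammar `pattern`.

    `pattern` is a chunk string such as 'CHUNK: {<V.*> <TO> <V.*>}';
    `tagged_sents` is an iterable of POS-tagged sentences, e.g.
    nltk.corpus.brown.tagged_sents().
    """
    cp = nltk.RegexpParser(pattern)
    # The label of matching subtrees is the name before the colon
    # in the grammar string, e.g. 'CHUNK' or 'NOUNS'.
    label = pattern.split(':', 1)[0].strip()
    for sent in tagged_sents:
        tree = cp.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == label:
                print(subtree)
```

Calling find_chunks('CHUNK: {<V.*> <TO> <V.*>}', nltk.corpus.brown.tagged_sents()) should reproduce the output shown above, and find_chunks('NOUNS: {<N.*>{4,}}', ...) finds runs of four or more nouns.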