2.2 Counting Words by Genre

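In Section 1, we saw a conditional frequency distribution where the condition was the section of the Brown Corpus, and for each condition we counted words. Whereas FreqDist() takes a simple list as input, ConditionalFreqDist() takes a list of pairs.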

    >>> import nltk
    >>> from nltk.corpus import brown
    >>> cfd = nltk.ConditionalFreqDist(
    ...           (genre, word)
    ...           for genre in brown.categories()
    ...           for word in brown.words(categories=genre))

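Let's break this down and look at just two genres, news and romance. For each genre [2], we loop over every word in the genre [3], producing pairs consisting of the genre and the word [1]: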

    >>> genre_word = [(genre, word)  # [1]
    ...               for genre in ['news', 'romance']  # [2]
    ...               for word in brown.words(categories=genre)]  # [3]
    >>> len(genre_word)
    170576

So, as we can see in the following code, pairs at the beginning of the list genre_word will be of the form ('news', word) [1], while those at the end will be of the form ('romance', word) [2].

    >>> genre_word[:4]
    [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]  # [1]
    >>> genre_word[-4:]
    [('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]  # [2]
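The ordering of the pairs follows directly from how a nested list comprehension is evaluated: the first `for` clause is the outer loop and the second is the inner loop. The sketch below illustrates this with a small hand-made dictionary, `toy_corpus`, which is a hypothetical stand-in for the Brown Corpus and is not part of NLTK.

```python
# Hypothetical stand-in for the corpus: genre -> list of words.
toy_corpus = {
    'news': ['The', 'jury', 'said'],
    'romance': ['She', 'was', 'afraid'],
}

# Nested comprehension with the same shape as the genre_word example:
# the first "for" (genre) is the outer loop, the second (word) the inner.
pairs = [(genre, word)
         for genre in ['news', 'romance']
         for word in toy_corpus[genre]]

# The same thing written as explicit nested loops.
pairs_loop = []
for genre in ['news', 'romance']:
    for word in toy_corpus[genre]:
        pairs_loop.append((genre, word))

print(pairs == pairs_loop)   # both constructions yield the same list
print(pairs[0], pairs[-1])   # first pair is from 'news', last from 'romance'
```

Because the outer loop finishes all the words of one genre before moving to the next, every 'news' pair precedes every 'romance' pair.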

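We can now use this list of pairs to create a ConditionalFreqDist and save it in a variable cfd. As usual, we can type the name of the variable to inspect it [1], and verify that it has two conditions [2]: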

    >>> cfd = nltk.ConditionalFreqDist(genre_word)
    >>> cfd  # [1]
    <ConditionalFreqDist with 2 conditions>
    >>> cfd.conditions()
    ['news', 'romance']  # [2]
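Conceptually, a conditional frequency distribution can be pictured as a mapping from each condition to its own counter. The following is a minimal sketch of that idea using only the standard library; it is not NLTK's implementation, and the names `pairs` and `grouped` as well as the toy data are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# A tiny, made-up list of (condition, sample) pairs.
pairs = [('news', 'the'), ('news', 'jury'), ('news', 'the'),
         ('romance', 'she'), ('romance', 'the')]

# One Counter per condition; each pair increments the count
# of its sample under its condition.
grouped = defaultdict(Counter)
for condition, sample in pairs:
    grouped[condition][sample] += 1

conditions = sorted(grouped)      # the distinct conditions seen
print(conditions)
print(grouped['news']['the'])     # count of 'the' under 'news'
print(grouped['romance']['the'])  # count of 'the' under 'romance'
```

The same word is counted separately under each condition, which is what makes per-genre comparisons possible.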

Let's access the two conditions and satisfy ourselves that each is just a frequency distribution:

    >>> print(cfd['news'])
    <FreqDist with 14394 samples and 100554 outcomes>
    >>> print(cfd['romance'])
    <FreqDist with 8452 samples and 70022 outcomes>
    >>> cfd['romance'].most_common(20)
    [(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502),
    ('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993),
    ('I', 951), ('in', 875), ('he', 702), ('had', 692), ('?', 690),
    ('her', 651), ('that', 583), ('it', 573), ('his', 559), ('she', 496)]
    >>> cfd['romance']['could']
    193
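Raw counts are hard to compare across genres of different sizes, so it can help to turn them into proportions. The sketch below redoes that arithmetic by hand from the figures printed above (samples, outcomes, and the count for 'could'); the variable names are illustrative assumptions. In NLTK itself, a FreqDist also offers N() for the total number of outcomes and freq() for a sample's relative frequency.

```python
# Figures copied from the output above.
news_types, news_tokens = 14394, 100554       # samples, outcomes for 'news'
romance_types, romance_tokens = 8452, 70022   # samples, outcomes for 'romance'
could_count = 193                             # cfd['romance']['could']

# Relative frequency of 'could' among all romance tokens.
could_rel_freq = could_count / romance_tokens

# Distinct words per token in each genre (a rough lexical-diversity measure).
news_ratio = news_types / news_tokens
romance_ratio = romance_types / romance_tokens

print(f"'could' in romance: {could_rel_freq:.4%}")
print(f"distinct words per token, news:    {news_ratio:.4f}")
print(f"distinct words per token, romance: {romance_ratio:.4f}")
```

On these figures, 'could' accounts for roughly 0.28% of the romance tokens, and the news text has slightly more distinct words per token than the romance text.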