18. Natural Language Processing

Original: Natural Language Processing

Translator: 飞龙

License: CC BY-NC-SA 4.0

Natural language processing (NLP) is the use of computational methods to analyze text data.

Here is natural language processing on Wikipedia.

NLTK: Natural Language Toolkit

NLTK is the main Python module for text analysis.

The NLTK organization website is here, and they have a whole tutorial book here.

NLTK

NLTK provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

    # Import NLTK
    import nltk
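
Since the description above mentions stemming among NLTK's text processing utilities, here is a brief aside sketching what that looks like. This example is not part of the original walkthrough; PorterStemmer is one of several stemmers NLTK provides, and the example words are arbitrary.

    # A quick stemming example: reduce related word forms to a shared root
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    print([stemmer.stem(word) for word in ['study', 'studies', 'studying']])
    # All three forms reduce to the same stem: ['studi', 'studi', 'studi']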

In this notebook, we will walk through some basic text analysis using some useful functionality from the NLTK package.

To work with text data, you often need to use corpora - datasets of text - to compare against. NLTK has many such datasets available, but they are not installed by default (since the full collection of them would be very large). Below we will download some of these datasets.

    # If you hit an error when downloading in the cell below,
    # come back to this cell, uncomment it, and run this code.
    # This code gives Python permission to write to disk (if it doesn't already have permission to do so).
    import ssl

    try:
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:
        pass
    else:
        ssl._create_default_https_context = _create_unverified_https_context

    # Download some useful data files from NLTK
    nltk.download('punkt')
    nltk.download('stopwords')
    nltk.download('averaged_perceptron_tagger')
    nltk.download('maxent_ne_chunker')
    nltk.download('words')
    nltk.download('treebank')
    '''
    [nltk_data] Downloading package punkt to /Users/tom/nltk_data...
    [nltk_data]   Package punkt is already up-to-date!
    [nltk_data] Downloading package stopwords to /Users/tom/nltk_data...
    [nltk_data]   Package stopwords is already up-to-date!
    [nltk_data] Downloading package averaged_perceptron_tagger to
    [nltk_data]     /Users/tom/nltk_data...
    [nltk_data]   Package averaged_perceptron_tagger is already up-to-
    [nltk_data]     date!
    [nltk_data] Downloading package maxent_ne_chunker to
    [nltk_data]     /Users/tom/nltk_data...
    [nltk_data]   Package maxent_ne_chunker is already up-to-date!
    [nltk_data] Downloading package words to /Users/tom/nltk_data...
    [nltk_data]   Package words is already up-to-date!
    [nltk_data] Downloading package treebank to /Users/tom/nltk_data...
    [nltk_data]   Package treebank is already up-to-date!
    True
    '''

    # Set up a test sentence of data to examine
    sentence = "UC San Diego is a great place to study cognitive science."

Tokenization

Tokenization is the process of splitting text data into 'tokens', which are meaningful pieces of data.

More information on tokenization is available here.

Tokenization can be done at different levels - for example, you can tokenize text into sentences and/or into words.

    # Tokenize our sentence at the word level
    tokens = nltk.word_tokenize(sentence)

    # Check out the word-tokenized data
    print(tokens)
    # ['UC', 'San', 'Diego', 'is', 'a', 'great', 'place', 'to', 'study', 'cognitive', 'science', '.']
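
The block above tokenizes at the word level. As a complementary sketch, sentence-level tokenization works the same way with NLTK's sent_tokenize (the two-sentence string below is just an illustrative example, not part of the original data):

    # Tokenize a short text at the sentence level
    text = "NLTK can split text into sentences. Each sentence can then be split into words."
    print(nltk.sent_tokenize(text))
    # ['NLTK can split text into sentences.', 'Each sentence can then be split into words.']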

Part-of-Speech (POS) Tagging

Part-of-speech tagging is the process of labeling words by their 'type' and their relationship to other words.

Here is part-of-speech tagging on Wikipedia.

    # Apply part-of-speech tagging to our sentence
    tags = nltk.pos_tag(tokens)

    # Check out the POS tags for our data
    print(tags)
    # [('UC', 'NNP'), ('San', 'NNP'), ('Diego', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('place', 'NN'), ('to', 'TO'), ('study', 'VB'), ('cognitive', 'JJ'), ('science', 'NN'), ('.', '.')]

    # Check out the documentation that describes what all of the abbreviations mean
    nltk.help.upenn_tagset()
    '''
    $: dollar
        $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
    '': closing quotation mark
        ' ''
    (: opening parenthesis
        ( [ {
    ): closing parenthesis
        ) ] }
    ,: comma
        ,
    --: dash
        --
    .: sentence terminator
        . ! ?
    :: colon or ellipsis
        : ; ...
    CC: conjunction, coordinating
        & 'n and both but either et for less minus neither nor or plus so
        therefore times v. versus vs. whether yet
    CD: numeral, cardinal
        mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
        seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
        fifteen 271,124 dozen quintillion DM2,000 ...
    DT: determiner
        all an another any both del each either every half la many much nary
        neither no some such that the them these this those
    EX: existential there
        there
    FW: foreign word
        gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
        lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
        terram fiche oui corporis ...
    IN: preposition or conjunction, subordinating
        astride among uppon whether out inside pro despite on by throughout
        below within for towards near behind atop around if like until below
        next into if beside ...
    JJ: adjective or numeral, ordinal
        third ill-mannered pre-war regrettable oiled calamitous first separable
        ectoplasmic battery-powered participatory fourth still-to-be-named
        multilingual multi-disciplinary ...
    JJR: adjective, comparative
        bleaker braver breezier briefer brighter brisker broader bumper busier
        calmer cheaper choosier cleaner clearer closer colder commoner costlier
        cozier creamier crunchier cuter ...
    JJS: adjective, superlative
        calmest cheapest choicest classiest cleanest clearest closest commonest
        corniest costliest crassest creepiest crudest cutest darkest deadliest
        dearest deepest densest dinkiest ...
    LS: list item marker
        A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005
        SP-44007 Second Third Three Two * a b c d first five four one six three
        two
    MD: modal auxiliary
        can cannot could couldn't dare may might must need ought shall should
        shouldn't will would
    NN: noun, common, singular or mass
        common-carrier cabbage knuckle-duster Casino afghan shed thermostat
        investment slide humour falloff slick wind hyena override subhumanity
        machinist ...
    NNP: noun, proper, singular
        Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
        Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
        Shannon A.K.C. Meltex Liverpool ...
    NNPS: noun, proper, plural
        Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists
        Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques
        Apache Apaches Apocrypha ...
    NNS: noun, common, plural
        undergraduates scotches bric-a-brac products bodyguards facets coasts
        divestitures storehouses designs clubs fragrances averages
        subjectivists apprehensions muses factory-jobs ...
    PDT: pre-determiner
        all both half many quite such sure this
    POS: genitive marker
        ' 's
    PRP: pronoun, personal
        hers herself him himself hisself it itself me myself one oneself ours
        ourselves ownself self she thee theirs them themselves they thou thy us
    PRP$: pronoun, possessive
        her his mine my our ours their thy your
    RB: adverb
        occasionally unabatingly maddeningly adventurously professedly
        stirringly prominently technologically magisterially predominately
        swiftly fiscally pitilessly ...
    RBR: adverb, comparative
        further gloomier grander graver greater grimmer harder harsher
        healthier heavier higher however larger later leaner lengthier less-
        perfectly lesser lonelier longer louder lower more ...
    RBS: adverb, superlative
        best biggest bluntest earliest farthest first furthest hardest
        heartiest highest largest least less most nearest second tightest worst
    RP: particle
        aboard about across along apart around aside at away back before behind
        by crop down ever fast for forth from go high i.e. in into just later
        low more off on open out over per pie raising start teeth that through
        under unto up up-pp upon whole with you
    SYM: symbol
        % & ' '' ''. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R * ** ***
    TO: "to" as preposition or infinitive marker
        to
    UH: interjection
        Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen
        huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly
        man baby diddle hush sonuvabitch ...
    VB: verb, base form
        ask assemble assess assign assume atone attention avoid bake balkanize
        bank begin behold believe bend benefit bevel beware bless boil bomb
        boost brace break bring broil brush build ...
    VBD: verb, past tense
        dipped pleaded swiped regummed soaked tidied convened halted registered
        cushioned exacted snubbed strode aimed adopted belied figgered
        speculated wore appreciated contemplated ...
    VBG: verb, present participle or gerund
        telegraphing stirring focusing angering judging stalling lactating
        hankerin' alleging veering capping approaching traveling besieging
        encrypting interrupting erasing wincing ...
    VBN: verb, past participle
        multihulled dilapidated aerosolized chaired languished panelized used
        experimented flourished imitated reunifed factored condensed sheared
        unsettled primed dubbed desired ...
    VBP: verb, present tense, not 3rd person singular
        predominate wrap resort sue twist spill cure lengthen brush terminate
        appear tend stray glisten obtain comprise detest tease attract
        emphasize mold postpone sever return wag ...
    VBZ: verb, present tense, 3rd person singular
        bases reconstructs marks mixes displeases seals carps weaves snatches
        slumps stretches authorizes smolders pictures emerges stockpiles
        seduces fizzes uses bolsters slaps speaks pleads ...
    WDT: WH-determiner
        that what whatever which whichever
    WP: WH-pronoun
        that what whatever whatsoever which who whom whosoever
    WP$: WH-pronoun, possessive
        whose
    WRB: Wh-adverb
        how however whence whenever where whereby whereever wherein whereof why
    ``: opening quotation mark
    '''
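
Rather than printing the whole tag set, the same helper can also be asked about a single tag; a minimal sketch, assuming the same NLTK help data used above:

    # Look up the meaning of a single POS tag
    nltk.help.upenn_tagset('NNP')
    # NNP: noun, proper, singular
    #     Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos ...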

Named Entity Recognition (NER)

Named entity recognition aims to label words with the entity types to which they relate.

Here is named entity recognition on Wikipedia.

    # Apply named entity recognition to our POS tags
    entities = nltk.chunk.ne_chunk(tags)

    # Check out the named entities
    print(entities)
    '''
    (S
      UC/NNP
      (PERSON San/NNP Diego/NNP)
      is/VBZ
      a/DT
      great/JJ
      place/NN
      to/TO
      study/VB
      cognitive/JJ
      science/NN
      ./.)
    '''
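
ne_chunk returns an nltk.Tree, so if you only want the labeled entities themselves, one option is to walk its subtrees. A minimal sketch, reusing the entities tree from above (the names entity_label and entity_text are just illustrative):

    # Collect the text of each labeled entity chunk in the tree
    for subtree in entities.subtrees():
        if subtree.label() != 'S':  # skip the root sentence node
            entity_label = subtree.label()
            entity_text = ' '.join(word for word, tag in subtree.leaves())
            print(entity_label, '->', entity_text)
    # PERSON -> San Diego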

Stop Words

'Stop words' are the most common words of a language, which we often want to filter out before doing text analysis.

Here are stop words on Wikipedia.

    # Check out the corpus of stop words in English
    print(nltk.corpus.stopwords.words('english'))
    # ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
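
In practice these lists are usually used as a filter. A minimal sketch, reusing the word tokens of the test sentence from earlier (the names stops and filtered are just illustrative):

    # Keep only the tokens that are not English stop words
    stops = set(nltk.corpus.stopwords.words('english'))
    filtered = [tok for tok in tokens if tok.lower() not in stops]
    print(filtered)
    # ['UC', 'San', 'Diego', 'great', 'place', 'study', 'cognitive', 'science', '.']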

Text Encoding

One of the key components of NLP is deciding how to encode the text data.

Common encodings are:

  • Bag of Words (BoW)
    • Text is encoded as a collection of words and their frequencies (see the toy sketch after this list)
  • Term Frequency - Inverse Document Frequency (TF-IDF)
    • TF-IDF is a weighting that stores words with respect to how common they are across the corpus.
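
To make the bag-of-words idea concrete before the full walkthrough, here is a toy sketch on a single made-up sentence (the string is arbitrary; the real book data is used below):

    # A bag of words is just a set of word counts - order and grammar are discarded
    from collections import Counter

    toy_sentence = "the cat saw the dog and the dog saw the cat"
    print(Counter(toy_sentence.split()))
    # Counter({'the': 4, 'cat': 2, 'saw': 2, 'dog': 2, 'and': 1})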

We will walk through examples of BoW and TF-IDF text encodings.

    # Imports
    %matplotlib inline

    # Standard Python has some useful string tools
    import string

    # Collections is part of standard Python, with some useful data objects
    from collections import Counter

    import numpy as np
    import matplotlib.pyplot as plt

    # Scikit-learn has some useful NLP tools, such as a TFIDF vectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer

The data we will look at is a small subset of the BookCorpus dataset. The original dataset can be found here: http://yknzhu.wixsite.com/mbweb

The original dataset was collected from more than 11,000 books, and has already been tokenized at both the sentence and word level. The small subset provided and used here contains the first 10,000 sentences.

    # Load the data
    with open('files/book10k.txt', 'r') as f:
        sents = f.readlines()

    # Check out the data - print out the first and last sentences, as examples
    print(sents[0])
    print(sents[-1])
    '''
    the half-ling book one in the fall of igneeria series kaylee soderburg copyright 2013 kaylee soderburg all rights reserved .
    alejo was sure the fact that he was nervously repeating mass along with five wrinkly , age-encrusted spanish women meant that stalin was rethinking whether he was going to pay the price .
    '''

    # Preprocessing: strip all extra whitespace from the sentences
    sents = [sent.strip() for sent in sents]

We will first have a look at the word frequencies in the documents, and then print out the 10 most frequent words.

    # Tokenize all the sentences into words
    # This collects all the word tokens together into one big list
    tokens = []
    for x in sents:
        tokens.extend(nltk.word_tokenize(x))

    # Check out how many words are in the data
    print('Number of words in the data: \t', len(tokens))
    print('Number of unique words: \t', len(set(tokens)))
    '''
    Number of words in the data:     140060
    Number of unique words:          8221
    '''

    # Use the 'Counter' object to count how many times each word appears
    counts = Counter(tokens)

    # Check out the counts object
    # This is basically a 'bag of words' representation of this corpus
    # We lose word order and grammar - it's just a collection of words
    # What we have is a list of all the words, and how often they appear
    counts

One thing you might notice, if you scroll through the word list above, is that it still includes punctuation. Let's remove those.

    # The 'string' module (standard library) has a useful list of punctuation
    print(string.punctuation)
    # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

    # Remove all punctuation tokens from the counts object
    for punc in string.punctuation:
        if punc in counts:
            counts.pop(punc)

    # Get the 10 most frequent words
    top10 = counts.most_common(10)

    # Extract the top words, and their counts
    top10_words = [it[0] for it in top10]
    top10_counts = [it[1] for it in top10]

    # Plot a barplot of the most frequent words in the text
    plt.barh(top10_words, top10_counts)
    plt.title('Term Frequency');
    plt.xlabel('Frequency');

[figure: "Term Frequency" bar plot of the 10 most frequent words]

As we can see, the words that appear most often in the documents are words like 'the', 'was', and 'a'.

These frequently occurring words are not very useful for figuring out what these documents are about, or as a way to use and understand this text data.

    # Drop all stop words
    for stop in nltk.corpus.stopwords.words('english'):
        if stop in counts:
            counts.pop(stop)

    # Get the top 20 most frequent words in the data, with the stop words removed
    top20 = counts.most_common(20)

    # Plot a barplot of the most frequent words in the text
    plt.barh([it[0] for it in top20], [it[1] for it in top20])
    plt.title('Term Frequency');
    plt.xlabel('Frequency');

[figure: "Term Frequency" bar plot of the top 20 words after stop word removal]

This looks potentially more relevant / useful. We could continue exploring this BoW model, but for now let's switch over and explore the data with TF-IDF.

    # Initialize a TFIDF object
    tfidf = TfidfVectorizer(analyzer='word',
                            sublinear_tf=True,
                            max_features=5000,
                            tokenizer=nltk.word_tokenize)

    # Apply the TFIDF transformation to our data
    # Note that this takes our sentences and tokenizes them, then applies TFIDF
    tfidf_books = tfidf.fit_transform(sents).toarray()

The TfidfVectorizer will compute the inverse document frequency (IDF) for each word.

The TFIDF is then computed as TF * IDF, which downweights words that appear frequently. This TFIDF is stored in the tfidf_books variable, an n_documents x n_words matrix that encodes the documents in a TFIDF representation.
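
For reference, here is a minimal sketch of the weighting being applied, assuming scikit-learn's default smooth_idf=True (the sublinear_tf=True option passed above additionally replaces the raw term frequency with 1 + log(tf); the function smoothed_idf below is just an illustrative helper, not part of the library):

    # IDF with add-one smoothing, as used by TfidfVectorizer by default:
    #   idf(t) = ln((1 + n_documents) / (1 + df(t))) + 1
    # where df(t) is the number of documents that contain term t
    import numpy as np

    def smoothed_idf(n_documents, document_frequency):
        return np.log((1 + n_documents) / (1 + document_frequency)) + 1

    # A word in almost every document gets a weight near 1,
    # while a word in only one document gets a much larger weight
    print(smoothed_idf(10000, 9500), smoothed_idf(10000, 1))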

Let's first plot the IDF for each of the 10 most frequently appearing words (from the first analysis).

    # Get the IDF weights for the 10 most frequently appearing words
    IDF_weights = [tfidf.idf_[tfidf.vocabulary_[token]] for token in top10_words]

    # Plot the IDF scores for these very common words
    plt.barh(top10_words, IDF_weights)
    plt.title('Inverse Document Frequency');
    plt.xlabel('IDF Score');

[figure: "Inverse Document Frequency" bar plot of IDF scores for the 10 most frequent words]

We can compare that plot to the following one, which shows the 10 words with the highest IDF.

    # Get the words with the highest IDF scores
    inds = np.argsort(tfidf.idf_)[::-1][:10]
    top_IDF_tokens = [list(tfidf.vocabulary_)[ind] for ind in inds]
    top_IDF_scores = tfidf.idf_[inds]

    # Plot the words with the highest IDF scores
    plt.barh(top_IDF_tokens, top_IDF_scores)
    plt.title('Inverse Document Frequency');
    plt.xlabel('IDF Score');

[figure: "Inverse Document Frequency" bar plot of the 10 words with the highest IDF scores]

As we can see, words that appear frequently in the documents get very low IDF scores, compared to rarer words.

With TF-IDF we have successfully downweighted the words that appear frequently in the documents. This lets us represent documents by their most distinctive words, which can be a more useful way to represent text data.
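
As a closing sketch of that idea, here is one way to pull out the most distinctive words of a single document from the matrix built above (assuming a scikit-learn version that provides get_feature_names_out; the helper top_tfidf_words is just illustrative):

    # Map feature indices back to words
    feature_names = tfidf.get_feature_names_out()

    def top_tfidf_words(doc_index, n=5):
        """Return the n words with the highest TFIDF scores for one document."""
        row = tfidf_books[doc_index]
        top_inds = np.argsort(row)[::-1][:n]
        return [(feature_names[ind], row[ind]) for ind in top_inds]

    # For example, the most distinctive words of the first sentence
    print(top_tfidf_words(0))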