3 Processing Raw Text - 3.4 Regular Expressions for Detecting Word Patterns - Natural Language Processing with Python, 2nd Edition

3.4 Regular Expressions for Detecting Word Patterns

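Many linguistic processing tasks involve pattern matching. For example, we can find words ending with ed using endswith('ed'). We saw a variety of such "word tests" in 4.2. Regular expressions give us a more powerful and flexible method for describing the character patterns we are interested in.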

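Note

There are many other published introductions to regular expressions, organized around the syntax of regular expressions and applied to searching text files. Instead of repeating them, we focus on the use of regular expressions at different stages of linguistic processing. As usual, we adopt a problem-based approach and present new features only as they are needed to solve practical problems. In our discussion we mark regular expressions using chevrons like this: «patt».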

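To use regular expressions in Python we need to import the re library using import re. We also need a list of words to search; we again use the Words Corpus (4). We preprocess it to remove any proper names.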

>>> import re
>>> import nltk
>>> wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

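Using Basic Meta-Characters

Let's find words ending with ed using the regular expression «ed$». We will use the re.search(p, s) function to check whether the pattern p can be found somewhere inside the string s. We need to specify the characters of interest, followed by the dollar sign, which has a special meaning in regular expressions: it matches the end of the word: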

>>> [w for w in wordlist if re.search('ed$', w)]
['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', ...]

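The . wildcard symbol matches any single character. Suppose we have room in a crossword puzzle for an 8-letter word with j as its third letter and t as its sixth letter. In place of each blank cell we use a period: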

>>> [w for w in wordlist if re.search('^..j..t..$', w)]
['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', ...]

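Note

Your Turn: The caret symbol ^ matches the start of a string, just as $ matches the end. What results do we get with the example above if we leave out both of these and search for «..j..t..»?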

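Finally, the ? symbol specifies that the previous character is optional. Thus «^e-?mail$» matches both email and e-mail. We could count the total number of occurrences of this word (in either spelling) in a text using sum(1 for w in text if re.search('^e-?mail$', w)).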
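The meta-characters introduced so far can be tried out without downloading any corpus. The following is a minimal sketch that assumes a small hand-picked word list (chosen for illustration only, not taken from the Words Corpus) and shows the effect of $, of ^ combined with ., and of ?:

```python
import re

# A tiny illustrative word list; it stands in for the Words Corpus.
words = ['abjectly', 'adjuster', 'majestic', 'conjecture', 'email', 'e-mail',
         'emailed', 'walked', 'redo']

# $ anchors the match at the end of the word.
ends_in_ed = [w for w in words if re.search(r'ed$', w)]

# ^ and $ together force the pattern to cover the whole word.
anchored = [w for w in words if re.search(r'^..j..t..$', w)]

# Without the anchors the pattern may match inside a longer word.
unanchored = [w for w in words if re.search(r'..j..t..', w)]

# ? makes the preceding hyphen optional.
email_count = sum(1 for w in words if re.search(r'^e-?mail$', w))

print(ends_in_ed)
print(anchored)
print(unanchored)
print(email_count)
```

In this sample the unanchored pattern also accepts conjecture, because the eight-character window can sit anywhere inside a longer word.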

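Ranges and Closures

Figure 3.5: T9: Text on 9 Keys

The T9 system is used for entering text on mobile phones (see 3.5). Two or more words that are entered with the same sequence of keystrokes are known as textonyms. For example, both hole and golf are entered by pressing the sequence 4653. What other words could be produced with the same sequence? Here we use the regular expression «^[ghi][mno][jlk][def]$»: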

>>> [w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]
['gold', 'golf', 'hold', 'hole']

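The first part of the expression, «^[ghi]», matches the start of a word followed by g, h, or i. The next part, «[mno]», constrains the second character to be m, n, or o. The third and fourth characters are constrained in the same way. Only four words satisfy all these constraints. Note that the order of characters inside the square brackets is not significant, so we could have written «^[hig][nom][ljk][fed]$» and matched the same words.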
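Writing the character classes by hand becomes tedious for longer key sequences. As a rough sketch, the pattern can be generated from the digits themselves. The helper names t9_pattern and textonyms are hypothetical, the key layout is the standard phone keypad, and the sample word list is an assumption that stands in for the Words Corpus:

```python
import re

# Letters printed on each key of a standard phone keypad.
T9_KEYS = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
           '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}

def t9_pattern(digits):
    """Build an anchored regular expression for a sequence of key presses."""
    return '^' + ''.join('[' + T9_KEYS[d] + ']' for d in digits) + '$'

def textonyms(digits, wordlist):
    """Return the words in wordlist that are typed with the given digits."""
    pattern = re.compile(t9_pattern(digits))
    return [w for w in wordlist if pattern.search(w)]

# A small stand-in word list, for illustration only.
sample = ['gold', 'golf', 'hold', 'hole', 'home', 'good', 'holes']
print(t9_pattern('4653'))
print(textonyms('4653', sample))
```

The anchors matter here: without the final $, a five-letter word such as holes would also be accepted.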

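Note

Your Turn: Look for some "finger-twisters" by searching for words that use only part of the number pad. For example «^[ghijklmno]+$», or more concisely «^[g-o]+$», matches words that use only keys 4, 5, 6 in the center row, and «^[a-fj-o]+$» matches words that use keys 2, 3, 5, 6 in the top-right corner. What do - and + mean?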

Let's explore the + symbol a bit further. Notice that it can be applied to individual letters, or to bracketed sets of letters:

>>> chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
>>> [w for w in chat_words if re.search('^m+i+n+e+$', w)]
['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'mine',
'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']
>>> [w for w in chat_words if re.search('^[ha]+$', w)]
['a', 'aaaaaaaaaaaaaaaaa', 'aaahhhh', 'ah', 'ahah', 'ahahah', 'ahh',
'ahhahahaha', 'ahhh', 'ahhhh', 'ahhhhhh', 'ahhhhhhhhhhhhhh', 'h', 'ha', 'haaa',
'hah', 'haha', 'hahaaa', 'hahah', 'hahaha', 'hahahaa', 'hahahah', 'hahahaha', ...]

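It should be clear that + simply means "one or more instances of the preceding item", which could be an individual character like m, a set like [fed], or a range like [d-f]. Now let's replace + with *, which means "zero or more instances of the preceding item". The regular expression «^m*i*n*e*$» matches everything that we found using «^m+i+n+e+$», but also words where some of the letters don't appear at all, e.g. me, min, and mmmmm. Note that the + and * symbols are sometimes referred to as Kleene closures, or simply closures.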

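The ^ operator has another function when it appears as the first character inside square brackets. For example, «[^aeiouAEIOU]» matches any character other than a vowel. We can search the NPS Chat Corpus for words made up entirely of non-vowel characters using «^[^aeiouAEIOU]+$». Notice that the matches include non-alphabetic characters.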
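The difference between +, * and a negated character class is easy to check on a handful of strings. This is a small sketch that assumes an invented list of chat-style tokens rather than the NPS Chat Corpus:

```python
import re

# Invented chat-style tokens, used only for illustration.
tokens = ['mine', 'miiinne', 'me', 'min', 'mmmmm', 'grrr', ':)', 'zzz',
          'haha', 'cyb3r']

# + requires at least one of each letter, in order.
plus = [t for t in tokens if re.search(r'^m+i+n+e+$', t)]

# * also allows any of the letters to be missing.
star = [t for t in tokens if re.search(r'^m*i*n*e*$', t)]

# A leading ^ inside brackets negates the set: no vowels anywhere.
no_vowels = [t for t in tokens if re.search(r'^[^aeiouAEIOU]+$', t)]

print(plus)
print(star)
print(no_vowels)
```

Note that the negated class accepts digits and punctuation as well as consonants, which is why :) and cyb3r appear in the last list.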

Here are some more examples of regular expressions being used to find tokens that match a particular pattern, illustrating some new symbols: \, {}, () and |.

>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> [w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)]
['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5',
'0.50', '0.54', '0.56', '0.60', '0.7', '0.82', '0.84', '0.9', '0.95', '0.99',
'1.01', '1.1', '1.125', '1.14', '1.1650', '1.17', '1.18', '1.19', '1.2', ...]
>>> [w for w in wsj if re.search(r'^[A-Z]+\$$', w)]
['C$', 'US$']
>>> [w for w in wsj if re.search('^[0-9]{4}$', w)]
['1614', '1637', '1787', '1901', '1903', '1917', '1925', '1929', '1933', ...]
>>> [w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]
['10-day', '10-lap', '10-year', '100-share', '12-point', '12-year', ...]
>>> [w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]
['black-and-white', 'bread-and-butter', 'father-in-law', 'machine-gun-toting',
'savings-and-loan']
>>> [w for w in wsj if re.search('(ed|ing)$', w)]
['62%-owned', 'Absorbed', 'According', 'Adopting', 'Advanced', 'Advancing', ...]

Note

Your Turn: Study the previous examples and try to work out what the \, {}, () and | notations mean before you read on.

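You probably worked out that a backslash means that the following character loses its special meaning and must literally match that character in the word. Thus, while . is special, \. matches only a period. Braced expressions, like {3,5}, specify the number of repeats of the previous item. The pipe character indicates a choice between the material on its left and the material on its right. Parentheses indicate the scope of an operator, and they can be used together with the pipe (or disjunction) symbol like this: «w(i|e|ai|oo)t», matching wit, wet, wait, and woot. It is instructive to see what happens when you omit the parentheses from the last expression in the example and search for «ed|ing$».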

The meta-characters we have seen are summarized in 3.3:

Table 3.3: Basic Regular Expression Meta-Characters, Including Wildcards, Ranges and Closures
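Each of the meta-characters discussed so far can be checked against one string that should match and one that should not. The sketch below does this with invented test strings (they are assumptions, not corpus data), and then shows what changes when the parentheses around «ed|ing» are dropped:

```python
import re

# (pattern, a string that should match, a string that should not)
examples = [
    (r'^[0-9]+\.[0-9]+$', '0.05', '005'),          # \. is a literal period
    (r'^[A-Z]+\$$', 'US$', 'US'),                  # \$ is a literal dollar sign
    (r'^[0-9]{4}$', '1929', '19290'),              # exactly four digits
    (r'^[0-9]+-[a-z]{3,5}$', '10-year', '10-yr'),  # three to five letters
    (r'^w(i|e|ai|oo)t$', 'wait', 'wat'),           # disjunction inside parentheses
    (r'(ed|ing)$', 'walking', 'singer'),           # either suffix at the end
]

for pattern, good, bad in examples:
    assert re.search(pattern, good) and not re.search(pattern, bad)

# Without parentheses the $ applies only to "ing", so "ed" may match anywhere.
with_parens = bool(re.search(r'(ed|ing)$', 'editor'))
without_parens = bool(re.search(r'ed|ing$', 'editor'))
print(with_parens, without_parens)
```

The patterns are written as raw strings (with the r prefix) so that Python passes the backslashes through to the re module unchanged.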

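The re.findall() function returns every non-overlapping match of a pattern in a string. Here it finds all the vowels in a word, and len() then counts them:

>>> word = 'supercalifragilisticexpialidocious'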
>>> re.findall(r'[aeiou]', word)
['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
>>> len(re.findall(r'[aeiou]', word))
16

Let's look for all sequences of two or more vowels in some text, and determine their relative frequency:

>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj
...                       for vs in re.findall(r'[aeiou]{2,}', word))
>>> fd.most_common(12)
[('io', 549), ('ea', 476), ('ie', 331), ('ou', 329), ('ai', 261), ('ia', 253),
('ee', 217), ('oo', 174), ('ua', 109), ('au', 106), ('ue', 105), ('ui', 95)]
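The same counting can be sketched with the standard library alone. The example below assumes a few invented words in place of the Treebank vocabulary and uses collections.Counter as a stand-in for nltk.FreqDist:

```python
import re
from collections import Counter

# A few invented words; the text counts over the Treebank vocabulary instead.
words = ['nation', 'ratio', 'beautiful', 'queue', 'seat', 'read']

# {2,} is greedy, so each run of vowels is reported once, at full length.
counts = Counter(vs for word in words
                 for vs in re.findall(r'[aeiou]{2,}', word))

print(counts.most_common(2))
print(counts['eau'], counts['ueue'])
```

Because the matches do not overlap, the eau in beautiful is counted as one sequence rather than as ea plus au.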

Note

Your Turn: In the W3C Date Time Format, dates are represented like this: 2009-12-31. Replace the ? in the following Python code with a regular expression, in order to convert the string '2009-12-31' to a list of integers [2009, 12, 31]:

[int(n) for n in re.findall(?, '2009-12-31')]

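Doing More with Word Pieces

Once we can use re.findall() to extract material from words, there are interesting things to do with the pieces, such as gluing them back together or plotting them.

It is sometimes noted that English text is highly redundant, and that it is still easy to read when word-internal vowels are left out. For example, declaration becomes dclrtn, and inalienable becomes inlnble, retaining any initial or final vowel sequences. The regular expression in our next example matches initial vowel sequences, final vowel sequences, and all consonants; everything else is ignored. This three-way disjunction is processed left to right: if one of the three parts matches the word, any later parts of the regular expression are ignored. We use re.findall() to extract all the matching pieces, and ''.join() to join them together (see 3.9 for more about the join operation).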

>>> regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
>>> def compress(word):
...     pieces = re.findall(regexp, word)
...     return ''.join(pieces)
...
>>> english_udhr = nltk.corpus.udhr.words('English-Latin1')
>>> print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))
Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and
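To see which pieces the three alternatives contribute, it can help to print the intermediate list before joining it. This sketch reuses the regular expression above on the two words mentioned in the text, plus area, an extra word chosen here to show a word that begins and ends with vowels:

```python
import re

regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    """Drop word-internal vowels, keeping initial and final vowel runs."""
    return ''.join(re.findall(regexp, word))

for word in ['declaration', 'inalienable', 'area']:
    # Show the extracted pieces next to the compressed result.
    print(word, re.findall(regexp, word), compress(word))
```

In area every vowel belongs to either the initial or the final vowel run, so the word comes through unchanged.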

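Next, let's combine regular expressions with conditional frequency distributions. Here we extract all consonant-vowel sequences from the words of Rotokas, such as ka and si. Since each of these is a pair, it can be used to initialize a conditional frequency distribution. We then tabulate the frequency of each pair: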

>>> rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
>>> cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cfd = nltk.ConditionalFreqDist(cvs)
>>> cfd.tabulate()
     a    e    i    o    u
k  418  148   94  420  173
p   83   31  105   34   51
r  187   63   84   89   79
s    0    0  100    2    1
t   47    8    0  148   37
v   93   27  105   48   49

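Examining the rows for s and t, we see that they are in partial "complementary distribution", which is evidence that they are not distinct phonemes in the language. Thus, we could conceivably drop s from the Rotokas alphabet and simply have a pronunciation rule that the letter t is pronounced s when followed by i. (Note that the single entry with su, namely kasuari, 'cassowary', is borrowed from English.)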

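If we want to inspect the words behind the numbers in the table, it would be helpful to have an index that lets us quickly find the list of words containing a given consonant-vowel pair. For example, cv_index['su'] should give us all words containing su. Here's how we can do this: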

>>> cv_word_pairs = [(cv, w) for w in rotokas_words
...                          for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cv_index = nltk.Index(cv_word_pairs)
>>> cv_index['su']
['kasuari']
>>> cv_index['po']
['kaapo', 'kaapopato', 'kaipori', 'kaiporipie', 'kaiporivira', 'kapo', 'kapoa',
'kapokao', 'kapokapo', 'kapokapo', 'kapokapoa', 'kapokapoa', 'kapokapora', ...]

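This program processes each word w in turn, and for each one finds every substring that matches the regular expression «[ptksvr][aeiou]». In the case of the word kasuari, it finds ka, su and ri. Therefore, the cv_word_pairs list will contain ('ka', 'kasuari'), ('su', 'kasuari') and ('ri', 'kasuari'). One further step, using nltk.Index(), converts this into a useful index.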
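The indexing step can also be sketched in plain Python. The example below assumes only that the index maps each consonant-vowel pair to a list of words, and uses collections.defaultdict as a stand-in for nltk.Index. The four words are taken from the output shown above; the real lexicon is far larger:

```python
import re
from collections import defaultdict

# Four words from the output above, used as a tiny stand-in lexicon.
words = ['kasuari', 'kapo', 'kaapo', 'kaipori']

cv_word_pairs = [(cv, w) for w in words
                 for cv in re.findall(r'[ptksvr][aeiou]', w)]

# Group the words by consonant-vowel pair.
cv_index = defaultdict(list)
for cv, w in cv_word_pairs:
    cv_index[cv].append(w)

print(cv_index['su'])
print(cv_index['po'])
```

A word appears in a list once per occurrence of the pair, which is why a word with a repeated pair is listed twice in the full index.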

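Finding Word Stems

When we use a web search engine, we usually don't mind (or even notice) if the words in the document differ from our search terms in having different endings. A query for laptops finds documents containing laptop and vice versa. Indeed, laptop and laptops are just two forms of the same dictionary word (or lemma). For some language processing tasks we want to ignore word endings and deal only with word stems.

There are various ways to pull out the stem of a word. Here is a simple-minded approach that just strips off anything that looks like a suffix: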

>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix):
...             return word[:-len(suffix)]
...     return word

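Although we will ultimately use NLTK's built-in stemmers, it is interesting to see how we can use regular expressions for this task. Our first step is to build a disjunction of all the suffixes. We need to enclose it in parentheses in order to limit the scope of the disjunction.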

>>> re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['ing']

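Here, re.findall() gave us just the suffix even though the regular expression matched the entire word. This is because the parentheses have a second function: selecting the substrings to be extracted. If we want to use parentheses to specify the scope of the disjunction, but not to select the material to be output, we have to add ?:, which is just one of many arcane subtleties of regular expressions. Here is the revised version.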

>>> re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['processing']

However, we would actually like to split the word into stem and suffix. So we should parenthesize both parts of the regular expression:

>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
[('process', 'ing')]

This looks promising, but it still has a problem. Let's look at a different word, processes:

>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('processe', 's')]

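The regular expression incorrectly found an -s suffix instead of an -es suffix. This demonstrates another subtlety: the star operator is "greedy", so the .* part of the expression tries to consume as much of the input as possible. If we use the "non-greedy" version of the star operator, written *?, we get what we want: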

>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('process', 'es')]

This works even when we allow an empty suffix, by making the content of the second pair of parentheses optional:

>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')
[('language', '')]
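The greedy and non-greedy variants can be compared side by side. The following is a sketch only: split_suffix is a hypothetical helper, and it uses Python's named groups, (?P<name>...), so that the two parts can be retrieved by name instead of by position:

```python
import re

SUFFIXES = r'ing|ly|ed|ious|ies|ive|es|s|ment'

# Greedy stem with an obligatory suffix, as in the first attempt above.
GREEDY = re.compile(r'^(?P<stem>.*)(?P<suffix>' + SUFFIXES + r')$')
# Non-greedy stem with an optional suffix, as in the final version above.
LAZY = re.compile(r'^(?P<stem>.*?)(?P<suffix>' + SUFFIXES + r')?$')

def split_suffix(word, pattern=LAZY):
    """Split word into (stem, suffix); the suffix may be empty."""
    match = pattern.match(word)
    if match is None:
        return word, ''
    return match.group('stem'), match.group('suffix') or ''

for word in ['processing', 'processes', 'language']:
    print(word, split_suffix(word, GREEDY), split_suffix(word, LAZY))
```

Only processes is split differently by the two patterns, since it is the only one of the three words that ends in more than one candidate suffix.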

This approach still has many problems (can you spot them?), but we will move on to define a function to perform stemming, and apply it to a whole text:

>>> def stem(word):
...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
...     stem, suffix = re.findall(regexp, word)[0]
...     return stem
...
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government.  Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> from nltk import word_tokenize
>>> tokens = word_tokenize(raw)
>>> [stem(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond', 'distribut',
'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'Supreme',
'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',',
'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']

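Notice that our regular expression removed the s from ponds, but also from is and basis. It produced some non-words like distribut and deriv, but these are acceptable stems in some applications.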
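To list exactly which tokens the stemmer alters, we can compare each token with its stem. This sketch assumes a crude regular-expression tokenizer in place of word_tokenize, which is adequate for this one sentence but is not a general substitute:

```python
import re

def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

raw = ("strange women lying in ponds distributing swords "
       "is no basis for a system of government.")

# Crude stand-in tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r'\w+|[^\w\s]', raw)

# Keep only the tokens that the stemmer changes.
changed = [(t, stem(t)) for t in tokens if stem(t) != t]
print(changed)
```

The list makes the over-stemming visible: short function words such as is lose their final letter just as readily as genuine plurals do.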

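Searching Tokenized Text

You can use a special kind of regular expression for searching across multiple words in a text (where a text is a list of tokens). For example, "<a> <man>" finds all instances of a man in the text. The angle brackets mark token boundaries, and any whitespace between the angle brackets is ignored (this behavior is specific to NLTK's findall() method for texts). In the first example below, we include <.*>, which matches any single token, and enclose it in parentheses so that only the matched word (e.g. monied) and not the matched phrase (e.g. a monied man) is produced. The second example finds three-word phrases ending with the word bro. The last example finds sequences of three or more words starting with the letter l.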

>>> from nltk.corpus import gutenberg, nps_chat
>>> moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
>>> moby.findall(r"<a> (<.*>) <man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> chat = nltk.Text(nps_chat.words())
>>> chat.findall(r"<.*> <.*> <bro>")
you rule bro; telling you bro; u twizted bro
>>> chat.findall(r"<l.*>{3,}")
lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la
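One way to picture how such token-level patterns can work is to encode the token list as a single string of <token> units and translate the pattern into an ordinary regular expression. The sketch below is a simplified illustration built on the standard re module, not NLTK's actual implementation. The function name token_findall is hypothetical, and it assumes that tokens contain no angle brackets and that the pattern has at most one pair of capturing parentheses:

```python
import re

def token_findall(tokens, pattern):
    """Search a token list with an angle-bracket pattern (simplified sketch)."""
    # Encode the token list as one string, e.g. "<a><monied><man>".
    raw = ''.join('<' + t + '>' for t in tokens)
    # Whitespace in the pattern is ignored.
    pattern = re.sub(r'\s', '', pattern)
    # Wrap each <...> in a non-capturing group so that a following
    # quantifier such as {3,} applies to the whole token.
    pattern = pattern.replace('<', '(?:<(?:').replace('>', ')>)')
    # Stop an unescaped . from matching across a token boundary.
    pattern = re.sub(r'(?<!\\)\.', '[^>]', pattern)
    hits = re.findall(pattern, raw)
    # Turn "<lol><lol>" back into ['lol', 'lol'].
    return [hit[1:-1].split('><') for hit in hits]

tokens = 'he was a monied man and a good man lol lol lol ok la la'.split()
print(token_findall(tokens, '<a> (<.*>) <man>'))
print(token_findall(tokens, '<l.*>{3,}'))
```

The translation of . into [^>] is what keeps <.*> confined to a single token; with a plain .* the pattern could swallow several tokens at once.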

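Note

Your Turn: Consolidate your understanding of regular expression patterns and substitutions using nltk.re_show(p, s), which annotates the string s to show every place where pattern p was matched, and nltk.app.nemo(), which provides a graphical interface for exploring regular expressions. For more practice, try some of the exercises on regular expressions at the end of this chapter.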

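It is easy to build search patterns when the linguistic phenomenon we are studying is tied to particular words. In some cases, a little creativity will go a long way. For instance, searching a large text corpus for expressions of the form x and other ys allows us to discover hypernyms (see 5):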

>>> from nltk.corpus import brown
>>> hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
>>> hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")
speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals

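With enough text, this approach would give us a useful store of information about the taxonomy of objects, without the need for any manual labor. However, our search results will usually contain false positives, i.e. cases that we would want to exclude. For example, the result demands and other factors suggests that demand is an instance of the type factor, but the sentence is actually about wage demands. Nevertheless, we could construct our own ontology of English concepts by manually correcting the output of such searches.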
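When the text is a plain string rather than a token list, the same idea can be sketched with capturing groups, which return each candidate as a (hyponym, hypernym) pair ready for manual review. The sentences below are invented around three of the results shown above:

```python
import re

# Invented sentences echoing three of the search results above.
text = ("They sell iron and other metals. "
        "The map shows roads and other features. "
        "Wage demands and other factors raised costs.")

# Capture the word before "and other" and the plural noun after it.
pairs = re.findall(r'\b(\w+) and other (\w+s)\b', text)
for hyponym, hypernym in pairs:
    print(hyponym, '->', hypernym)
```

The last pair is the false positive discussed above: the pattern cannot tell that demands here refers to wage demands.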

Note

This combination of automatic and manual processing is the most common way for new corpora to be constructed. We will return to this in 11.

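Searching corpora also suffers from the problem of false negatives, i.e. omitting cases that we would want to include. It is risky to conclude that some linguistic phenomenon doesn't exist in a corpus just because we couldn't find any instances of a search pattern. Perhaps we just didn't think carefully enough about suitable patterns.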

Note

Your Turn: Look for instances of the pattern as x as y to discover information about entities and their properties.