11. 语言学数据管理 - 4.3 ElementTree 接口 - 《Python 自然语言处理第二版》

4.3 ElementTree 接口

4.3 ElementTree 接口

Python 的 ElementTree 模块提供了一种方便的方式访问存储在 XML 文件中的数据。ElementTree 是 Python 标准库（自从 Python 2.5）的一部分，也作为 NLTK 的一部分提供，以防你在使用 Python 2.4。

我们将使用 XML 格式的莎士比亚戏剧集来说明 ElementTree 的使用方法。让我们加载 XML 文件并检查原始数据，首先在文件的顶部，在那里我们看到一些 XML 头和一个名为play.dtd的模式，接着是根元素 PLAY。我们从 Act 1再次获得数据。（输出中省略了一些空白行。）

>>> merchant_file = nltk.data.find('corpora/shakespeare/merchant.xml')
>>> raw = open(merchant_file).read()
>>> print(raw[:163]) ![[1]](/projects/nlp-py-2e-zh/Images/346344c2e5a627acfdddf948fb69cb1d.jpg)
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="shakes.css"?>
<!-- <!DOCTYPE PLAY SYSTEM "play.dtd"> -->
<PLAY>
<TITLE>The Merchant of Venice</TITLE>
>>> print(raw[1789:2006]) ![[2]](/projects/nlp-py-2e-zh/Images/f9e1ba3246770e3ecb24f813f33f2075.jpg)
<TITLE>ACT I</TITLE>
<SCENE><TITLE>SCENE I.  Venice. A street.</TITLE>
<STAGEDIR>Enter ANTONIO, SALARINO, and SALANIO</STAGEDIR>
<SPEECH>
<SPEAKER>ANTONIO</SPEAKER>
<LINE>In sooth, I know not why I am so sad:</LINE>

我们刚刚访问了作为一个字符串的 XML 数据。正如我们看到的，在 Act 1 开始处的字符串包含 XML 标记 title、scene、stage directions 等。

下一步是作为结构化的 XML 数据使用ElementTree处理文件的内容。我们正在处理一个文件（一个多行字符串），并建立一棵树，所以方法的名称是parse 并不奇怪。变量merchant包含一个 XML 元素PLAY 。此元素有内部结构；我们可以使用一个索引来得到它的第一个孩子，一个TITLE元素。我们还可以看到该元素的文本内容：戏剧的标题。要得到所有的子元素的列表，我们使用getchildren()方法。

>>> from xml.etree.ElementTree import ElementTree
>>> merchant = ElementTree().parse(merchant_file) ![[1]](/projects/nlp-py-2e-zh/Images/346344c2e5a627acfdddf948fb69cb1d.jpg)
>>> merchant
<Element 'PLAY' at 0x10ac43d18> # [_element-play]
>>> merchant[0]
<Element 'TITLE' at 0x10ac43c28> # [_element-title]
>>> merchant[0].text
'The Merchant of Venice' # [_element-text]
>>> merchant.getchildren() ![[5]](/projects/nlp-py-2e-zh/Images/63a8e4c47e813ba9630363f9b203a19a.jpg)
[<Element 'TITLE' at 0x10ac43c28>, <Element 'PERSONAE' at 0x10ac43bd8>,
<Element 'SCNDESCR' at 0x10b067f98>, <Element 'PLAYSUBT' at 0x10af37048>,
<Element 'ACT' at 0x10af37098>, <Element 'ACT' at 0x10b936368>,
<Element 'ACT' at 0x10b934b88>, <Element 'ACT' at 0x10cfd8188>,
<Element 'ACT' at 0x10cfadb38>]

这部戏剧由标题、角色、一个场景的描述、字幕和五幕组成。每一幕都有一个标题和一些场景，每个场景由台词组成，台词由行组成，有四个层次嵌套的结构。让我们深入到第四幕：

>>> merchant[-2][0].text
'ACT IV'
>>> merchant[-2][1]
<Element 'SCENE' at 0x10cfd8228>
>>> merchant[-2][1][0].text
'SCENE I.  Venice. A court of justice.'
>>> merchant[-2][1][54]
<Element 'SPEECH' at 0x10cfb02c8>
>>> merchant[-2][1][54][0]
<Element 'SPEAKER' at 0x10cfb0318>
>>> merchant[-2][1][54][0].text
'PORTIA'
>>> merchant[-2][1][54][1]
<Element 'LINE' at 0x10cfb0368>
>>> merchant[-2][1][54][1].text
"The quality of mercy is not strain'd,"

注意

轮到你来：对语料库中包含的其他莎士比亚戏剧，如《罗密欧与朱丽叶》或《麦克白》，重复上述的一些方法；方法列表请参阅nltk.corpus.shakespeare.fileids()。

虽然我们可以通过这种方式访问整个树，使用特定名称查找子元素会更加方便。回想一下顶层的元素有几种类型。我们可以使用merchant.findall('ACT')遍历我们感兴趣的类型（如幕）。下面是一个做这种特定标记在每一个级别的嵌套搜索的例子：

>>> for i, act in enumerate(merchant.findall('ACT')):
...     for j, scene in enumerate(act.findall('SCENE')):
...         for k, speech in enumerate(scene.findall('SPEECH')):
...             for line in speech.findall('LINE'):
...                 if 'music' in str(line.text):
...                     print("Act %d Scene %d Speech %d: %s" % (i+1, j+1, k+1, line.text))
Act 3 Scene 2 Speech 9: Let music sound while he doth make his choice;
Act 3 Scene 2 Speech 9: Fading in music: that the comparison
Act 3 Scene 2 Speech 9: And what is music then? Then music is
Act 5 Scene 1 Speech 23: And bring your music forth into the air.
Act 5 Scene 1 Speech 23: Here will we sit and let the sounds of music
Act 5 Scene 1 Speech 23: And draw her home with music.
Act 5 Scene 1 Speech 24: I am never merry when I hear sweet music.
Act 5 Scene 1 Speech 25: Or any air of music touch their ears,
Act 5 Scene 1 Speech 25: By the sweet power of music: therefore the poet
Act 5 Scene 1 Speech 25: But music for the time doth change his nature.
Act 5 Scene 1 Speech 25: The man that hath no music in himself,
Act 5 Scene 1 Speech 25: Let no such man be trusted. Mark the music.
Act 5 Scene 1 Speech 29: It is your music, madam, of the house.
Act 5 Scene 1 Speech 32: No better a musician than the wren.

不是沿着层次结构向下遍历每一级，我们可以寻找特定的嵌入的元素。例如，让我们来看看演员的顺序。我们可以使用频率分布看看谁最能说：

>>> from collections import Counter
>>> speaker_seq = [s.text for s in merchant.findall('ACT/SCENE/SPEECH/SPEAKER')]
>>> speaker_freq = Counter(speaker_seq)
>>> top5 = speaker_freq.most_common(5)
>>> top5
[('PORTIA', 117), ('SHYLOCK', 79), ('BASSANIO', 73),
('GRATIANO', 48), ('LORENZO', 47)]

我们也可以查看对话中谁跟着谁的模式。由于有 23 个演员，我们需要首先使用3中描述的方法将“词汇”减少到可处理的大小。

>>> from collections import defaultdict
>>> abbreviate = defaultdict(lambda: 'OTH')
>>> for speaker, _ in top5:
...     abbreviate[speaker] = speaker[:4]
...
>>> speaker_seq2 = [abbreviate[speaker] for speaker in speaker_seq]
>>> cfd = nltk.ConditionalFreqDist(nltk.bigrams(speaker_seq2))
>>> cfd.tabulate()
 ANTO BASS GRAT  OTH PORT SHYL
ANTO    0   11    4   11    9   12
BASS   10    0   11   10   26   16
GRAT    6    8    0   19    9    5
 OTH    8   16   18  153   52   25
PORT    7   23   13   53    0   21
SHYL   15   15    2   26   21    0

忽略 153 的条目，因为是前五位角色（标记为OTH）之间相互对话，最大的值表示 Othello 和 Portia 的相互对话最多。