4.3 The ElementTree Interface

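Python's ElementTree module provides a convenient way to access data stored in XML files. ElementTree has been part of Python's standard library since Python 2.5, and is also provided as part of NLTK in case you are using Python 2.4.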

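We will illustrate the use of ElementTree with a collection of Shakespeare plays that have been formatted using XML. Let's load the XML file and inspect the raw data, first at the top of the file [1], where we see some XML headers and the name of a schema called play.dtd, followed by the root element PLAY. We then pick it up again at the start of Act 1 [2]. (Some blank lines have been omitted from the output.)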

    >>> import nltk
    >>> merchant_file = nltk.data.find('corpora/shakespeare/merchant.xml')
    >>> raw = open(merchant_file).read()
    >>> print(raw[:163])  # [1]
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/css" href="shakes.css"?>
    <!-- <!DOCTYPE PLAY SYSTEM "play.dtd"> -->
    <PLAY>
    <TITLE>The Merchant of Venice</TITLE>
    >>> print(raw[1789:2006])  # [2]
    <TITLE>ACT I</TITLE>
    <SCENE><TITLE>SCENE I. Venice. A street.</TITLE>
    <STAGEDIR>Enter ANTONIO, SALARINO, and SALANIO</STAGEDIR>
    <SPEECH>
    <SPEAKER>ANTONIO</SPEAKER>
    <LINE>In sooth, I know not why I am so sad:</LINE>

We have just accessed the XML data as a string. As we can see, the string at the start of Act 1 contains XML tags for the title, scene, stage directions, and so forth.

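The next step is to process the file contents as structured XML data, using ElementTree. We are processing a file (a multi-line string) and building a tree, so it is not surprising that the method is called parse [1]. The variable merchant contains an XML element PLAY [2]. This element has internal structure; we can use an index to get its first child, a TITLE element [3]. We can also see the text content of this element, the title of the play [4]. To get a list of all the child elements, we convert the element to a list with list(merchant) [5]; the older getchildren() method did the same job but was deprecated and then removed in Python 3.9.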

    >>> from xml.etree.ElementTree import ElementTree
    >>> merchant = ElementTree().parse(merchant_file)  # [1]
    >>> merchant  # [2]
    <Element 'PLAY' at 0x10ac43d18>
    >>> merchant[0]  # [3]
    <Element 'TITLE' at 0x10ac43c28>
    >>> merchant[0].text  # [4]
    'The Merchant of Venice'
    >>> list(merchant)  # [5] replaces getchildren(), which was removed in Python 3.9
    [<Element 'TITLE' at 0x10ac43c28>, <Element 'PERSONAE' at 0x10ac43bd8>,
    <Element 'SCNDESCR' at 0x10b067f98>, <Element 'PLAYSUBT' at 0x10af37048>,
    <Element 'ACT' at 0x10af37098>, <Element 'ACT' at 0x10b936368>,
    <Element 'ACT' at 0x10b934b88>, <Element 'ACT' at 0x10cfd8188>,
    <Element 'ACT' at 0x10cfadb38>]
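The same operations can be tried without the corpus installed. The following is a minimal sketch that parses a small, made-up XML string (the `SAMPLE` text and its contents are invented for illustration and are not taken from merchant.xml) using `fromstring`, then indexes, reads `.text`, and lists the children.

```python
import xml.etree.ElementTree as ET

# A tiny, invented document shaped like the top of a play file.
SAMPLE = """<PLAY>
<TITLE>A Toy Play</TITLE>
<PERSONAE><PERSONA>FIRST</PERSONA></PERSONAE>
<ACT><TITLE>ACT I</TITLE></ACT>
<ACT><TITLE>ACT II</TITLE></ACT>
</PLAY>"""

root = ET.fromstring(SAMPLE)        # parse from a string instead of a file
print(root.tag)                     # tag of the root element
print(root[0].tag, root[0].text)    # first child and its text content

# Iterating over an element yields its direct children, in document order.
child_tags = [child.tag for child in root]
print(child_tags)

# Negative indexes work as they do for lists: the last ACT, then its TITLE.
last_act_title = root[-1][0].text
print(last_act_title)
```

Because an element behaves like a sequence of its children, `len(root)`, slicing, and `for` loops all work on it directly.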

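The play consists of a title, the personae, a scene description, a subtitle, and five acts. Each act has a title and some scenes, and each scene consists of speeches that are made up of lines, a structure with four levels of nesting. Let's dig down into Act IV: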

    >>> merchant[-2][0].text
    'ACT IV'
    >>> merchant[-2][1]
    <Element 'SCENE' at 0x10cfd8228>
    >>> merchant[-2][1][0].text
    'SCENE I. Venice. A court of justice.'
    >>> merchant[-2][1][54]
    <Element 'SPEECH' at 0x10cfb02c8>
    >>> merchant[-2][1][54][0]
    <Element 'SPEAKER' at 0x10cfb0318>
    >>> merchant[-2][1][54][0].text
    'PORTIA'
    >>> merchant[-2][1][54][1]
    <Element 'LINE' at 0x10cfb0368>
    >>> merchant[-2][1][54][1].text
    "The quality of mercy is not strain'd,"
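Chained indexes are easier to write once the shape of the tree is known. As a rough sketch, the hypothetical helper `outline` below (not part of ElementTree or NLTK) walks an element recursively and records each tag with its depth; it is run here on an invented sample string rather than on the play itself.

```python
import xml.etree.ElementTree as ET

# Invented sample with the same nesting as a play: ACT > SCENE > SPEECH > LINE.
SAMPLE = """<PLAY><ACT><TITLE>ACT I</TITLE><SCENE><TITLE>SCENE I</TITLE>
<SPEECH><SPEAKER>FIRST</SPEAKER><LINE>Good morrow.</LINE></SPEECH>
</SCENE></ACT></PLAY>"""

def outline(elem, depth=0):
    """Return (depth, tag) pairs for elem and all descendants, in document order."""
    pairs = [(depth, elem.tag)]
    for child in elem:
        pairs.extend(outline(child, depth + 1))
    return pairs

root = ET.fromstring(SAMPLE)
pairs = outline(root)
for depth, tag in pairs:
    print("  " * depth + tag)       # indent each tag by its depth

# Depth of the deepest element below the root.
max_depth = max(depth for depth, _ in pairs)
print(max_depth)
```

On this sample the deepest elements sit four levels below the root, matching the act, scene, speech, and line nesting described above.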

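Note

Your Turn: Repeat some of the above methods for one of the other Shakespeare plays included in the corpus, such as Romeo and Juliet or Macbeth; for a list of the available plays, see nltk.corpus.shakespeare.fileids().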

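Although we can access the entire tree this way, it is more convenient to search for sub-elements with particular names. Recall that the elements at the top level have several types. We can iterate over just the type we are interested in (such as the acts) using merchant.findall('ACT'). Here is an example of doing such tag-specific searches at every level of nesting: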

    >>> for i, act in enumerate(merchant.findall('ACT')):
    ...     for j, scene in enumerate(act.findall('SCENE')):
    ...         for k, speech in enumerate(scene.findall('SPEECH')):
    ...             for line in speech.findall('LINE'):
    ...                 if 'music' in str(line.text):
    ...                     print("Act %d Scene %d Speech %d: %s" % (i+1, j+1, k+1, line.text))
    Act 3 Scene 2 Speech 9: Let music sound while he doth make his choice;
    Act 3 Scene 2 Speech 9: Fading in music: that the comparison
    Act 3 Scene 2 Speech 9: And what is music then? Then music is
    Act 5 Scene 1 Speech 23: And bring your music forth into the air.
    Act 5 Scene 1 Speech 23: Here will we sit and let the sounds of music
    Act 5 Scene 1 Speech 23: And draw her home with music.
    Act 5 Scene 1 Speech 24: I am never merry when I hear sweet music.
    Act 5 Scene 1 Speech 25: Or any air of music touch their ears,
    Act 5 Scene 1 Speech 25: By the sweet power of music: therefore the poet
    Act 5 Scene 1 Speech 25: But music for the time doth change his nature.
    Act 5 Scene 1 Speech 25: The man that hath no music in himself,
    Act 5 Scene 1 Speech 25: Let no such man be trusted. Mark the music.
    Act 5 Scene 1 Speech 29: It is your music, madam, of the house.
    Act 5 Scene 1 Speech 32: No better a musician than the wren.
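Two details of this search are worth seeing in isolation. The test `'music' in ...` is a substring test, which is why a line containing "musician" is also reported. The `str(line.text)` call presumably guards against LINE elements whose `.text` is `None`, which happens when a LINE begins with a child element. The sketch below wraps the same nested loops in a hypothetical function `find_lines` and runs it on invented data; it uses `line.text or ''` as one alternative to `str()`.

```python
import xml.etree.ElementTree as ET

# Invented sample; the second speech has a LINE that starts with a child element.
SAMPLE = """<PLAY>
<ACT><SCENE>
<SPEECH><SPEAKER>A</SPEAKER><LINE>No song today.</LINE></SPEECH>
<SPEECH><SPEAKER>B</SPEAKER><LINE>Let music sound.</LINE><LINE><STAGEDIR>Aside</STAGEDIR></LINE></SPEECH>
</SCENE></ACT>
<ACT><SCENE>
<SPEECH><SPEAKER>A</SPEAKER><LINE>A fine musician.</LINE></SPEECH>
</SCENE></ACT>
</PLAY>"""

def find_lines(root, word):
    """Return (act, scene, speech, text) for every LINE whose text contains word."""
    hits = []
    for i, act in enumerate(root.findall('ACT'), start=1):
        for j, scene in enumerate(act.findall('SCENE'), start=1):
            for k, speech in enumerate(scene.findall('SPEECH'), start=1):
                for line in speech.findall('LINE'):
                    text = line.text or ''    # .text is None if LINE starts with a child
                    if word in text:          # substring test: also matches "musician"
                        hits.append((i, j, k, text))
    return hits

hits = find_lines(ET.fromstring(SAMPLE), 'music')
for act, scene, speech, text in hits:
    print("Act %d Scene %d Speech %d: %s" % (act, scene, speech, text))
```

Passing `start=1` to `enumerate` avoids the `i+1`, `j+1`, `k+1` arithmetic when formatting the result.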

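Instead of navigating each step of the way down the hierarchy, we can search for particular embedded elements. For example, let's examine the sequence of speakers. We can use a frequency distribution to see who has the most to say: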

    >>> from collections import Counter
    >>> speaker_seq = [s.text for s in merchant.findall('ACT/SCENE/SPEECH/SPEAKER')]
    >>> speaker_freq = Counter(speaker_seq)
    >>> top5 = speaker_freq.most_common(5)
    >>> top5
    [('PORTIA', 117), ('SHYLOCK', 79), ('BASSANIO', 73),
    ('GRATIANO', 48), ('LORENZO', 47)]
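A path such as 'ACT/SCENE/SPEECH/SPEAKER' matches only elements reached through exactly that chain of tags. As a small sketch on invented data, the example below compares it with `iter('SPEAKER')`, a standard ElementTree method that finds matching elements at any depth; on a regularly structured document the two give the same sequence.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample with three speeches by two speakers.
SAMPLE = """<PLAY><ACT><SCENE>
<SPEECH><SPEAKER>A</SPEAKER><LINE>x</LINE></SPEECH>
<SPEECH><SPEAKER>B</SPEAKER><LINE>y</LINE></SPEECH>
<SPEECH><SPEAKER>A</SPEAKER><LINE>z</LINE></SPEECH>
</SCENE></ACT></PLAY>"""

root = ET.fromstring(SAMPLE)

# Explicit path: only SPEAKER elements under ACT/SCENE/SPEECH.
by_path = [s.text for s in root.findall('ACT/SCENE/SPEECH/SPEAKER')]

# iter(): every SPEAKER element anywhere below the root.
by_iter = [s.text for s in root.iter('SPEAKER')]

counts = Counter(by_path)
print(by_path, by_iter)
print(counts.most_common(1))
```

The explicit path is the safer choice when the same tag can occur in several places, as TITLE does in the plays (under PLAY, ACT, and SCENE).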

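We can also look for patterns in who follows whom in the dialogues. Since there are 23 speakers, we first need to reduce the "vocabulary" to a manageable size, using the method described in 3.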

    >>> from collections import defaultdict
    >>> abbreviate = defaultdict(lambda: 'OTH')
    >>> for speaker, _ in top5:
    ...     abbreviate[speaker] = speaker[:4]
    ...
    >>> speaker_seq2 = [abbreviate[speaker] for speaker in speaker_seq]
    >>> cfd = nltk.ConditionalFreqDist(nltk.bigrams(speaker_seq2))
    >>> cfd.tabulate()
         ANTO BASS GRAT  OTH PORT SHYL
    ANTO    0   11    4   11    9   12
    BASS   10    0   11   10   26   16
    GRAT    6    8    0   19    9    5
     OTH    8   16   18  153   52   25
    PORT    7   23   13   53    0   21
    SHYL   15   15    2   26   21    0

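Ignoring the entry of 153, which counts exchanges among speakers outside the top five (all labeled OTH), the largest values are 53 and 52, for turns passing between Portia and those other speakers. Among the named speakers, Bassanio and Portia follow one another most often (26 and 23 times).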
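The abbreviation and pairing steps can be sketched with the standard library alone. In the example below the speaker sequence and the choice of `top` speakers are invented for illustration, and `Counter(zip(...))` stands in for the NLTK bigram table; it is a substitute technique, not how `ConditionalFreqDist` is implemented.

```python
from collections import Counter, defaultdict

# Invented speaker sequence and a hypothetical set of "top" speakers.
speaker_seq = ['PORTIA', 'NERISSA', 'PORTIA', 'SHYLOCK',
               'PORTIA', 'BALTHASAR', 'STEPHANO']
top = ['PORTIA', 'SHYLOCK']

# Any speaker without an explicit entry falls back to the label 'OTH'.
abbreviate = defaultdict(lambda: 'OTH')
for speaker in top:
    abbreviate[speaker] = speaker[:4]

short_seq = [abbreviate[s] for s in speaker_seq]

# Pair each speaker with the one who speaks next, then count the pairs.
pair_counts = Counter(zip(short_seq, short_seq[1:]))

for (first, second), n in sorted(pair_counts.items()):
    print(first, '->', second, n)
```

Each key is a (speaker, next speaker) pair, so `pair_counts[('PORT', 'OTH')]` plays the role of the PORT row and OTH column in the table above.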