3 Processing Raw Text - 3.9 Formatting: From Lists to Strings - Natural Language Processing with Python (2nd edition)

3.9 Formatting: From Lists to Strings

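Often we write a program to report a single data item, such as a particular element in a corpus that meets some complicated criterion, or a single summary statistic such as a word count or the performance of a tagger. More often, we write a program to produce a structured result: for example, a tabulation of numbers or linguistic forms, or a reformatting of the original data. When the results to be presented are linguistic, textual output is usually the most natural choice. When the results are numerical, graphical output may be preferable. In this section, you will learn about a variety of ways to present program output.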

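From Lists to Strings

The simplest kind of structured object we use for text processing is a list of words. When we want to output such a list to a display or a file, we must first convert it into a string. To do this in Python we use the join() method, and specify the string to be used as the "glue".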

>>> silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']
>>> ' '.join(silly)
'We called him Tortoise because he taught us .'
>>> ';'.join(silly)
'We;called;him;Tortoise;because;he;taught;us;.'
>>> ''.join(silly)
'WecalledhimTortoisebecausehetaughtus.'

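So ' '.join(silly) means: take all the items in silly and concatenate them into one big string, using ' ' as a spacer between the items. In other words, join() is a method of the string that you want to use as the glue. (Many people find this notation for join() counter-intuitive.) The join() method only works on a sequence of strings, such as a list of strings (what we have been calling a text), a complex type that enjoys some privileges in Python.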
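Because join() is a string method, every item it glues together must already be a string. The following is a minimal sketch, using a made-up list of integers (`counts` is hypothetical, not from the text), of what happens otherwise and one common way around it:

```python
# Hypothetical data for illustration: a list of integers, not strings.
counts = [3, 4, 1]

try:
    ', '.join(counts)            # join() refuses non-string items
except TypeError as err:
    print('TypeError:', err)

# Convert each item to a string first, then join.
joined = ', '.join(str(n) for n in counts)
print(joined)
```

Converting inside a generator expression keeps the original list untouched and avoids building an intermediate list.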

Strings and Formats

We have seen that there are two ways to display the contents of an object:

>>> word = 'cat'
>>> sentence = """hello
... world"""
>>> print(word)
cat
>>> print(sentence)
hello
world
>>> word
'cat'
>>> sentence
'hello\nworld'

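The print() function yields Python's attempt to produce the most human-readable form of an object. The second method, naming the variable at the interactive prompt, shows us a string that can be used to recreate the object. It is important to keep in mind that both of these are just strings, displayed for the benefit of you, the user. They give us no clue as to the actual internal representation of the object.

There are many other useful ways to display an object as a string of characters. This may be for the benefit of a human reader, or because we want to export our data to a particular file format for use in an external program.

Formatted output typically contains a combination of variables and pre-specified strings. For example, given a frequency distribution fdist, we could do: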
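The two displays correspond to the built-in functions str() and repr(). As a small sketch (the variable names `readable` and `recreatable` are introduced here only for illustration), we can capture both forms as ordinary strings and check that the second one really does recreate the object:

```python
import ast

sentence = 'hello\nworld'

readable = str(sentence)      # the form that print() displays
recreatable = repr(sentence)  # the form that the interactive prompt displays

print(readable)
print(recreatable)

# The repr of a string is valid Python source for that same string.
assert ast.literal_eval(recreatable) == sentence
```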

>>> import nltk
>>> fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
>>> for word in sorted(fdist):
...     print(word, '->', fdist[word], end='; ')
cat -> 3; dog -> 4; snake -> 1;

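Print statements that contain alternating variables and constants can be difficult to read and maintain. A better solution is to use string formatting expressions.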

>>> for word in sorted(fdist):
...    print('{}->{};'.format(word, fdist[word]), end=' ')
cat->3; dog->4; snake->1;

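To understand what is going on here, let's test out the string formatting expression on its own. (By now this will be your usual method of exploring new syntax.)

>>> '{}->{};'.format('cat', 3)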
'cat->3;'

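The curly brackets '{}' mark the presence of a replacement field: it acts as a placeholder for the string value of an object passed to the str.format() method. We can embed occurrences of '{}' inside a string, then have them replaced by calling format() with appropriate arguments. A string containing replacement fields is called a format string.

Let's unpack the above code further, in order to see this behavior up close: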

>>> '{}->'.format('cat')
'cat->'
>>> '{}'.format(3)
'3'
>>> 'I want a {} right now'.format('coffee')
'I want a coffee right now'

We can have any number of placeholders, but the str.format method must be called with at least that many arguments.

>>> '{} wants a {} {}'.format('Lee', 'sandwich', 'for lunch')
'Lee wants a sandwich for lunch'
>>> '{} wants a {} {}'.format('sandwich', 'for lunch')
Traceback (most recent call last):
  ...
IndexError: tuple index out of range

Arguments to format() are consumed from left to right, and any superfluous arguments are simply ignored.

>>> '{} wants a {}'.format('Lee', 'sandwich', 'for lunch')
'Lee wants a sandwich'

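A replacement field in a format string can start with a number, which refers to a positional argument of format(). Something like 'from {} to {}' is equivalent to 'from {0} to {1}', but we can use the numbers to get a non-default order: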

>>> 'from {1} to {0}'.format('A', 'B')
'from B to A'
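Numbered fields are not limited to reordering. As a brief sketch (the strings and the names `echo` and `order` are made up for illustration), the same index may appear more than once, and a field can also carry a name that is filled from a keyword argument of format():

```python
# A numbered field may be used more than once.
echo = '{0}, {1}, {0}'.format('A', 'B')
print(echo)

# Fields can also be named and filled from keyword arguments.
order = '{name} wants a {food}'.format(name='Lee', food='sandwich')
print(order)
```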

We can also provide the values for the placeholders indirectly. Here is an example using a for loop:

>>> template = 'Lee wants a {} right now'
>>> menu = ['sandwich', 'spam fritter', 'pancake']
>>> for snack in menu:
...     print(template.format(snack))
...
Lee wants a sandwich right now
Lee wants a spam fritter right now
Lee wants a pancake right now

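Lining Things Up

So far our format strings have generated output of arbitrary width on the page (or screen). We can add padding to obtain output of a given width by inserting a colon ':' followed by an integer into the brackets. So {:6} specifies that we want a value padded to width 6. Numbers are right-justified by default, but we can precede the width specifier with the '<' alignment option to make them left-justified.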

>>> '{:6}'.format(41)
'    41'
>>> '{:<6}'.format(41)
'41    '

Strings are left-justified by default, but can be right-justified with the '>' alignment option.

>>> '{:6}'.format('dog')
'dog   '
>>> '{:>6}'.format('dog')
'   dog'
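The format specification has a few more alignment controls beyond '<' and '>'. The sketch below is only illustrative (the variable names are made up): it shows the '^' option for centering, a fill character placed before the alignment option, and zero padding for numbers.

```python
centered = '{:^7}'.format('dog')     # '^' centers the value in 7 columns
starred = '{:*>6}'.format(41)        # '*' is the fill character, '>' right-aligns
zero_padded = '{:06}'.format(41)     # a leading 0 pads numbers with zeros

# repr() makes the padding visible when printed.
print(repr(centered))
print(repr(starred))
print(repr(zero_padded))
```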

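Other control characters can be used to specify the sign and precision of floating point numbers; for example, {:.4f} indicates that four digits should be displayed after the decimal point.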

>>> import math
>>> '{:.4f}'.format(math.pi)
'3.1416'

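String formatting is smart enough to know that if you include a '%' in your format specification, you want the value displayed as a percentage; there is no need to multiply by 100 yourself.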

>>> count, total = 3205, 9375
>>> "accuracy for {} words: {:.4%}".format(total, count / total)
'accuracy for 9375 words: 34.1867%'
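To make the behavior of '%' concrete, here is a small sketch that reuses the counts above and compares the percentage format with the same computation spelled out by hand. It also shows the '+' option, which forces a sign on positive numbers. The variable names are introduced only for this illustration.

```python
count, total = 3205, 9375
ratio = count / total

as_percent = '{:.2%}'.format(ratio)        # multiplies by 100 and appends '%'
by_hand = '{:.2f}%'.format(ratio * 100)    # the same result, spelled out
print(as_percent, by_hand)

gain = '{:+.1f}'.format(2.5)               # '+' shows the sign of positive numbers
loss = '{:+.1f}'.format(-2.5)              # negative numbers keep their '-'
print(gain, loss)
```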

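An important use of format strings is for tabulating data. Recall that in Chapter 1 we saw data being tabulated from a conditional frequency distribution. Let's perform the tabulation ourselves, exercising full control of headings and column widths, as shown in 3.11. Note the clear separation between the language processing work and the tabulation of results.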

def tabulate(cfdist, words, categories):
    print('{:16}'.format('Category'), end=' ')                    # column headings
    for word in words:
        print('{:>6}'.format(word), end=' ')
    print()
    for category in categories:
        print('{:16}'.format(category), end=' ')                  # row heading
        for word in words:                                        # for each word
            print('{:6}'.format(cfdist[category][word]), end=' ') # print table cell
        print()                                                   # end the row
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...           (genre, word)
...           for genre in brown.categories()
...           for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> tabulate(cfd, modals, genres)
Category            can  could    may  might   must   will
news                 93     86     66     38     50    389
religion             82     59     78     12     54     71
hobbies             268     58    131     22     83    264
science_fiction      16     49      4     12      8     16
romance              74    193     11     51     45     43
humor                16     30      8      8      9     13

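Recall from the listing in 3.6 that we used the format string '{:{width}}' and bound a value to the width parameter in format(). This allows us to specify the width of a field using a variable.

>>> '{:{width}}'.format("Monty Python", width=15)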
'Monty Python   '

We could use this to automatically customize the column to be just wide enough to accommodate all the words, using width = max(len(w) for w in words).
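The following sketch puts the two ideas together on a tiny table. It is an illustration under stated assumptions, not the tabulate() function above: the nested dictionary `counts` stands in for a conditional frequency distribution and holds a few values copied from the table shown earlier, and the names `label_width` and `cell_width` are made up here.

```python
# A plain nested dict standing in for a conditional frequency distribution.
counts = {
    'news': {'can': 93, 'could': 86},
    'science_fiction': {'can': 16, 'could': 49},
}
words = ['can', 'could']

# The widest row label and the widest column heading set the field widths.
label_width = max(len(category) for category in counts)
cell_width = max(len(w) for w in words)

rows = []
header = '{:{lw}}'.format('Category', lw=label_width)
for w in words:
    header += ' {:>{cw}}'.format(w, cw=cell_width)        # right-align headings
rows.append(header)

for category in counts:
    row = '{:{lw}}'.format(category, lw=label_width)      # left-aligned row label
    for w in words:
        row += ' {:{cw}}'.format(counts[category][w], cw=cell_width)
    rows.append(row)

table = '\n'.join(rows)
print(table)
```

Building the rows as strings and joining them at the end keeps the layout logic separate from the act of printing, so the same table could just as easily be written to a file.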

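Writing Results to a File

We have seen how to read text from files (3.1). It is often useful to write output to files as well. The following code opens a file output.txt for writing, and saves the program output to the file.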

>>> output_file = open('output.txt', 'w')
>>> words = set(nltk.corpus.genesis.words('english-kjv.txt'))
>>> for word in sorted(words):
...     print(word, file=output_file)

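When we write non-text data to a file we must convert it to a string first. One way is the built-in str() function; format strings, as we saw above, are another. Let's write the total number of words to our file: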

>>> len(words)
2789
>>> str(len(words))
'2789'
>>> print(str(len(words)), file=output_file)
>>> output_file.close()
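A file opened for writing should be closed once we are done, so that any buffered output reaches the disk. One common way to guarantee this is a with block. The sketch below is self-contained and makes some assumptions for the sake of illustration: it uses a short made-up word list instead of the Genesis corpus, and writes into a temporary directory so that it leaves nothing behind.

```python
import os
import tempfile

words = ['in', 'the', 'beginning']   # hypothetical stand-in for a corpus vocabulary

# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'output.txt')

    # Leaving the with block closes the file and flushes buffered output.
    with open(path, 'w', encoding='utf-8') as output_file:
        for word in sorted(words):
            print(word, file=output_file)
        print(len(words), file=output_file)   # print() converts the int to a string

    # Read the file back to confirm what was written.
    with open(path, encoding='utf-8') as input_file:
        contents = input_file.read()

print(contents)
```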

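Caution!

You should avoid filenames that contain space characters, such as output file.txt, and filenames that are identical except for case distinctions, such as Output.txt and output.TXT.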

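Text Wrapping

When the output of our program is text-like rather than tabular, it will usually be necessary to wrap it so that it can be displayed conveniently. Consider the following output, which overflows its line and uses a complicated print() call: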

>>> saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
...           'more', 'is', 'said', 'than', 'done', '.']
>>> for word in saying:
...     print(word, '(' + str(len(word)) + '),', end=' ')
After (5), all (3), is (2), said (4), and (3), done (4), , (1), more (4), is (2), said (4), than (4), done (4), . (1),

We can take care of line wrapping with the help of Python's textwrap module. For maximum clarity we will put each step on its own line:

>>> from textwrap import fill
>>> fmt = '%s (%d),'    # named fmt to avoid shadowing the built-in format()
>>> pieces = [fmt % (word, len(word)) for word in saying]
>>> output = ' '.join(pieces)
>>> wrapped = fill(output)
>>> print(wrapped)
After (5), all (3), is (2), said (4), and (3), done (4), , (1), more
(4), is (2), said (4), than (4), done (4), . (1),

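Notice that there is a line break between more and the number that follows it. If we wanted to avoid this, we could redefine the format string so that it contains no spaces (e.g. '%s_(%d),'), and then, instead of printing the value of wrapped, print wrapped.replace('_', ' ').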
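The underscore trick can be sketched end to end as follows. This is a minimal illustration that assumes none of the words themselves contain an underscore, since the final replace() would turn those into spaces as well. The name `result` is introduced here for the example.

```python
from textwrap import fill

saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
          'more', 'is', 'said', 'than', 'done', '.']

# Use '_' as a temporary stand-in for the space inside each piece,
# so that fill() cannot break a line between a word and its count.
pieces = ['%s_(%d),' % (word, len(word)) for word in saying]
wrapped = fill(' '.join(pieces))

# Restore the spaces once the line breaks have been chosen.
result = wrapped.replace('_', ' ')
print(result)
```

Because fill() only breaks lines at whitespace here, each word now stays on the same line as its count.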