14.1 Bitly Data from USA.gov

In 2011, the URL-shortening service Bitly partnered with the US government website USA.gov to provide a feed of anonymous data gathered from users who shorten links ending with .gov or .mil. In 2011, in addition to a live feed, hourly snapshots were available as downloadable text files. At the time of this writing (2017), the service has been shut down, but we preserved one of the data files for the book's examples.

In the case of the hourly snapshots, each line in each file is in JSON (short for JavaScript Object Notation), a common form of web data. For example, if we read just the first line of a file, we may see something like this:

  In [5]: path = 'datasets/bitly_usagov/example.txt'

  In [6]: open(path).readline()
  Out[6]: '{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11
  (KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1,
  "tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l":
  "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r":
  "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u":
  "http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc":
  1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'

Python has both built-in and third-party modules for converting a JSON string into a Python dictionary. Here I'll use the json module and its loads function, invoked on each line of the downloaded data file:

  import json
  path = 'datasets/bitly_usagov/example.txt'
  records = [json.loads(line) for line in open(path)]
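If the downloaded file is not at hand, the same line-by-line pattern can be exercised on an in-memory buffer; the two JSON records below are made up for illustration:

```python
import json
from io import StringIO

# Stand-in for the downloaded file: one JSON record per line (made-up data).
buf = StringIO('{"tz": "America/New_York", "a": "Mozilla/5.0"}\n'
               '{"tz": "Europe/London", "a": "Opera/9.80"}\n')

# Same pattern as above: parse each line into a dict.
records = [json.loads(line) for line in buf]
print(records[0]['tz'])  # America/New_York
```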

The resulting object records is now a list of Python dictionaries:

  In [18]: records[0]
  Out[18]:
  {'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko)
  Chrome/17.0.963.78 Safari/535.11',
   'al': 'en-US,en;q=0.8',
   'c': 'US',
   'cy': 'Danvers',
   'g': 'A6qOVH',
   'gr': 'MA',
   'h': 'wfLQtf',
   'hc': 1331822918,
   'hh': '1.usa.gov',
   'l': 'orofrog',
   'll': [42.576698, -70.954903],
   'nk': 1,
   'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
   't': 1331923247,
   'tz': 'America/New_York',
   'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}

Counting Time Zones in Pure Python

Suppose we were interested in finding the time zones that occur most often in the dataset (the tz field). There are many ways we could do this. First, let's extract a list of time zones using a list comprehension:

  In [12]: time_zones = [rec['tz'] for rec in records]
  ---------------------------------------------------------------------------
  KeyError                                  Traceback (most recent call last)
  <ipython-input-12-db4fbd348da9> in <module>()
  ----> 1 time_zones = [rec['tz'] for rec in records]
  <ipython-input-12-db4fbd348da9> in <listcomp>(.0)
  ----> 1 time_zones = [rec['tz'] for rec in records]
  KeyError: 'tz'

Oops! It turns out that not all of the records have a time zone field. This is easy to handle, as we can add the check if 'tz' in rec at the end of the list comprehension:

  In [13]: time_zones = [rec['tz'] for rec in records if 'tz' in rec]

  In [14]: time_zones[:10]
  Out[14]:
  ['America/New_York',
   'America/Denver',
   'America/New_York',
   'America/Sao_Paulo',
   'America/New_York',
   'America/New_York',
   'Europe/Warsaw',
   '',
   '',
   '']

Just looking at the first 10 time zones, we see that some of them are unknown (empty strings). You can filter these out, but I'll leave them in for now. Next, to produce counts by time zone, I'll show two approaches: a harder way (using just the Python standard library) and a simpler way (using pandas). One way to do the counting is to use a dictionary to store counts while we iterate through the time zones:


  def get_counts(sequence):
      counts = {}
      for x in sequence:
          if x in counts:
              counts[x] += 1
          else:
              counts[x] = 1
      return counts

Using more advanced tools in the Python standard library, you can write the same thing more briefly:

  from collections import defaultdict

  def get_counts2(sequence):
      counts = defaultdict(int)  # values will initialize to 0
      for x in sequence:
          counts[x] += 1
      return counts
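As a quick sanity check, the defaultdict version behaves as expected on a toy sequence (the input list here is made up):

```python
from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int)  # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts

counts = get_counts2(['a', 'b', 'a', 'c', 'a', 'b'])
print(dict(counts))  # {'a': 3, 'b': 2, 'c': 1}
```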

I put the logic in a function to make it more reusable. To use it on the time zones, just pass the time_zones list:

  In [17]: counts = get_counts(time_zones)

  In [18]: counts['America/New_York']
  Out[18]: 1251

  In [19]: len(time_zones)
  Out[19]: 3440

If we wanted the top 10 time zones and their counts, we need a bit of dictionary acrobatics:

  def top_counts(count_dict, n=10):
      value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
      value_key_pairs.sort()
      return value_key_pairs[-n:]

We then have:

  In [21]: top_counts(counts)
  Out[21]:
  [(33, 'America/Sao_Paulo'),
   (35, 'Europe/Madrid'),
   (36, 'Pacific/Honolulu'),
   (37, 'Asia/Tokyo'),
   (74, 'Europe/London'),
   (191, 'America/Denver'),
   (382, 'America/Los_Angeles'),
   (400, 'America/Chicago'),
   (521, ''),
   (1251, 'America/New_York')]

If you search the Python standard library, you may find the collections.Counter class, which makes this task even simpler:

  In [22]: from collections import Counter

  In [23]: counts = Counter(time_zones)

  In [24]: counts.most_common(10)
  Out[24]:
  [('America/New_York', 1251),
   ('', 521),
   ('America/Chicago', 400),
   ('America/Los_Angeles', 382),
   ('America/Denver', 191),
   ('Europe/London', 74),
   ('Asia/Tokyo', 37),
   ('Pacific/Honolulu', 36),
   ('Europe/Madrid', 35),
   ('America/Sao_Paulo', 33)]

Counting Time Zones with pandas

Creating a DataFrame from the original set of records is as simple as passing the list of records to pandas.DataFrame:

  In [25]: import pandas as pd

  In [26]: frame = pd.DataFrame(records)

  In [27]: frame.info()
  <class 'pandas.core.frame.DataFrame'>
  RangeIndex: 3560 entries, 0 to 3559
  Data columns (total 18 columns):
  _heartbeat_    120 non-null float64
  a              3440 non-null object
  al             3094 non-null object
  c              2919 non-null object
  cy             2919 non-null object
  g              3440 non-null object
  gr             2919 non-null object
  h              3440 non-null object
  hc             3440 non-null float64
  hh             3440 non-null object
  kw             93 non-null object
  l              3440 non-null object
  ll             2919 non-null object
  nk             3440 non-null float64
  r              3440 non-null object
  t              3440 non-null float64
  tz             3440 non-null object
  u              3440 non-null object
  dtypes: float64(4), object(14)
  memory usage: 500.7+ KB

  In [28]: frame['tz'][:10]
  Out[28]:
  0     America/New_York
  1       America/Denver
  2     America/New_York
  3    America/Sao_Paulo
  4     America/New_York
  5     America/New_York
  6        Europe/Warsaw
  7
  8
  9
  Name: tz, dtype: object

The output shown for frame is the summary view, which pandas shows for large DataFrame objects. We can then use the value_counts method on the Series:

  In [29]: tz_counts = frame['tz'].value_counts()

  In [30]: tz_counts[:10]
  Out[30]:
  America/New_York       1251
                          521
  America/Chicago         400
  America/Los_Angeles     382
  America/Denver          191
  Europe/London            74
  Asia/Tokyo               37
  Pacific/Honolulu         36
  Europe/Madrid            35
  America/Sao_Paulo        33
  Name: tz, dtype: int64

We can visualize this data using matplotlib. To do so, we first fill in a substitute value for the unknown or missing time zones in the records. The fillna method can replace missing (NA) values, while the unknown values (empty strings) can be replaced using boolean array indexing:

  In [31]: clean_tz = frame['tz'].fillna('Missing')

  In [32]: clean_tz[clean_tz == ''] = 'Unknown'

  In [33]: tz_counts = clean_tz.value_counts()

  In [34]: tz_counts[:10]
  Out[34]:
  America/New_York       1251
  Unknown                 521
  America/Chicago         400
  America/Los_Angeles     382
  America/Denver          191
  Missing                 120
  Europe/London            74
  Asia/Tokyo               37
  Pacific/Honolulu         36
  Europe/Madrid            35
  Name: tz, dtype: int64
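The same cleanup can also be written as a single chain using Series.replace; here is a sketch on a made-up Series standing in for frame['tz']:

```python
import numpy as np
import pandas as pd

# Made-up Series with one NA value and one empty string.
tz = pd.Series(['America/New_York', '', np.nan, 'Europe/London'])

# fillna handles missing values; replace handles the empty strings.
clean_tz = tz.fillna('Missing').replace('', 'Unknown')
print(clean_tz.tolist())
# ['America/New_York', 'Unknown', 'Missing', 'Europe/London']
```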

At this point, we can make a horizontal bar plot using the seaborn package (see Figure 14-1):

  In [36]: import seaborn as sns

  In [37]: subset = tz_counts[:10]

  In [38]: sns.barplot(y=subset.index, x=subset.values)

Figure 14-1. Top time zones in the usa.gov sample data

The a field contains information about the browser, device, or application used to perform the URL shortening:

  In [39]: frame['a'][1]
  Out[39]: 'GoogleMaps/RochesterNY'

  In [40]: frame['a'][50]
  Out[40]: 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'

  In [41]: frame['a'][51][:50]  # long line
  Out[41]: 'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P9'

Parsing all of the interesting information in these "agent" strings may seem like a daunting task. One possible strategy is to split off the first token in the string (corresponding roughly to the browser capability) and make another summary of the user behavior:

  In [42]: results = pd.Series([x.split()[0] for x in frame.a.dropna()])

  In [43]: results[:5]
  Out[43]:
  0               Mozilla/5.0
  1    GoogleMaps/RochesterNY
  2               Mozilla/4.0
  3               Mozilla/5.0
  4               Mozilla/5.0
  dtype: object

  In [44]: results.value_counts()[:8]
  Out[44]:
  Mozilla/5.0                 2594
  Mozilla/4.0                  601
  GoogleMaps/RochesterNY       121
  Opera/9.80                    34
  TEST_INTERNET_AGENT           24
  GoogleProducer                21
  Mozilla/6.0                    5
  BlackBerry8520/5.0.0.681       4
  dtype: int64

Now, suppose you wanted to decompose the top time zones into Windows and non-Windows users. As a simplification, let's say that a user is on Windows if the string 'Windows' appears in the agent string. Since some of the agents are missing, we first exclude those records from the data:

  In [45]: cframe = frame[frame.a.notnull()]

Then we can compute a value for whether each row is Windows or not:

  In [46]: import numpy as np

  In [47]: cframe['os'] = np.where(cframe['a'].str.contains('Windows'),
     ....:                         'Windows', 'Not Windows')

  In [48]: cframe['os'][:5]
  Out[48]:
  0        Windows
  1    Not Windows
  2        Windows
  3    Not Windows
  4        Windows
  Name: os, dtype: object
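The np.where step can be seen in isolation on a few made-up agent strings:

```python
import numpy as np
import pandas as pd

# Made-up agent strings for illustration.
agents = pd.Series(['Mozilla/5.0 (Windows NT 6.1; WOW64)',
                    'GoogleMaps/RochesterNY',
                    'Mozilla/5.0 (Linux; U; Android 2.2.2)'])

# Vectorized if/else: label each row by a substring test.
os_labels = np.where(agents.str.contains('Windows'),
                     'Windows', 'Not Windows')
print(list(os_labels))  # ['Windows', 'Not Windows', 'Not Windows']
```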

Then we can group the data by its time zone column and this new operating system column:

  In [49]: by_tz_os = cframe.groupby(['tz', 'os'])

The group counts, analogous to the value_counts function, can be computed with size, and the result reshaped into a table with unstack:

  In [50]: agg_counts = by_tz_os.size().unstack().fillna(0)

  In [51]: agg_counts[:10]
  Out[51]:
  os                              Not Windows  Windows
  tz
                                        245.0    276.0
  Africa/Cairo                            0.0      3.0
  Africa/Casablanca                       0.0      1.0
  Africa/Ceuta                            0.0      2.0
  Africa/Johannesburg                     0.0      1.0
  Africa/Lusaka                           0.0      1.0
  America/Anchorage                       4.0      1.0
  America/Argentina/Buenos_Aires          1.0      0.0
  America/Argentina/Cordoba               0.0      1.0
  America/Argentina/Mendoza               0.0      1.0

Finally, let's select the top overall time zones. To do so, I construct an indirect index array from the row counts in agg_counts:

  # Use to sort in ascending order
  In [52]: indexer = agg_counts.sum(1).argsort()

  In [53]: indexer[:10]
  Out[53]:
  tz
                                    24
  Africa/Cairo                      20
  Africa/Casablanca                 21
  Africa/Ceuta                      92
  Africa/Johannesburg               87
  Africa/Lusaka                     53
  America/Anchorage                 54
  America/Argentina/Buenos_Aires    57
  America/Argentina/Cordoba         26
  America/Argentina/Mendoza         55
  dtype: int64
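To see why this works, here is the indirect-sort trick on a tiny made-up Series: argsort returns the positions that would sort the values ascending, and take reorders the rows by those positions, so the largest rows end up last:

```python
import pandas as pd

totals = pd.Series([30, 10, 20], index=['b', 'c', 'a'])

# Positions that would sort the values in ascending order.
indexer = totals.argsort()
print(list(indexer))  # [1, 2, 0]

# Reorder rows by those positions: smallest first, largest last.
ordered = totals.take(indexer)
print(ordered.index.tolist())  # ['c', 'a', 'b']
```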

I then use take to select the rows in that order and slice off the last 10 rows (the largest values):

  In [54]: count_subset = agg_counts.take(indexer[-10:])

  In [55]: count_subset
  Out[55]:
  os                   Not Windows  Windows
  tz
  America/Sao_Paulo           13.0     20.0
  Europe/Madrid               16.0     19.0
  Pacific/Honolulu             0.0     36.0
  Asia/Tokyo                   2.0     35.0
  Europe/London               43.0     31.0
  America/Denver             132.0     59.0
  America/Los_Angeles        130.0    252.0
  America/Chicago            115.0    285.0
                             245.0    276.0
  America/New_York           339.0    912.0

pandas has a convenience method called nlargest that does the same thing:

  In [56]: agg_counts.sum(1).nlargest(10)
  Out[56]:
  tz
  America/New_York       1251.0
                          521.0
  America/Chicago         400.0
  America/Los_Angeles     382.0
  America/Denver          191.0
  Europe/London            74.0
  Asia/Tokyo               37.0
  Pacific/Honolulu         36.0
  Europe/Madrid            35.0
  America/Sao_Paulo        33.0
  dtype: float64

Then, as shown in the following code, this data can be plotted in a bar plot. I pass an additional argument to seaborn's barplot function to draw a grouped bar plot (see Figure 14-2):

  # Rearrange the data for plotting
  In [58]: count_subset = count_subset.stack()

  In [59]: count_subset.name = 'total'

  In [60]: count_subset = count_subset.reset_index()

  In [61]: count_subset[:10]
  Out[61]:
                    tz           os  total
  0  America/Sao_Paulo  Not Windows   13.0
  1  America/Sao_Paulo      Windows   20.0
  2      Europe/Madrid  Not Windows   16.0
  3      Europe/Madrid      Windows   19.0
  4   Pacific/Honolulu  Not Windows    0.0
  5   Pacific/Honolulu      Windows   36.0
  6         Asia/Tokyo  Not Windows    2.0
  7         Asia/Tokyo      Windows   35.0
  8      Europe/London  Not Windows   43.0
  9      Europe/London      Windows   31.0

  In [62]: sns.barplot(x='total', y='tz', hue='os', data=count_subset)

Figure 14-2. Top time zones by Windows and non-Windows users

It is a bit difficult in this plot to make out the relative percentage of Windows users in the smaller groups, so let's normalize each group's percentages to sum to 1:

  def norm_total(group):
      group['normed_total'] = group.total / group.total.sum()
      return group

  results = count_subset.groupby('tz').apply(norm_total)

Then plot again (see Figure 14-3):

  In [65]: sns.barplot(x='normed_total', y='tz', hue='os', data=results)

Figure 14-3. Percentage of Windows and non-Windows users in the top occurring time zones

We could also have computed the normalized sum more efficiently with the transform method of groupby:

  In [66]: g = count_subset.groupby('tz')

  In [67]: results2 = count_subset.total / g.total.transform('sum')
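On a small made-up frame, transform('sum') broadcasts each group's sum back to that group's rows, so a single vectorized division reproduces the normalized column without a per-group apply:

```python
import pandas as pd

# Made-up stand-in for count_subset: two time zones, two rows each.
df = pd.DataFrame({'tz': ['a', 'a', 'b', 'b'],
                   'total': [1.0, 3.0, 2.0, 3.0]})

# Each group's sum, aligned back to the original rows.
group_sums = df.groupby('tz').total.transform('sum')
print(group_sums.tolist())  # [4.0, 4.0, 5.0, 5.0]

# One vectorized division normalizes every row at once.
normed = df.total / group_sums
print(normed.tolist())  # [0.25, 0.75, 0.4, 0.6]
```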