3.1.1.2 panda data-frame

我们将会在来自pandas模块的pandas.DataFrame中存储和操作这个数据。它是电子表格程序在Python中的一个等价物。它与2D numpy数据的区别在于列带有名字,可以在列中存储混合的数据类型,并且有精妙的选择和透视表机制。

3.1.1.2.1 创建dataframes: 读取数据文件或转化数组

从CSV文件读取: 使用上面的CSV文件,给出了大脑大小重量和IQ (Willerman et al. 1991) 的观察值 , 数据混合了数量值和类型值:

In [3]:

  1. import pandas
  2. data = pandas.read_csv('examples/brain_size.csv', sep=';', na_values=".")
  3. data

Out[3]:

Unnamed: 0GenderFSIQVIQPIQWeightHeightMRI_Count
01Female13313212411864.5816932
12Male140150124NaN72.51001121
23Male13912315014373.31038437
34Male13312912817268.8965353
45Female13713213414765.0951545
56Female999011014669.0928799
67Female13813613113864.5991305
78Female92909817566.0854258
89Male89938413466.3904858
910Male13311414717268.8955466
1011Female13212912411864.5833868
1112Male14115012815170.01079549
1213Male13512912415569.0924059
1314Female14012014715570.5856472
1415Female961009014666.0878897
1516Female83719613568.0865363
1617Female13213212012768.5852244
1718Male1009610217873.5945088
1819Female1011128413666.3808020
1920Male80778618070.0889083
2021Male838386NaNNaN892420
2122Male971078418676.5905940
2223Female13512913412262.0790619
2324Male13914512813268.0955003
2425Female918610211463.0831772
2526Male14114513117172.0935494
2627Female85908414068.0798612
2728Male1039611018777.01062462
2829Female77837210663.0793549
2930Female13012612415966.5866662
3031Female13312613212762.5857782
3132Male14414513719167.0949589
3233Male1039611019275.5997925
3334Male90968618169.0879987
3435Female83908114366.5834344
3536Female13312912815366.5948066
3637Male14015012414470.5949395
3738Female88869413964.5893983
3839Male81907414874.0930016
3940Male89918917975.5935863

分割符 它是CSV文件,但是分割符是”;”

缺失值 CSV中的第二个个体的weight是缺失的。如果我们没有指定缺失值 (NA = not available) 标记符, 我们将无法进行统计分析。

从数组中创建: pandas.DataFrame 也可以视为1D序列, 例如数组或列表的字典,如果我们有3个numpy数组:

In [4]:

  1. import numpy as np
  2. t = np.linspace(-6, 6, 20)
  3. sin_t = np.sin(t)
  4. cos_t = np.cos(t)

我们可以将他们暴露为pandas.DataFrame:

In [5]:

  1. pandas.DataFrame({'t': t, 'sin': sin_t, 'cos': cos_t})

Out[5]:

cossint
00.9601700.279415-6.000000
10.6099770.792419-5.368421
20.0244510.999701-4.736842
3-0.5705090.821291-4.105263
4-0.9453630.326021-3.473684
5-0.955488-0.295030-2.842105
6-0.596979-0.802257-2.210526
7-0.008151-0.999967-1.578947
80.583822-0.811882-0.947368
90.950551-0.310567-0.315789
100.9505510.3105670.315789
110.5838220.8118820.947368
12-0.0081510.9999671.578947
13-0.5969790.8022572.210526
14-0.9554880.2950302.842105
15-0.945363-0.3260213.473684
16-0.570509-0.8212914.105263
170.024451-0.9997014.736842
180.609977-0.7924195.368421
190.960170-0.2794156.000000

其他输入: pandas 可以从SQL、excel文件或者其他格式输入数。见pandas文档

3.1.1.2.2 操作数据

datapandas.DataFrame, 与R的dataframe类似:

In [6]:

  1. data.shape # 40行8列

Out[6]:

  1. (40, 8)

In [7]:

  1. data.columns # 有列

Out[7]:

  1. Index([u'Unnamed: 0', u'Gender', u'FSIQ', u'VIQ', u'PIQ', u'Weight', u'Height',
  2. u'MRI_Count'],
  3. dtype='object')

In [8]:

  1. print(data['Gender']) # 列可以用名字访问
  1. 0 Female
  2. 1 Male
  3. 2 Male
  4. 3 Male
  5. 4 Female
  6. 5 Female
  7. 6 Female
  8. 7 Female
  9. 8 Male
  10. 9 Male
  11. 10 Female
  12. 11 Male
  13. 12 Male
  14. 13 Female
  15. 14 Female
  16. 15 Female
  17. 16 Female
  18. 17 Male
  19. 18 Female
  20. 19 Male
  21. 20 Male
  22. 21 Male
  23. 22 Female
  24. 23 Male
  25. 24 Female
  26. 25 Male
  27. 26 Female
  28. 27 Male
  29. 28 Female
  30. 29 Female
  31. 30 Female
  32. 31 Male
  33. 32 Male
  34. 33 Male
  35. 34 Female
  36. 35 Female
  37. 36 Male
  38. 37 Female
  39. 38 Male
  40. 39 Male
  41. Name: Gender, dtype: object

In [9]:

  1. # 简单选择器
  2. data[data['Gender'] == 'Female']['VIQ'].mean()

Out[9]:

  1. 109.45

注意: 对于一个大dataframe的快速预览,用它的describe方法: pandas.DataFrame.describe()

groupby: 根据类别变量的值拆分dataframe:

In [10]:

  1. groupby_gender = data.groupby('Gender')
  2. for gender, value in groupby_gender['VIQ']:
  3. print((gender, value.mean()))
  1. ('Female', 109.45)
  2. ('Male', 115.25)

groupby_gender是一个强力的对象,暴露了结果dataframes组的许多操作:

In [11]:

  1. groupby_gender.mean()

Out[11]:

Unnamed: 0FSIQVIQPIQWeightHeightMRI_Count
Gender
Female19.65111.9109.45110.45137.20000065.765000862654.6
Male21.35115.0115.25111.60166.44444471.431579954855.4

groupby_gender上使用tab-完成来查找更多。其他的常见分组函数是median, count (对于检查不同子集的缺失值数量很有用) 或sum。Groupby评估是懒惰模式,因为在应用聚合函数之前不会进行什么工作。

练习

  • 完整人口VIO的平均值是多少?
  • 这项研究中包含了多少男性 / 女性?
  • 提示 使用‘tab完成’来寻找可以调用的方法, 替换在上面例子中的‘mean’。
  • 对于男性和女性来说,以log为单位显示的MRI count平均值是多少?

3.1.1.2 panda data-frame - 图1

注意: 上面的绘图中使用了groupby_gender.boxplot (见这个例子)。

3.1.1.2.3 绘制数据

Pandas提供一些绘图工具 (pandas.tools.plotting, 后面使用的是matplotlib) 来显示在dataframes数据的统计值:

散点图矩阵:

In [15]:

  1. from pandas.tools import plotting
  2. plotting.scatter_matrix(data[['Weight', 'Height', 'MRI_Count']])

Out[15]:

  1. array([[<matplotlib.axes._subplots.AxesSubplot object at 0x105c34810>,
  2. <matplotlib.axes._subplots.AxesSubplot object at 0x10a0ade10>,
  3. <matplotlib.axes._subplots.AxesSubplot object at 0x10a2d80d0>],
  4. [<matplotlib.axes._subplots.AxesSubplot object at 0x10a33b210>,
  5. <matplotlib.axes._subplots.AxesSubplot object at 0x10a3be450>,
  6. <matplotlib.axes._subplots.AxesSubplot object at 0x10a40d9d0>],
  7. [<matplotlib.axes._subplots.AxesSubplot object at 0x10a49dc10>,
  8. <matplotlib.axes._subplots.AxesSubplot object at 0x10a51f850>,
  9. <matplotlib.axes._subplots.AxesSubplot object at 0x10a5902d0>]], dtype=object)
  1. /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  2. if self._edgecolors == str('face'):

3.1.1.2 panda data-frame - 图2

In [16]:

  1. plotting.scatter_matrix(data[['PIQ', 'VIQ', 'FSIQ']])

Out[16]:

  1. array([[<matplotlib.axes._subplots.AxesSubplot object at 0x10a918b50>,
  2. <matplotlib.axes._subplots.AxesSubplot object at 0x10aa38710>,
  3. <matplotlib.axes._subplots.AxesSubplot object at 0x10ab29910>],
  4. [<matplotlib.axes._subplots.AxesSubplot object at 0x10ab8e790>,
  5. <matplotlib.axes._subplots.AxesSubplot object at 0x10ae207d0>,
  6. <matplotlib.axes._subplots.AxesSubplot object at 0x10abbd090>],
  7. [<matplotlib.axes._subplots.AxesSubplot object at 0x10af140d0>,
  8. <matplotlib.axes._subplots.AxesSubplot object at 0x10af89cd0>,
  9. <matplotlib.axes._subplots.AxesSubplot object at 0x10affa410>]], dtype=object)

3.1.1.2 panda data-frame - 图3

两个总体

IQ指标是双峰的, 似乎有两个子总体。

练习

只绘制男性的散点图矩阵,然后是只有女性的。你是否认为2个子总体与性别相关?