Statistical Functions

Percent Change

Series, DataFrame, and Panel all have a pct_change() method to compute the percent change over a given number of periods (using fill_method to fill NA/null values before computing the percent change).

    In [1]: ser = pd.Series(np.random.randn(8))

    In [2]: ser.pct_change()
    Out[2]:
    0         NaN
    1   -1.602976
    2    4.334938
    3   -0.247456
    4   -2.067345
    5   -1.142903
    6   -1.688214
    7   -9.759729
    dtype: float64
    In [3]: df = pd.DataFrame(np.random.randn(10, 4))

    In [4]: df.pct_change(periods=3)
    Out[4]:
              0         1         2         3
    0       NaN       NaN       NaN       NaN
    1       NaN       NaN       NaN       NaN
    2       NaN       NaN       NaN       NaN
    3 -0.218320 -1.054001  1.987147 -0.510183
    4 -0.439121 -1.816454  0.649715 -4.822809
    5 -0.127833 -3.042065 -5.866604 -1.776977
    6 -2.596833 -1.959538 -2.111697 -3.798900
    7 -0.117826 -2.169058  0.036094 -0.067696
    8  2.492606 -1.357320 -1.205802 -1.558697
    9 -1.012977  2.324558 -1.003744 -0.371806
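The fill_method handling mentioned above can be made explicit by filling missing values yourself before computing the change. A minimal sketch with made-up data (not the random values above):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, np.nan, 4.0])

# Forward-filling first makes the NA handling explicit: the NaN becomes
# 2.0, so the change at position 2 is 0% and at position 3 is +100%.
filled = ser.ffill().pct_change()
# filled is [NaN, 1.0, 0.0, 1.0]
```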

Covariance

Series.cov() can be used to compute the covariance between two Series (excluding missing values).

    In [5]: s1 = pd.Series(np.random.randn(1000))

    In [6]: s2 = pd.Series(np.random.randn(1000))

    In [7]: s1.cov(s2)
    Out[7]: 0.00068010881743108204
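The pairwise-deletion behaviour can be checked by hand against the usual unbiased estimator. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, 4.0])
s2 = pd.Series([2.0, np.nan, 6.0, 8.0])

# cov() drops pairs where either value is missing, then applies the
# unbiased (N - 1) covariance estimator to the remaining pairs.
mask = s1.notna() & s2.notna()
x, y = s1[mask], s2[mask]
manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

assert abs(s1.cov(s2) - manual) < 1e-12
```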

Analogously, DataFrame.cov() computes the pairwise covariances among the series in the DataFrame, also excluding NA/null values.

Note: Assuming the missing data are missing at random this results in an estimate for the covariance matrix which is unbiased. However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.

    In [8]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])

    In [9]: frame.cov()
    Out[9]:
              a         b         c         d         e
    a  1.000882 -0.003177 -0.002698 -0.006889  0.031912
    b -0.003177  1.024721  0.000191  0.009212  0.000857
    c -0.002698  0.000191  0.950735 -0.031743 -0.005087
    d -0.006889  0.009212 -0.031743  1.002983 -0.047952
    e  0.031912  0.000857 -0.005087 -0.047952  1.042487

DataFrame.cov also supports an optional min_periods keyword that specifies the required minimum number of observations for each column pair in order to have a valid result.

    In [10]: frame = pd.DataFrame(np.random.randn(20, 3), columns=['a', 'b', 'c'])

    In [11]: frame.loc[frame.index[:5], 'a'] = np.nan

    In [12]: frame.loc[frame.index[5:10], 'b'] = np.nan

    In [13]: frame.cov()
    Out[13]:
              a         b         c
    a  1.123670 -0.412851  0.018169
    b -0.412851  1.154141  0.305260
    c  0.018169  0.305260  1.301149

    In [14]: frame.cov(min_periods=12)
    Out[14]:
              a         b         c
    a  1.123670       NaN  0.018169
    b       NaN  1.154141  0.305260
    c  0.018169  0.305260  1.301149

Correlation

Correlation may be computed using the corr() method. Using the method parameter, several methods for computing correlations are provided:

Method name | Description
--- | ---
pearson (default) | Standard correlation coefficient
kendall | Kendall Tau correlation coefficient
spearman | Spearman rank correlation coefficient

All of these are currently computed using pairwise complete observations. Wikipedia has articles covering each of these correlation coefficients.

Note: Please see the caveats associated with this method of calculating correlation matrices in the covariance section.

    In [15]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])

    In [16]: frame.iloc[::2] = np.nan

    # Series with Series
    In [17]: frame['a'].corr(frame['b'])
    Out[17]: 0.013479040400098794

    In [18]: frame['a'].corr(frame['b'], method='spearman')
    Out[18]: -0.0072898851595406371

    # Pairwise correlation of DataFrame columns
    In [19]: frame.corr()
    Out[19]:
              a         b         c         d         e
    a  1.000000  0.013479 -0.049269 -0.042239 -0.028525
    b  0.013479  1.000000 -0.020433 -0.011139  0.005654
    c -0.049269 -0.020433  1.000000  0.018587 -0.054269
    d -0.042239 -0.011139  0.018587  1.000000 -0.017060
    e -0.028525  0.005654 -0.054269 -0.017060  1.000000
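The practical difference between the correlation methods shows up with monotonic but nonlinear data. A small illustrative sketch:

```python
import pandas as pd

# y = x ** 3 is monotonic but nonlinear, so Spearman's rank
# correlation is exactly 1 while Pearson's is somewhat lower.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

pearson = x.corr(y)                       # about 0.94
spearman = x.corr(y, method='spearman')   # exactly 1.0
```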

Note that non-numeric columns will be automatically excluded from the correlation calculation.
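A version-independent way to see this is to select the numeric columns explicitly (recent pandas versions may require numeric_only=True rather than dropping object columns silently). A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                   'y': [4.0, 3.0, 2.0, 1.0],
                   'label': ['a', 'b', 'c', 'd']})

# Only the numeric columns take part in the correlation matrix.
corr = df.select_dtypes(include='number').corr()
# corr is a 2x2 matrix over 'x' and 'y'; here x and y are perfectly
# anti-correlated, so the off-diagonal entries are -1.
```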

Like cov, corr also supports the optional min_periods keyword:

    In [20]: frame = pd.DataFrame(np.random.randn(20, 3), columns=['a', 'b', 'c'])

    In [21]: frame.loc[frame.index[:5], 'a'] = np.nan

    In [22]: frame.loc[frame.index[5:10], 'b'] = np.nan

    In [23]: frame.corr()
    Out[23]:
              a         b         c
    a  1.000000 -0.121111  0.069544
    b -0.121111  1.000000  0.051742
    c  0.069544  0.051742  1.000000

    In [24]: frame.corr(min_periods=12)
    Out[24]:
              a         b         c
    a  1.000000       NaN  0.069544
    b       NaN  1.000000  0.051742
    c  0.069544  0.051742  1.000000

A related method corrwith() is implemented on DataFrame to compute the correlation between like-labeled Series contained in different DataFrame objects.

    In [25]: index = ['a', 'b', 'c', 'd', 'e']

    In [26]: columns = ['one', 'two', 'three', 'four']

    In [27]: df1 = pd.DataFrame(np.random.randn(5, 4), index=index, columns=columns)

    In [28]: df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns)

    In [29]: df1.corrwith(df2)
    Out[29]:
    one     -0.125501
    two     -0.493244
    three    0.344056
    four     0.004183
    dtype: float64

    In [30]: df2.corrwith(df1, axis=1)
    Out[30]:
    a   -0.675817
    b    0.458296
    c    0.190809
    d   -0.186275
    e         NaN
    dtype: float64
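corrwith() aligns on labels before correlating, which the following sketch (with hypothetical data) makes visible:

```python
import pandas as pd

df1 = pd.DataFrame({'one': [1.0, 2.0, 3.0, 4.0]},
                   index=['a', 'b', 'c', 'd'])

# Same values under the same labels, but stored in a different row
# order: corrwith() aligns on the index first, so the correlation is 1.
df2 = pd.DataFrame({'one': [4.0, 1.0, 2.0, 3.0]},
                   index=['d', 'a', 'b', 'c'])

result = df1.corrwith(df2)
```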

Data ranking

The rank() method produces a data ranking with ties being assigned the mean of the ranks (by default) for the group:

    In [31]: s = pd.Series(np.random.randn(5), index=list('abcde'))

    In [32]: s['d'] = s['b']  # so there's a tie

    In [33]: s.rank()
    Out[33]:
    a    5.0
    b    2.5
    c    1.0
    d    2.5
    e    4.0
    dtype: float64

rank() is also a DataFrame method and can rank either the rows (axis=0) or the columns (axis=1). NaN values are excluded from the ranking.

    In [34]: df = pd.DataFrame(np.random.randn(10, 6))

    In [35]: df[4] = df[2][:5]  # some ties

    In [36]: df
    Out[36]:
              0         1         2         3         4         5
    0 -0.904948 -1.163537 -1.457187  0.135463 -1.457187  0.294650
    1 -0.976288 -0.244652 -0.748406 -0.999601 -0.748406 -0.800809
    2  0.401965  1.460840  1.256057  1.308127  1.256057  0.876004
    3  0.205954  0.369552 -0.669304  0.038378 -0.669304  1.140296
    4 -0.477586 -0.730705 -1.129149 -0.601463 -1.129149 -0.211196
    5 -1.092970 -0.689246  0.908114  0.204848       NaN  0.463347
    6  0.376892  0.959292  0.095572 -0.593740       NaN -0.069180
    7 -1.002601  1.957794 -0.120708  0.094214       NaN -1.467422
    8 -0.547231  0.664402 -0.519424 -0.073254       NaN -1.263544
    9 -0.250277 -0.237428 -1.056443  0.419477       NaN  1.375064

    In [37]: df.rank(1)
    Out[37]:
         0    1    2    3    4    5
    0  4.0  3.0  1.5  5.0  1.5  6.0
    1  2.0  6.0  4.5  1.0  4.5  3.0
    2  1.0  6.0  3.5  5.0  3.5  2.0
    3  4.0  5.0  1.5  3.0  1.5  6.0
    4  5.0  3.0  1.5  4.0  1.5  6.0
    5  1.0  2.0  5.0  3.0  NaN  4.0
    6  4.0  5.0  3.0  1.0  NaN  2.0
    7  2.0  5.0  3.0  4.0  NaN  1.0
    8  2.0  5.0  3.0  4.0  NaN  1.0
    9  2.0  3.0  1.0  4.0  NaN  5.0

rank() optionally takes a parameter ascending, which is True by default; when False, data is reverse-ranked, with larger values assigned a smaller rank.

rank() supports different tie-breaking methods, specified with the method parameter:

  • average : average rank of tied group
  • min : lowest rank in the group
  • max : highest rank in the group
  • first : ranks assigned in the order they appear in the array
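The four tie-breaking methods (and ascending=False) can be compared on a tiny Series with one tie; a minimal sketch:

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 3.0, 2.0])  # the two 3.0 values are tied

# The tied pair occupies ranks 3 and 4; each method resolves it differently.
s.rank()                    # average: both tied values get 3.5
s.rank(method='min')        # both get 3.0
s.rank(method='max')        # both get 4.0
s.rank(method='first')      # 3.0 then 4.0, in order of appearance
s.rank(ascending=False)     # reverse-ranked: the tied 3.0s get 1.5
```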