Descriptive statistics

A large number of methods for computing descriptive statistics and other related operations are available on Series, DataFrame, and Panel. Most of these are aggregations (hence producing a lower-dimensional result), like sum(), mean(), and quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, …}, but the axis can be specified by name or integer:

  • Series: no axis argument needed
  • DataFrame: “index” (axis=0, default), “columns” (axis=1)
  • Panel: “items” (axis=0), “major” (axis=1, default), “minor” (axis=2)

For example:

    In [77]: df
    Out[77]:
            one       two     three
    a -1.101558  1.124472       NaN
    b -0.177289  2.487104 -0.634293
    c  0.462215 -0.486066  1.931194
    d       NaN -0.456288 -1.222918

    In [78]: df.mean(0)
    Out[78]:
    one     -0.272211
    two      0.667306
    three    0.024661
    dtype: float64

    In [79]: df.mean(1)
    Out[79]:
    a    0.011457
    b    0.558507
    c    0.635781
    d   -0.839603
    dtype: float64
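
The axis can equally be specified by name. A minimal sketch, equivalent to the integer-axis calls above (output omitted, since it matches In [78] and In [79]):

    # axis given by name instead of integer
    df.mean(axis='index')    # same as df.mean(0): one value per column
    df.mean(axis='columns')  # same as df.mean(1): one value per row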

All such methods have a skipna option signaling whether to exclude missing data (True by default):

    In [80]: df.sum(0, skipna=False)
    Out[80]:
    one           NaN
    two      2.669223
    three         NaN
    dtype: float64

    In [81]: df.sum(axis=1, skipna=True)
    Out[81]:
    a    0.022914
    b    1.675522
    c    1.907343
    d   -1.679206
    dtype: float64

Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like standardization (rendering data zero mean and standard deviation 1), very concisely:

    In [82]: ts_stand = (df - df.mean()) / df.std()

    In [83]: ts_stand.std()
    Out[83]:
    one      1.0
    two      1.0
    three    1.0
    dtype: float64

    In [84]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

    In [85]: xs_stand.std(1)
    Out[85]:
    a    1.0
    b    1.0
    c    1.0
    d    1.0
    dtype: float64

Note that methods like cumsum() and cumprod() preserve the location of NaN values. This is somewhat different from expanding() and rolling(). For more details please see this note.

    In [86]: df.cumsum()
    Out[86]:
            one       two     three
    a -1.101558  1.124472       NaN
    b -1.278848  3.611576 -0.634293
    c -0.816633  3.125511  1.296901
    d       NaN  2.669223  0.073983
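
To make the contrast with expanding() and rolling() concrete, here is a minimal sketch on a small hypothetical Series: cumsum() leaves the NaN where it was, while expanding().sum() (with its default min_periods=1) produces a value at that position from the data seen so far.

    s = pd.Series([1.0, np.nan, 2.0])
    s.cumsum()           # 1.0, NaN, 3.0 -- the NaN stays in place
    s.expanding().sum()  # 1.0, 1.0, 3.0 -- the NaN position is filled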

Here is a quick reference summary table of common functions. Each also takes an optional level parameter which applies only if the object has a hierarchical index (MultiIndex); a short sketch follows the table.

Function    Description
count       Number of non-NA observations
sum         Sum of values
mean        Mean of values
mad         Mean absolute deviation
median      Arithmetic median of values
min         Minimum
max         Maximum
mode        Mode
abs         Absolute value
prod        Product of values
std         Bessel-corrected sample standard deviation
var         Unbiased variance
sem         Standard error of the mean
skew        Sample skewness (3rd moment)
kurt        Sample kurtosis (4th moment)
quantile    Sample quantile (value at %)
cumsum      Cumulative sum
cumprod     Cumulative product
cummax      Cumulative maximum
cummin      Cumulative minimum
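
The level parameter mentioned above restricts the reduction to one level of a hierarchical index. A minimal sketch with a hypothetical two-level Series (in recent pandas versions the groupby spelling is the recommended form):

    idx = pd.MultiIndex.from_product([['x', 'y'], [1, 2]], names=['outer', 'inner'])
    s = pd.Series([1.0, 2.0, 3.0, np.nan], index=idx)
    s.sum(level='outer')            # one sum per 'outer' label
    s.groupby(level='outer').sum()  # equivalent groupby spelling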

Note that by chance some NumPy methods, like mean, std, and sum, will exclude NAs on Series input by default:

    In [87]: np.mean(df['one'])
    Out[87]: -0.27221094480450114

    In [88]: np.mean(df['one'].values)
    Out[88]: nan

Series.nunique() will return the number of unique non-NA values in a Series:

    In [89]: series = pd.Series(np.random.randn(500))

    In [90]: series[20:500] = np.nan

    In [91]: series[10:20] = 5

    In [92]: series.nunique()
    Out[92]: 11

Summarizing data: describe

There is a convenient describe() function which computes a variety of summary statistics about a Series or the columns of a DataFrame (excluding NAs of course):

    In [93]: series = pd.Series(np.random.randn(1000))

    In [94]: series[::2] = np.nan

    In [95]: series.describe()
    Out[95]:
    count    500.000000
    mean      -0.032127
    std        1.067484
    min       -3.463789
    25%       -0.725523
    50%       -0.053230
    75%        0.679790
    max        3.120271
    dtype: float64

    In [96]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])

    In [97]: frame.iloc[::2] = np.nan

    In [98]: frame.describe()
    Out[98]:
                    a           b           c           d           e
    count  500.000000  500.000000  500.000000  500.000000  500.000000
    mean    -0.045109   -0.052045    0.024520    0.006117    0.001141
    std      1.029268    1.002320    1.042793    1.040134    1.005207
    min     -2.915767   -3.294023   -3.610499   -2.907036   -3.010899
    25%     -0.763783   -0.720389   -0.609600   -0.665896   -0.682900
    50%     -0.086033   -0.048843    0.006093    0.043191   -0.001651
    75%      0.663399    0.620980    0.728382    0.735973    0.656439
    max      3.400646    2.925597    3.416896    3.331522    3.007143

You can select specific percentiles to include in the output:

    In [99]: series.describe(percentiles=[.05, .25, .75, .95])
    Out[99]:
    count    500.000000
    mean      -0.032127
    std        1.067484
    min       -3.463789
    5%        -1.733545
    25%       -0.725523
    50%       -0.053230
    75%        0.679790
    95%        1.854383
    max        3.120271
    dtype: float64

By default, the median is always included.
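For instance, leaving 0.5 out of the requested percentiles still yields a 50% row:

    # the 50th percentile (median) is added even though it is not requested
    series.describe(percentiles=[.05, .95])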

For a non-numerical Series object, describe() will give a simple summary of the number of unique values and most frequently occurring values:

    In [100]: s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

    In [101]: s.describe()
    Out[101]:
    count     9
    unique    4
    top       a
    freq      5
    dtype: object

Note that on a mixed-type DataFrame object, describe() will restrict the summary to include only numerical columns or, if there are none, only categorical columns:

    In [102]: frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

    In [103]: frame.describe()
    Out[103]:
                  b
    count  4.000000
    mean   1.500000
    std    1.290994
    min    0.000000
    25%    0.750000
    50%    1.500000
    75%    2.250000
    max    3.000000

This behaviour can be controlled by providing a list of types as include/exclude arguments. The special value all can also be used:

    In [104]: frame.describe(include=['object'])
    Out[104]:
              a
    count     4
    unique    2
    top     Yes
    freq      2

    In [105]: frame.describe(include=['number'])
    Out[105]:
                  b
    count  4.000000
    mean   1.500000
    std    1.290994
    min    0.000000
    25%    0.750000
    50%    1.500000
    75%    2.250000
    max    3.000000

    In [106]: frame.describe(include='all')
    Out[106]:
              a         b
    count     4  4.000000
    unique    2       NaN
    top     Yes       NaN
    freq      2       NaN
    mean    NaN  1.500000
    std     NaN  1.290994
    min     NaN  0.000000
    25%     NaN  0.750000
    50%     NaN  1.500000
    75%     NaN  2.250000
    max     NaN  3.000000

That feature relies on select_dtypes(); refer to it for details about accepted inputs.
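
A minimal sketch of the underlying selection, using the mixed-type frame defined above:

    frame.select_dtypes(include=['number'])  # just the numeric column 'b'
    frame.select_dtypes(include=['object'])  # just the object (string) column 'a'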

Index of Min/Max Values

The idxmin() and idxmax() functions on Series and DataFrame compute the index labels with the minimum and maximum corresponding values:

    In [107]: s1 = pd.Series(np.random.randn(5))

    In [108]: s1
    Out[108]:
    0   -1.649461
    1    0.169660
    2    1.246181
    3    0.131682
    4   -2.001988
    dtype: float64

    In [109]: s1.idxmin(), s1.idxmax()
    Out[109]: (4, 2)

    In [110]: df1 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

    In [111]: df1
    Out[111]:
              A         B         C
    0 -1.273023  0.870502  0.214583
    1  0.088452 -0.173364  1.207466
    2  0.546121  0.409515 -0.310515
    3  0.585014 -0.490528 -0.054639
    4 -0.239226  0.701089  0.228656

    In [112]: df1.idxmin(axis=0)
    Out[112]:
    A    0
    B    3
    C    2
    dtype: int64

    In [113]: df1.idxmax(axis=1)
    Out[113]:
    0    B
    1    C
    2    A
    3    A
    4    B
    dtype: object

When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and idxmax() return the first matching index:

    In [114]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))

    In [115]: df3
    Out[115]:
         A
    e  2.0
    d  1.0
    c  1.0
    b  3.0
    a  NaN

    In [116]: df3['A'].idxmin()
    Out[116]: 'd'

Note: idxmin and idxmax are called argmin and argmax in NumPy.
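
A minimal sketch of the difference on a hypothetical NaN-free array: the NumPy functions return positions, whereas the pandas methods return index labels.

    arr = np.array([2.0, 1.0, 3.0])
    np.argmin(arr)                              # 1 -- positional index of the minimum
    pd.Series(arr, index=list('xyz')).idxmin()  # 'y' -- index label of the minimum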

Value counts (histogramming) / Mode

The value_counts() Series method and top-level function compute a histogram of a 1D array of values. The top-level function can also be used on regular arrays:

    In [117]: data = np.random.randint(0, 7, size=50)

    In [118]: data
    Out[118]:
    array([3, 3, 0, 2, 1, 0, 5, 5, 3, 6, 1, 5, 6, 2, 0, 0, 6, 3, 3, 5, 0, 4, 3,
           3, 3, 0, 6, 1, 3, 5, 5, 0, 4, 0, 6, 3, 6, 5, 4, 3, 2, 1, 5, 0, 1, 1,
           6, 4, 1, 4])

    In [119]: s = pd.Series(data)

    In [120]: s.value_counts()
    Out[120]:
    3    11
    0     9
    5     8
    6     7
    1     7
    4     5
    2     3
    dtype: int64

    In [121]: pd.value_counts(data)
    Out[121]:
    3    11
    0     9
    5     8
    6     7
    1     7
    4     5
    2     3
    dtype: int64

Similarly, you can get the most frequently occurring value(s) (the mode) of the values in a Series or DataFrame:

    In [122]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

    In [123]: s5.mode()
    Out[123]:
    0    3
    1    7
    dtype: int64

    In [124]: df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
       .....:                     "B": np.random.randint(-10, 15, size=50)})
       .....:

    In [125]: df5.mode()
    Out[125]:
       A  B
    0  2 -5

Discretization and quantiling

Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on sample quantiles) functions:

    In [126]: arr = np.random.randn(20)

    In [127]: factor = pd.cut(arr, 4)

    In [128]: factor
    Out[128]:
    [(-2.611, -1.58], (0.473, 1.499], (-2.611, -1.58], (-1.58, -0.554], (-0.554, 0.473], ..., (0.473, 1.499], (0.473, 1.499], (-0.554, 0.473], (-0.554, 0.473], (-0.554, 0.473]]
    Length: 20
    Categories (4, interval[float64]): [(-2.611, -1.58] < (-1.58, -0.554] < (-0.554, 0.473] <
                                        (0.473, 1.499]]

    In [129]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])

    In [130]: factor
    Out[130]:
    [(-5, -1], (0, 1], (-5, -1], (-1, 0], (-1, 0], ..., (1, 5], (1, 5], (-1, 0], (-1, 0], (-1, 0]]
    Length: 20
    Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]

qcut() computes sample quantiles. For example, we could slice up some normally distributed data into equal-size quartiles like so:

    In [131]: arr = np.random.randn(30)

    In [132]: factor = pd.qcut(arr, [0, .25, .5, .75, 1])

    In [133]: factor
    Out[133]:
    [(0.544, 1.976], (0.544, 1.976], (-1.255, -0.375], (0.544, 1.976], (-0.103, 0.544], ..., (-0.103, 0.544], (0.544, 1.976], (-0.103, 0.544], (-1.255, -0.375], (-0.375, -0.103]]
    Length: 30
    Categories (4, interval[float64]): [(-1.255, -0.375] < (-0.375, -0.103] < (-0.103, 0.544] <
                                        (0.544, 1.976]]

    In [134]: pd.value_counts(factor)
    Out[134]:
    (0.544, 1.976]      8
    (-1.255, -0.375]    8
    (-0.103, 0.544]     7
    (-0.375, -0.103]    7
    dtype: int64

We can also pass infinite values to define the bins:

    In [135]: arr = np.random.randn(20)

    In [136]: factor = pd.cut(arr, [-np.inf, 0, np.inf])

    In [137]: factor
    Out[137]:
    [(0.0, inf], (0.0, inf], (0.0, inf], (0.0, inf], (-inf, 0.0], ..., (-inf, 0.0], (-inf, 0.0], (0.0, inf], (-inf, 0.0], (0.0, inf]]
    Length: 20
    Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]