Calculations with missing data

Missing values propagate naturally through arithmetic operations between pandas objects.

    In [27]: a
    Out[27]:
            one       two
    a       NaN  0.501113
    c       NaN  0.580967
    e  0.057802  0.761948
    f -0.443160 -0.974602
    h -0.443160 -1.053898

    In [28]: b
    Out[28]:
            one       two     three
    a       NaN  0.501113 -0.355322
    c       NaN  0.580967  0.983801
    e  0.057802  0.761948 -0.712964
    f -0.443160 -0.974602  1.047704
    h       NaN -1.053898 -0.019369

    In [29]: a + b
    Out[29]:
            one  three       two
    a       NaN    NaN  1.002226
    c       NaN    NaN  1.161935
    e  0.115604    NaN  1.523896
    f -0.886321    NaN -1.949205
    h       NaN    NaN -2.107796
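
The propagation shown above can be reproduced with small hand-built frames. The following is a minimal sketch: the names `left` and `right` and all values are illustrative assumptions, not the data from the transcript.

```python
import numpy as np
import pandas as pd

# Hypothetical frames; the values are made up so results are easy to check.
left = pd.DataFrame(
    {"one": [np.nan, 1.0, 2.0], "two": [10.0, 20.0, 30.0]},
    index=["a", "b", "c"],
)
right = pd.DataFrame(
    {
        "one": [np.nan, 1.0, np.nan],
        "two": [1.0, 2.0, 3.0],
        "three": [5.0, 6.0, 7.0],
    },
    index=["a", "b", "c"],
)

# Row and column labels are aligned before the addition; the result
# carries the union of the columns of both frames.
total = left + right

# A cell is NaN when either operand is NaN, or when its column exists
# in only one of the two frames (here: "three").
print(total)
```

In this sketch, column `three` is entirely NaN because `left` has no such column, and `one` is NaN in every row where at least one input is missing.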

The descriptive statistics and computational methods discussed in the data structure overview (and listed in the Series and DataFrame API references) are all written to account for missing data. For example:

  • When summing data, NA (missing) values will be treated as zero.
  • If the data are all NA, the result will be 0.
  • Cumulative methods like cumsum() and cumprod() ignore NA values by default, but preserve them in the resulting arrays. To override this behaviour and include NA values, use skipna=False.

    In [30]: df
    Out[30]:
            one       two     three
    a       NaN  0.501113 -0.355322
    c       NaN  0.580967  0.983801
    e  0.057802  0.761948 -0.712964
    f -0.443160 -0.974602  1.047704
    h       NaN -1.053898 -0.019369

    In [31]: df['one'].sum()
    Out[31]: -0.38535826528461409

    In [32]: df.mean(1)
    Out[32]:
    a    0.072895
    c    0.782384
    e    0.035595
    f   -0.123353
    h   -0.536633
    dtype: float64

    In [33]: df.cumsum()
    Out[33]:
            one       two     three
    a       NaN  0.501113 -0.355322
    c       NaN  1.082080  0.628479
    e  0.057802  1.844028 -0.084485
    f -0.385358  0.869426  0.963219
    h       NaN -0.184472  0.943850

    In [34]: df.cumsum(skipna=False)
    Out[34]:
       one       two     three
    a  NaN  0.501113 -0.355322
    c  NaN  1.082080  0.628479
    e  NaN  1.844028 -0.084485
    f  NaN  0.869426  0.963219
    h  NaN -0.184472  0.943850
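
The effect of `skipna` is easier to verify by hand on a short series. This is a minimal sketch with a hypothetical series `s`; the values are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing entry.
s = pd.Series([1.0, np.nan, 2.0, 4.0])

# Reductions skip NaN by default.
total = s.sum()      # 1 + 2 + 4
average = s.mean()   # mean over the three non-missing values

# cumsum() skips NaN in the running total but keeps it in its position.
running = s.cumsum()

# With skipna=False, the first NaN makes every later entry NaN.
strict = s.cumsum(skipna=False)

# The same flag makes a plain reduction return NaN.
strict_total = s.sum(skipna=False)

print(total, average)
print(running)
print(strict)
print(strict_total)
```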

Sum/Prod of Empties/NaNs

Warning

This behavior is standard as of v0.22.0 and is consistent with the default in NumPy; previously, the sum or product of an all-NA or empty Series or DataFrame returned NaN. See the v0.22.0 whatsnew for more.

The sum of an empty or all-NA Series or column of a DataFrame is 0.

    In [35]: pd.Series([np.nan]).sum()
    Out[35]: 0.0

    In [36]: pd.Series([], dtype="float64").sum()
    Out[36]: 0.0

The product of an empty or all-NA Series or column of a DataFrame is 1.

    In [37]: pd.Series([np.nan]).prod()
    Out[37]: 1.0

    In [38]: pd.Series([], dtype="float64").prod()
    Out[38]: 1.0
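
The defaults described above can be checked directly. The sketch below also uses the `min_count` parameter of `sum()` and `prod()`, which is not covered in the text above; with `min_count=1` the reduction returns NaN unless at least one non-NA value is present. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

all_na = pd.Series([np.nan])
# An explicit dtype keeps the empty Series floating-point.
empty = pd.Series([], dtype="float64")

# Defaults: the empty sum is 0 and the empty product is 1.
sum_all_na = all_na.sum()
sum_empty = empty.sum()
prod_all_na = all_na.prod()
prod_empty = empty.prod()

# min_count=1 demands at least one non-NA value, so these are NaN.
sum_needs_one = all_na.sum(min_count=1)
prod_needs_one = empty.prod(min_count=1)

print(sum_all_na, sum_empty, prod_all_na, prod_empty)
print(sum_needs_one, prod_needs_one)
```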

NA values in GroupBy

NA groups in GroupBy are automatically excluded, which is consistent with the behavior in R. For example:

    In [39]: df
    Out[39]:
            one       two     three
    a       NaN  0.501113 -0.355322
    c       NaN  0.580967  0.983801
    e  0.057802  0.761948 -0.712964
    f -0.443160 -0.974602  1.047704
    h       NaN -1.053898 -0.019369

    In [40]: df.groupby('one').mean()
    Out[40]:
                    two     three
    one
    -0.443160 -0.974602  1.047704
     0.057802  0.761948 -0.712964
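
The same exclusion can be seen with a string key, which may be easier to read than grouping on floats. This is a minimal sketch; the frame `frame` and its columns `key` and `value` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the grouping column "key" has one missing entry.
frame = pd.DataFrame(
    {"key": ["x", np.nan, "y", "x"], "value": [1.0, 2.0, 3.0, 5.0]}
)

# The row whose key is NaN belongs to no group, so its value (2.0)
# does not contribute to any group mean.
means = frame.groupby("key")["value"].mean()

print(means)
```

Only the groups `x` and `y` appear in the result; the row with the missing key is dropped rather than collected into a NaN group.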

See the groupby section here for more information.