Missing data basics

When / why does data become missing?

Some might quibble over our usage of “missing”. By “missing” we simply mean NA (“not available”), that is, “not present for whatever reason”. Many data sets simply arrive with missing data, either because the data exists but was not collected or because it never existed. For example, in a collection of financial time series, some of the series might start on different dates. Values prior to a series' start date would then generally be marked as missing.
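The time series case can be sketched as follows. This is only an illustration: the column names, dates, and prices are made up, and it assumes that building a DataFrame from two Series aligns them on the union of their indexes.

```python
import pandas as pd

# Hypothetical price series; the names, dates, and values are made up.
early = pd.Series([10.0, 10.5, 11.0, 10.8],
                  index=pd.date_range("2024-01-01", periods=4, freq="D"))
late = pd.Series([20.0, 20.4],
                 index=pd.date_range("2024-01-03", periods=2, freq="D"))

# Building a DataFrame aligns both series on the union of their dates.
prices = pd.DataFrame({"EARLY": early, "LATE": late})

# LATE has no observations for the first two dates, so those cells are NaN.
print(prices)
print(prices["LATE"].isna().sum())  # number of missing values in LATE
```

The later-starting series is padded with NaN for the dates before its first observation, which is the situation described above.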

In pandas, one of the most common ways that missing data is introduced into a data set is by reindexing. For example (with NumPy and pandas imported as np and pd):

    In [1]: df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
       ...:                   columns=['one', 'two', 'three'])
       ...:

    In [2]: df['four'] = 'bar'

    In [3]: df['five'] = df['one'] > 0

    In [4]: df
    Out[4]:
            one       two     three four   five
    a -0.166778  0.501113 -0.355322  bar  False
    c -0.337890  0.580967  0.983801  bar  False
    e  0.057802  0.761948 -0.712964  bar   True
    f -0.443160 -0.974602  1.047704  bar  False
    h -0.717852 -1.053898 -0.019369  bar  False

    In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

    In [6]: df2
    Out[6]:
            one       two     three four   five
    a -0.166778  0.501113 -0.355322  bar  False
    b       NaN       NaN       NaN  NaN    NaN
    c -0.337890  0.580967  0.983801  bar  False
    d       NaN       NaN       NaN  NaN    NaN
    e  0.057802  0.761948 -0.712964  bar   True
    f -0.443160 -0.974602  1.047704  bar  False
    g       NaN       NaN       NaN  NaN    NaN
    h -0.717852 -1.053898 -0.019369  bar  False

Values considered “missing”

As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. While NaN is the default missing value marker for reasons of computational speed and convenience, we need to be able to easily detect this value with data of different types: floating point, integer, boolean, and general object. In many cases, however, the Python None will arise and we wish to also consider that “missing” or “not available” or “NA”.

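Note: If you want to consider inf and -inf to be “NA” in computations, you can set pandas.options.mode.use_inf_as_na = True. This option has been deprecated since pandas 2.1, so newer code should convert infinities to NaN explicitly instead.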
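An alternative that does not rely on a global option is to convert infinities to NaN explicitly before checking for missing values. A minimal sketch, with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.inf, -np.inf, np.nan])

# By default, infinities are not counted as missing.
default_mask = s.isna()

# Explicitly map both infinities to NaN, then check again.
cleaned = s.replace([np.inf, -np.inf], np.nan)
cleaned_mask = cleaned.isna()

print(default_mask.tolist())
print(cleaned_mask.tolist())
```

The replacement is local to the one Series, so other code in the same process keeps the default treatment of inf.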

To make detecting missing values easier (and across different array dtypes), pandas provides the isna() and notna() functions, which are also methods on Series and DataFrame objects:

    In [7]: df2['one']
    Out[7]:
    a   -0.166778
    b         NaN
    c   -0.337890
    d         NaN
    e    0.057802
    f   -0.443160
    g         NaN
    h   -0.717852
    Name: one, dtype: float64

    In [8]: pd.isna(df2['one'])
    Out[8]:
    a    False
    b     True
    c    False
    d     True
    e    False
    f    False
    g     True
    h    False
    Name: one, dtype: bool

    In [9]: df2['four'].notna()
    Out[9]:
    a     True
    b    False
    c     True
    d    False
    e     True
    f     True
    g    False
    h     True
    Name: four, dtype: bool

    In [10]: df2.isna()
    Out[10]:
         one    two  three   four   five
    a  False  False  False  False  False
    b   True   True   True   True   True
    c  False  False  False  False  False
    d   True   True   True   True   True
    e  False  False  False  False  False
    f  False  False  False  False  False
    g   True   True   True   True   True
    h  False  False  False  False  False
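The same check works when the missing marker is not a float NaN. The sketch below uses made-up values and assumes that None in an object column and NaT in a datetime column are both reported as missing by isna():

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.5, np.nan])
objects = pd.Series(["a", None], dtype=object)  # None kept as the marker
dates = pd.Series(pd.to_datetime(["2024-01-01", None]))  # None becomes NaT

# isna() gives the same answer regardless of which marker the dtype uses.
for s in (floats, objects, dates):
    print(s.dtype, s.isna().tolist())
```

Because the marker differs by dtype, calling isna() or notna() is more robust than testing for one specific sentinel value.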

Warning

Keep in mind that in Python (and NumPy), NaN values don’t compare equal to each other, whereas None compares equal to None. Note that pandas/NumPy relies on the fact that np.nan != np.nan, and treats None like np.nan.

    In [11]: None == None
    Out[11]: True

    In [12]: np.nan == np.nan
    Out[12]: False

So, in contrast to isna() and notna() above, a scalar equality comparison against None or np.nan doesn’t provide useful information:

    In [13]: df2['one'] == np.nan
    Out[13]:
    a    False
    b    False
    c    False
    d    False
    e    False
    f    False
    g    False
    h    False
    Name: one, dtype: bool
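To actually locate the missing entries, use isna() rather than an equality test. A small sketch on a made-up Series (the values and index labels are illustrative only):

```python
import numpy as np
import pandas as pd

s = pd.Series([-0.17, np.nan, 0.06], index=["a", "b", "c"])

eq_mask = s == np.nan   # False everywhere, even where the value is NaN
na_mask = s.isna()      # True exactly where the value is missing

# Boolean indexing with the isna() mask selects the missing rows.
missing_rows = s[na_mask]

print(eq_mask.tolist())
print(na_mask.tolist())
print(list(missing_rows.index))
```

The equality mask selects nothing, while the isna() mask picks out the one missing row.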