where()方法和Masking

Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that selection output has the same shape as the original data, you can use the where method in Series and DataFrame.

To return only the selected rows:

  1. In [177]: s[s > 0]
  2. Out[177]:
  3. 3 1
  4. 2 2
  5. 1 3
  6. 0 4
  7. dtype: int64

To return a Series of the same shape as the original:

  1. In [178]: s.where(s > 0)
  2. Out[178]:
  3. 4 NaN
  4. 3 1.0
  5. 2 2.0
  6. 1 3.0
  7. 0 4.0
  8. dtype: float64

Selecting values from a DataFrame with a boolean criterion now also preserves input data shape. where is used under the hood as the implementation. The code below is equivalent to df.where(df < 0).

  1. In [179]: df[df < 0]
  2. Out[179]:
  3. A B C D
  4. 2000-01-01 -2.104139 -1.309525 NaN NaN
  5. 2000-01-02 -0.352480 NaN -1.192319 NaN
  6. 2000-01-03 -0.864883 NaN -0.227870 NaN
  7. 2000-01-04 NaN -1.222082 NaN -1.233203
  8. 2000-01-05 NaN -0.605656 -1.169184 NaN
  9. 2000-01-06 NaN -0.948458 NaN -0.684718
  10. 2000-01-07 -2.670153 -0.114722 NaN -0.048048
  11. 2000-01-08 NaN NaN -0.048788 -0.808838

In addition, where takes an optional other argument for replacement of values where the condition is False, in the returned copy.

  1. In [180]: df.where(df < 0, -df)
  2. Out[180]:
  3. A B C D
  4. 2000-01-01 -2.104139 -1.309525 -0.485855 -0.245166
  5. 2000-01-02 -0.352480 -0.390389 -1.192319 -1.655824
  6. 2000-01-03 -0.864883 -0.299674 -0.227870 -0.281059
  7. 2000-01-04 -0.846958 -1.222082 -0.600705 -1.233203
  8. 2000-01-05 -0.669692 -0.605656 -1.169184 -0.342416
  9. 2000-01-06 -0.868584 -0.948458 -2.297780 -0.684718
  10. 2000-01-07 -2.670153 -0.114722 -0.168904 -0.048048
  11. 2000-01-08 -0.801196 -1.392071 -0.048788 -0.808838

You may wish to set values based on some boolean criteria. This can be done intuitively like so:

  1. In [181]: s2 = s.copy()
  2. In [182]: s2[s2 < 0] = 0
  3. In [183]: s2
  4. Out[183]:
  5. 4 0
  6. 3 1
  7. 2 2
  8. 1 3
  9. 0 4
  10. dtype: int64
  11. In [184]: df2 = df.copy()
  12. In [185]: df2[df2 < 0] = 0
  13. In [186]: df2
  14. Out[186]:
  15. A B C D
  16. 2000-01-01 0.000000 0.000000 0.485855 0.245166
  17. 2000-01-02 0.000000 0.390389 0.000000 1.655824
  18. 2000-01-03 0.000000 0.299674 0.000000 0.281059
  19. 2000-01-04 0.846958 0.000000 0.600705 0.000000
  20. 2000-01-05 0.669692 0.000000 0.000000 0.342416
  21. 2000-01-06 0.868584 0.000000 2.297780 0.000000
  22. 2000-01-07 0.000000 0.000000 0.168904 0.000000
  23. 2000-01-08 0.801196 1.392071 0.000000 0.000000

By default, where returns a modified copy of the data. There is an optional parameter inplace so that the original data can be modified without creating a copy:

  1. In [187]: df_orig = df.copy()
  2. In [188]: df_orig.where(df > 0, -df, inplace=True);
  3. In [189]: df_orig
  4. Out[189]:
  5. A B C D
  6. 2000-01-01 2.104139 1.309525 0.485855 0.245166
  7. 2000-01-02 0.352480 0.390389 1.192319 1.655824
  8. 2000-01-03 0.864883 0.299674 0.227870 0.281059
  9. 2000-01-04 0.846958 1.222082 0.600705 1.233203
  10. 2000-01-05 0.669692 0.605656 1.169184 0.342416
  11. 2000-01-06 0.868584 0.948458 2.297780 0.684718
  12. 2000-01-07 2.670153 0.114722 0.168904 0.048048
  13. 2000-01-08 0.801196 1.392071 0.048788 0.808838

Note:The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

  1. In [190]: df.where(df < 0, -df) == np.where(df < 0, df, -df)
  2. Out[190]:
  3. A B C D
  4. 2000-01-01 True True True True
  5. 2000-01-02 True True True True
  6. 2000-01-03 True True True True
  7. 2000-01-04 True True True True
  8. 2000-01-05 True True True True
  9. 2000-01-06 True True True True
  10. 2000-01-07 True True True True
  11. 2000-01-08 True True True True

alignment

Furthermore, where aligns the input boolean condition (ndarray or DataFrame), such that partial selection with setting is possible. This is analogous to partial setting via .loc (but on the contents rather than the axis labels).

  1. In [191]: df2 = df.copy()
  2. In [192]: df2[ df2[1:4] > 0] = 3
  3. In [193]: df2
  4. Out[193]:
  5. A B C D
  6. 2000-01-01 -2.104139 -1.309525 0.485855 0.245166
  7. 2000-01-02 -0.352480 3.000000 -1.192319 3.000000
  8. 2000-01-03 -0.864883 3.000000 -0.227870 3.000000
  9. 2000-01-04 3.000000 -1.222082 3.000000 -1.233203
  10. 2000-01-05 0.669692 -0.605656 -1.169184 0.342416
  11. 2000-01-06 0.868584 -0.948458 2.297780 -0.684718
  12. 2000-01-07 -2.670153 -0.114722 0.168904 -0.048048
  13. 2000-01-08 0.801196 1.392071 -0.048788 -0.808838

Where can also accept axis and level parameters to align the input when performing the where.

  1. In [194]: df2 = df.copy()
  2. In [195]: df2.where(df2>0,df2['A'],axis='index')
  3. Out[195]:
  4. A B C D
  5. 2000-01-01 -2.104139 -2.104139 0.485855 0.245166
  6. 2000-01-02 -0.352480 0.390389 -0.352480 1.655824
  7. 2000-01-03 -0.864883 0.299674 -0.864883 0.281059
  8. 2000-01-04 0.846958 0.846958 0.600705 0.846958
  9. 2000-01-05 0.669692 0.669692 0.669692 0.342416
  10. 2000-01-06 0.868584 0.868584 2.297780 0.868584
  11. 2000-01-07 -2.670153 -2.670153 0.168904 -2.670153
  12. 2000-01-08 0.801196 1.392071 0.801196 0.801196

This is equivalent to (but faster than) the following.

  1. In [196]: df2 = df.copy()
  2. In [197]: df.apply(lambda x, y: x.where(x>0,y), y=df['A'])
  3. Out[197]:
  4. A B C D
  5. 2000-01-01 -2.104139 -2.104139 0.485855 0.245166
  6. 2000-01-02 -0.352480 0.390389 -0.352480 1.655824
  7. 2000-01-03 -0.864883 0.299674 -0.864883 0.281059
  8. 2000-01-04 0.846958 0.846958 0.600705 0.846958
  9. 2000-01-05 0.669692 0.669692 0.669692 0.342416
  10. 2000-01-06 0.868584 0.868584 2.297780 0.868584
  11. 2000-01-07 -2.670153 -2.670153 0.168904 -2.670153
  12. 2000-01-08 0.801196 1.392071 0.801196 0.801196

New in version 0.18.1.

Where can accept a callable as condition and other arguments. The function must be with one argument (the calling Series or DataFrame) and that returns valid output as condition and other argument.

  1. In [198]: df3 = pd.DataFrame({'A': [1, 2, 3],
  2. .....: 'B': [4, 5, 6],
  3. .....: 'C': [7, 8, 9]})
  4. .....:
  5. In [199]: df3.where(lambda x: x > 4, lambda x: x + 10)
  6. Out[199]:
  7. A B C
  8. 0 11 14 7
  9. 1 12 5 8
  10. 2 13 6 9

Mask

mask() is the inverse boolean operation of where.

  1. In [200]: s.mask(s >= 0)
  2. Out[200]:
  3. 4 NaN
  4. 3 NaN
  5. 2 NaN
  6. 1 NaN
  7. 0 NaN
  8. dtype: float64
  9. In [201]: df.mask(df >= 0)
  10. Out[201]:
  11. A B C D
  12. 2000-01-01 -2.104139 -1.309525 NaN NaN
  13. 2000-01-02 -0.352480 NaN -1.192319 NaN
  14. 2000-01-03 -0.864883 NaN -0.227870 NaN
  15. 2000-01-04 NaN -1.222082 NaN -1.233203
  16. 2000-01-05 NaN -0.605656 -1.169184 NaN
  17. 2000-01-06 NaN -0.948458 NaN -0.684718
  18. 2000-01-07 -2.670153 -0.114722 NaN -0.048048
  19. 2000-01-08 NaN NaN -0.048788 -0.808838