Returning a view versus a copy

When setting values in a pandas object, care must be taken to avoid what is called chained indexing. Here is an example.

  In [339]: dfmi = pd.DataFrame([list('abcd'),
     .....:                      list('efgh'),
     .....:                      list('ijkl'),
     .....:                      list('mnop')],
     .....:                     columns=pd.MultiIndex.from_product([['one','two'],
     .....:                                                         ['first','second']]))
     .....:

  In [340]: dfmi
  Out[340]:
      one          two
    first second first second
  0     a      b     c      d
  1     e      f     g      h
  2     i      j     k      l
  3     m      n     o      p

Compare these two access methods:

  In [341]: dfmi['one']['second']
  Out[341]:
  0    b
  1    f
  2    j
  3    n
  Name: second, dtype: object

  In [342]: dfmi.loc[:,('one','second')]
  Out[342]:
  0    b
  1    f
  2    j
  3    n
  Name: (one, second), dtype: object

These both yield the same results, so which should you use? It is instructive to understand the order of operations on these and why method 2 (.loc) is much preferred over method 1 (chained []).

dfmi['one'] selects the first level of the columns and returns a DataFrame that is singly-indexed. Then a second Python operation, dfmi_with_one['second'], selects the Series indexed by 'second'. The intermediate result is written here as the variable dfmi_with_one because pandas sees these operations as separate events, i.e. separate calls to __getitem__, so it has to treat them as linear operations that happen one after another.

Contrast this with dfmi.loc[:,('one','second')], which passes a nested tuple (slice(None),('one','second')) to a single call to __getitem__. This allows pandas to deal with the request as a single entity. Furthermore, this order of operations can be significantly faster, and it allows one to index both axes if so desired.
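As a minimal sketch of the two access paths described above, the following rebuilds dfmi and writes the chained form out step by step. The names chained, single, and same_values are introduced here only for illustration; dfmi_with_one is the intermediate name used in the text.

```python
import pandas as pd

dfmi = pd.DataFrame(
    [list("abcd"), list("efgh"), list("ijkl"), list("mnop")],
    columns=pd.MultiIndex.from_product([["one", "two"], ["first", "second"]]),
)

# Method 1: two separate __getitem__ calls. The first returns a
# singly-indexed DataFrame, the second selects a column from it.
dfmi_with_one = dfmi["one"]
chained = dfmi_with_one["second"]

# Method 2: a single call on .loc with a (rows, columns) tuple.
single = dfmi.loc[:, ("one", "second")]

# Both paths yield the same values; only the Series name differs.
same_values = chained.tolist() == single.tolist()
print(same_values)
```

Reading gives the same values either way; the difference between the two forms only becomes a correctness issue once you assign.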

Why does assignment fail when using chained indexing?

The problem in the previous section is just a performance issue. What’s up with the SettingWithCopy warning? We don’t usually throw warnings around when you do something that might cost a few extra milliseconds!

But it turns out that assigning to the product of chained indexing has inherently unpredictable results. To see this, think about how the Python interpreter executes this code:

  dfmi.loc[:,('one','second')] = value
  # becomes
  dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)

But this code is handled differently:

  dfmi['one']['second'] = value
  # becomes
  dfmi.__getitem__('one').__setitem__('second', value)

See that __getitem__ in there? Outside of simple cases, it’s very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the __setitem__ will modify dfmi or a temporary object that gets thrown out immediately afterward. That’s what SettingWithCopy is warning you about!
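The translation into __getitem__ and __setitem__ calls can be observed without pandas at all. The sketch below uses a hypothetical Recorder class (not part of pandas) that only logs which special method Python invokes for each form of assignment.

```python
calls = []

class Recorder:
    """Hypothetical stand-in for a DataFrame; it only logs indexing calls."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, key):
        calls.append((self.name, "__getitem__", key))
        # Return a new object, much as an intermediate view or copy would be.
        return Recorder(f"{self.name}[{key!r}]")

    def __setitem__(self, key, value):
        calls.append((self.name, "__setitem__", key))

obj = Recorder("obj")

# Chained form: a __getitem__ on obj, then a __setitem__ on its result.
obj["one"]["second"] = 0
chained_calls = list(calls)

# Single form: one __setitem__ on obj itself, with a tuple key.
calls.clear()
obj["one", "second"] = 0
single_calls = list(calls)

print(chained_calls)
print(single_calls)
```

In the chained form the assignment lands on whatever the first call returned, which is exactly the object whose view-or-copy status is hard to predict.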

Note: You may be wondering whether we should be concerned about the loc property in the first example. But dfmi.loc is guaranteed to be dfmi itself with modified indexing behavior, so dfmi.loc.__getitem__ / dfmi.loc.__setitem__ operate on dfmi directly. Of course, dfmi.loc.__getitem__(idx) may be a view or a copy of dfmi.

Sometimes a SettingWithCopy warning will arise when there’s no obvious chained indexing going on. These are the bugs that SettingWithCopy is designed to catch! Pandas is probably trying to warn you that you’ve done this:

  def do_something(df):
      foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
      # ... many lines here ...
      foo['quux'] = value       # We don't know whether this will modify df or not!
      return foo

Yikes!
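One way to remove the ambiguity in a function like do_something is to take an explicit copy of the selection, so that later assignments can only affect the new object. This is a sketch under assumptions: the value parameter and the sample frame are added here for illustration and are not part of the original example.

```python
import pandas as pd

def do_something(df, value):
    # .copy() makes foo an independent object, so the assignment below
    # can only ever modify foo, never df.
    foo = df[["bar", "baz"]].copy()
    foo["quux"] = value
    return foo

# Illustrative data; the column "other" is hypothetical.
df = pd.DataFrame({"bar": [1, 2], "baz": [3, 4], "other": [5, 6]})
result = do_something(df, 0)

print(list(result.columns))
print(list(df.columns))
```

If the intent is instead to modify df itself, assign through df.loc in a single step rather than through foo.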

Evaluation order matters

When you use chained indexing, the order and type of the indexing operation partially determine whether the result is a slice into the original object, or a copy of the slice.

Pandas has the SettingWithCopyWarning because assigning to a copy of a slice is frequently not intentional, but a mistake caused by chained indexing returning a copy where a slice was expected.

If you would like pandas to be more or less trusting about assignment to a chained indexing expression, you can set the option mode.chained_assignment to one of these values:

  • 'warn', the default, means a SettingWithCopyWarning is printed.
  • 'raise' means pandas will raise a SettingWithCopyError you have to deal with.
  • None will suppress the warnings entirely.

  In [343]: dfb = pd.DataFrame({'a' : ['one', 'one', 'two',
     .....:                           'three', 'two', 'one', 'six'],
     .....:                     'c' : np.arange(7)})
     .....:

  # This will show the SettingWithCopyWarning
  # but the frame values will be set
  In [344]: dfb['c'][dfb.a.str.startswith('o')] = 42

This however is operating on a copy and will not work.

  >>> pd.set_option('mode.chained_assignment','warn')
  >>> dfb[dfb.a.str.startswith('o')]['c'] = 42
  Traceback (most recent call last)
  ...
  SettingWithCopyWarning:
  A value is trying to be set on a copy of a slice from a DataFrame.
  Try using .loc[row_index,col_indexer] = value instead
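Following the hint in the warning message, the same assignment can be written as a single .loc call that takes the boolean row mask and the column label together. This is a minimal sketch that rebuilds dfb as above; the name mask is introduced here for readability.

```python
import numpy as np
import pandas as pd

dfb = pd.DataFrame({
    "a": ["one", "one", "two", "three", "two", "one", "six"],
    "c": np.arange(7),
})

# Rows whose 'a' value starts with 'o'.
mask = dfb["a"].str.startswith("o")

# One __setitem__ call on dfb.loc: row selector and column label together,
# so the assignment is applied to dfb itself.
dfb.loc[mask, "c"] = 42

print(dfb["c"].tolist())
```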

A chained assignment can also crop up in setting in a mixed dtype frame.

Note: These setting rules apply to all of .loc/.iloc.

This is the correct access method:

  In [345]: dfc = pd.DataFrame({'A':['aaa','bbb','ccc'],'B':[1,2,3]})

  In [346]: dfc.loc[0,'A'] = 11

  In [347]: dfc
  Out[347]:
       A  B
  0   11  1
  1  bbb  2
  2  ccc  3

This can work at times, but it is not guaranteed to, and therefore should be avoided:

  In [348]: dfc = dfc.copy()

  In [349]: dfc['A'][0] = 111

  In [350]: dfc
  Out[350]:
       A  B
  0  111  1
  1  bbb  2
  2  ccc  3

This will not work at all, and so should be avoided:

  >>> pd.set_option('mode.chained_assignment','raise')
  >>> dfc.loc[0]['A'] = 1111
  Traceback (most recent call last)
  ...
  SettingWithCopyError:
  A value is trying to be set on a copy of a slice from a DataFrame.
  Try using .loc[row_index,col_indexer] = value instead

Warning

The chained assignment warnings / exceptions aim to inform the user of a possibly invalid assignment. There may be false positives: situations where a chained assignment is inadvertently reported.