Miscellaneous indexing FAQ

Integer indexing

Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .loc. The following code will generate exceptions:

  import numpy as np
  import pandas as pd

  s = pd.Series(range(5))
  s[-1]  # KeyError: -1 is not a label in the index
  df = pd.DataFrame(np.random.randn(5, 4))
  df
  df.loc[-2:]

This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).
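When positional access really is what you want on an integer-labelled axis, .iloc states that intent explicitly. The following is a minimal sketch; the Series and variable names are illustrative only and do not come from the text above.

```python
import pandas as pd

# Integer labels that deliberately do not coincide with positions.
s = pd.Series([10, 20, 30], index=[5, 6, 7])

last_by_position = s.iloc[-1]  # position-based: the last element
by_label = s.loc[5]            # label-based: the element labelled 5

# -1 is not a label in the index, so a label-based lookup fails
# instead of silently falling back to a position.
try:
    s.loc[-1]
    missing_label_raises = False
except KeyError:
    missing_label_raises = True
```

Keeping .loc for labels and .iloc for positions avoids the ambiguity described above.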

Non-monotonic indexes require exact matches

If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing and is_monotonic_decreasing attributes.

  In [183]: df = pd.DataFrame(index=[2,3,3,4,5], columns=['data'], data=list(range(5)))

  In [184]: df.index.is_monotonic_increasing
  Out[184]: True

  # no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
  In [185]: df.loc[0:4, :]
  Out[185]:
     data
  2     0
  3     1
  3     2
  4     3

  # slice is outside the index, so an empty DataFrame is returned
  In [186]: df.loc[13:15, :]
  Out[186]:
  Empty DataFrame
  Columns: [data]
  Index: []

On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.

  In [187]: df = pd.DataFrame(index=[2,3,1,4,3,5], columns=['data'], data=list(range(6)))

  In [188]: df.index.is_monotonic_increasing
  Out[188]: False

  # OK because 2 and 4 are in the index
  In [189]: df.loc[2:4, :]
  Out[189]:
     data
  2     0
  3     1
  1     2
  4     3

  # 0 is not in the index
  In [9]: df.loc[0:4, :]
  KeyError: 0

  # 3 is not a unique label
  In [11]: df.loc[2:3, :]
  KeyError: 'Cannot get right slice bound for non-unique label: 3'
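If the row order does not matter, one way around this restriction is to sort the index first, which makes it monotonic so that out-of-range slice bounds are accepted again. This is only a sketch of one possible workaround, not a recommendation from the text above; the name sorted_df is illustrative.

```python
import pandas as pd

df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=['data'], data=list(range(6)))

# Sorting makes the index monotonically increasing; duplicate labels are fine.
sorted_df = df.sort_index()
is_monotonic = sorted_df.index.is_monotonic_increasing

# 0 is still not in the index, but the slice now succeeds
# because the index is monotonic.
result = sorted_df.loc[0:4, :]
result_labels = list(result.index)
```

Note that sorting changes the row order, so this only applies when the original ordering carries no meaning.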

The Index.is_monotonic_increasing and Index.is_monotonic_decreasing attributes only check that an index is weakly monotonic. To check for strict monotonicity, combine one of them with the Index.is_unique attribute.

  In [190]: weakly_monotonic = pd.Index(['a', 'b', 'c', 'c'])

  In [191]: weakly_monotonic
  Out[191]: Index(['a', 'b', 'c', 'c'], dtype='object')

  In [192]: weakly_monotonic.is_monotonic_increasing
  Out[192]: True

  In [193]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
  Out[193]: False
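The same check can be wrapped in a small function if it is needed in several places. The helper name is_strictly_increasing below is hypothetical and is not part of pandas; it simply combines the two attributes shown above.

```python
import pandas as pd


def is_strictly_increasing(index: pd.Index) -> bool:
    """Hypothetical helper: True only if the index is increasing with no repeats."""
    # Weak monotonicity plus uniqueness is equivalent to strict monotonicity.
    return bool(index.is_monotonic_increasing and index.is_unique)


weak = is_strictly_increasing(pd.Index(['a', 'b', 'c', 'c']))  # repeated 'c'
strict = is_strictly_increasing(pd.Index(['a', 'b', 'c']))     # no repeats
```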

Endpoints are inclusive

Compared with standard Python sequence slicing in which the slice endpoint is not inclusive, label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the “successor” or next element after a particular label in an index. For example, consider the following Series:

  In [194]: s = pd.Series(np.random.randn(6), index=list('abcdef'))

  In [195]: s
  Out[195]:
  a    0.112246
  b    0.871721
  c   -0.816064
  d   -0.784880
  e    1.030659
  f    0.187483
  dtype: float64

Suppose we wished to slice from c to e. Using integers, this would be accomplished as follows:

  In [196]: s[2:5]
  Out[196]:
  c   -0.816064
  d   -0.784880
  e    1.030659
  dtype: float64

However, if you only had c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:

  s.loc['c':'e'+1]

A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice that label-based slicing includes both endpoints:

  In [197]: s.loc['c':'e']
  Out[197]:
  c   -0.816064
  d   -0.784880
  e    1.030659
  dtype: float64

This is most definitely a “practicality beats purity” sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly in the way that standard Python integer slicing works.
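The time series use case mentioned above can be sketched as follows. The dates and values are made up for illustration; the point is only the contrast between an inclusive label slice and an exclusive positional slice.

```python
import pandas as pd

# A hypothetical daily series covering one week.
dates = pd.date_range('2020-01-01', periods=7, freq='D')
ts = pd.Series(range(7), index=dates)

# Label-based slice: both the start and the end date are included.
by_label = ts.loc['2020-01-03':'2020-01-05']

# Position-based slice: the stop position is excluded, as with Python lists.
by_position = ts.iloc[2:4]

label_values = list(by_label)
position_values = list(by_position)
```

Here the label slice covers three days while the positional slice with a similar-looking pair of bounds covers only two.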

Indexing potentially changes underlying Series dtype

Different indexing operations can potentially change the dtype of a Series.

  In [198]: series1 = pd.Series([1, 2, 3])

  In [199]: series1.dtype
  Out[199]: dtype('int64')

  In [200]: res = series1.reindex([0, 4])

  In [201]: res.dtype
  Out[201]: dtype('float64')

  In [202]: res
  Out[202]:
  0    1.0
  4    NaN
  dtype: float64

  In [203]: series2 = pd.Series([True])

  In [204]: series2.dtype
  Out[204]: dtype('bool')

  In [205]: res = series2.reindex_like(series1)

  In [206]: res.dtype
  Out[206]: dtype('O')

  In [207]: res
  Out[207]:
  0    True
  1     NaN
  2     NaN
  dtype: object

This is because the (re)indexing operations above silently insert NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.
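One way to avoid the silent dtype change, when a sensible default exists for the missing labels, is to pass fill_value to reindex so that no NaN is inserted. This is a sketch under that assumption; the fill values 0 and False are arbitrary choices for illustration, and whether they are appropriate depends on the data.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3])
series2 = pd.Series([True])

# Missing labels get 0 instead of NaN, so no float upcast is needed.
res_int = series1.reindex([0, 4], fill_value=0)

# Missing labels get False instead of NaN, so no object upcast is needed.
res_bool = series2.reindex(series1.index, fill_value=False)

int_dtype = str(res_int.dtype)
bool_dtype = str(res_bool.dtype)
```

With the dtype preserved, the result can be passed to ufuncs such as numpy.logical_and without first cleaning up NaN values.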

See this old issue for a more detailed discussion.