Index types

We have discussed MultiIndex in the previous sections pretty extensively. DatetimeIndex and PeriodIndex are shown here, and information about TimedeltaIndex is found here.

In the following sub-sections we will highlight some other index types.

CategoricalIndex

CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.

    In [125]: from pandas.api.types import CategoricalDtype

    In [126]: df = pd.DataFrame({'A': np.arange(6),
       .....:                    'B': list('aabbca')})
       .....:

    In [127]: df['B'] = df['B'].astype(CategoricalDtype(list('cab')))

    In [128]: df
    Out[128]:
       A  B
    0  0  a
    1  1  a
    2  2  b
    3  3  b
    4  4  c
    5  5  a

    In [129]: df.dtypes
    Out[129]:
    A       int64
    B    category
    dtype: object

    In [130]: df.B.cat.categories
    Out[130]: Index(['c', 'a', 'b'], dtype='object')

Setting the index will create a CategoricalIndex.

    In [131]: df2 = df.set_index('B')

    In [132]: df2.index
    Out[132]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
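As a rough illustration of why this storage is compact, the sketch below builds a CategoricalIndex directly (rather than through set_index) and inspects its integer codes. The variable names are made up for the example.

```python
import pandas as pd

# Six labels, but only three distinct categories.
labels = list("aabbca")
idx = pd.CategoricalIndex(labels, categories=list("cab"), name="B")

# Each label is stored as a small integer code pointing into `categories`.
codes = idx.codes.tolist()
categories = idx.categories.tolist()

print(codes)       # [1, 1, 2, 2, 0, 1]
print(categories)  # ['c', 'a', 'b']
```

The repeated labels are stored once in categories, and the index itself only keeps the codes.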

Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must be among the categories, or the operation will raise a KeyError.

    In [133]: df2.loc['a']
    Out[133]:
       A
    B
    a  0
    a  1
    a  5
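The KeyError case mentioned above can be sketched as follows. The frame is rebuilt here so the snippet runs on its own, and the label 'z' is simply an arbitrary value that is not one of the categories.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {"A": np.arange(6)},
    index=pd.CategoricalIndex(list("aabbca"), categories=list("cab"), name="B"),
)

# 'a' is a category, so every matching row is returned.
rows_a = df2.loc["a"]

# 'z' is not one of the categories, so the lookup fails.
try:
    df2.loc["z"]
    missing_raises = False
except KeyError:
    missing_raises = True
```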

The CategoricalIndex is preserved after indexing:

    In [134]: df2.loc['a'].index
    Out[134]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Sorting the index will sort by the order of the categories (recall that we created the index with CategoricalDtype(list('cab')), so the sorted order is cab).

    In [135]: df2.sort_index()
    Out[135]:
       A
    B
    c  4
    a  0
    a  1
    a  5
    b  2
    b  3
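To make the difference from ordinary lexical sorting explicit, this small sketch compares the category-driven order with the order obtained after casting the labels to plain strings. It rebuilds the frame so it is self-contained.

```python
import numpy as np
import pandas as pd

idx = pd.CategoricalIndex(list("aabbca"), categories=list("cab"), name="B")
df2 = pd.DataFrame({"A": np.arange(6)}, index=idx)

# Sorting follows the position of each label in `categories` (c, a, b).
by_category = df2.sort_index().index.tolist()

# Casting the labels to plain strings first gives lexical order instead.
lexical = sorted(idx.astype(str))
```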

Groupby operations on the index will preserve the index nature as well.

    In [136]: df2.groupby(level=0).sum()
    Out[136]:
       A
    B
    c  4
    a  6
    b  5

    In [137]: df2.groupby(level=0).sum().index
    Out[137]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
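A self-contained sketch of the same check is given below. Passing observed=False explicitly is an assumption made for this example, so that the result does not depend on the default of the installed pandas version; it is not part of the original session.

```python
import numpy as np
import pandas as pd

idx = pd.CategoricalIndex(list("aabbca"), categories=list("cab"), name="B")
df2 = pd.DataFrame({"A": np.arange(6)}, index=idx)

# Group on the index level; observed=False keeps one row per category.
summed = df2.groupby(level=0, observed=False).sum()

index_is_categorical = isinstance(summed.index, pd.CategoricalIndex)
totals = summed["A"].tolist()  # rows follow the category order c, a, b
```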

Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex any pandas index.

    In [138]: df2.reindex(['a','e'])
    Out[138]:
         A
    B
    a  0.0
    a  1.0
    a  5.0
    e  NaN

    In [139]: df2.reindex(['a','e']).index
    Out[139]: Index(['a', 'a', 'a', 'e'], dtype='object', name='B')

    In [140]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))
    Out[140]:
         A
    B
    a  0.0
    a  1.0
    a  5.0
    e  NaN

    In [141]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index
    Out[141]: CategoricalIndex(['a', 'a', 'a', 'e'], categories=['a', 'b', 'c', 'd', 'e'], ordered=False, name='B', dtype='category')

Warning

Reshaping and comparison operations between CategoricalIndex objects require the same categories; otherwise a TypeError will be raised.

    In [9]: df3 = pd.DataFrame({'A' : np.arange(6),
                                'B' : pd.Series(list('aabbca')).astype('category')})

    In [10]: df3 = df3.set_index('B')

    In [11]: df3.index
    Out[11]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['a', 'b', 'c'], ordered=False, name='B', dtype='category')

    In [12]: pd.concat([df2, df3])
    TypeError: categories must match existing categories when appending
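One way to sidestep the mismatch is to align the categories before combining. The sketch below uses CategoricalIndex.set_categories for that; the frames left and right are hypothetical stand-ins for df2 and df3, and the exact behaviour for mismatched categories may differ between pandas releases.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(
    {"A": np.arange(3)},
    index=pd.CategoricalIndex(list("abc"), categories=list("cab"), name="B"),
)
right = pd.DataFrame(
    {"A": np.arange(3, 6)},
    index=pd.CategoricalIndex(list("bca"), categories=list("abc"), name="B"),
)

# Re-express the right-hand index in terms of the left-hand categories.
right.index = right.index.set_categories(left.index.categories)

# With identical categories the result keeps its categorical index.
combined = pd.concat([left, right])
```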

Int64Index and RangeIndex

Warning

Indexing on an integer-based Index with floats was clarified in 0.18.0; for a summary of the changes, see here.

Int64Index is a fundamental index in pandas. It is an immutable array implementing an ordered, sliceable set. Prior to 0.18.0, Int64Index provided the default index for all NDFrame objects.

RangeIndex is a sub-class of Int64Index added in version 0.18.0, now providing the default index for all NDFrame objects. RangeIndex is an optimized version of Int64Index that can represent a monotonic ordered set. These are analogous to Python range types.
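A brief sketch of both points, the default index and the analogy with Python's range, might look like this (the variable names are illustrative only):

```python
import pandas as pd

# No index is given, so pandas supplies the default one.
s = pd.Series([10, 20, 30])
default_is_range = isinstance(s.index, pd.RangeIndex)

# start, stop and step mirror the built-in range().
evens = pd.RangeIndex(start=0, stop=10, step=2)
same_as_range = list(evens) == list(range(0, 10, 2))
```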

Float64Index

By default a Float64Index will be automatically created when passing floating, or mixed-integer-floating values in index creation. This enables a pure label-based slicing paradigm that makes [], ix, and loc work exactly the same for scalar indexing and slicing.

    In [142]: indexf = pd.Index([1.5, 2, 3, 4.5, 5])

    In [143]: indexf
    Out[143]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')

    In [144]: sf = pd.Series(range(5), index=indexf)

    In [145]: sf
    Out[145]:
    1.5    0
    2.0    1
    3.0    2
    4.5    3
    5.0    4
    dtype: int64

Scalar selection for [] and .loc will always be label based. An integer will match an equal float index (e.g. 3 is equivalent to 3.0).

    In [146]: sf[3]
    Out[146]: 2

    In [147]: sf[3.0]
    Out[147]: 2

    In [148]: sf.loc[3]
    Out[148]: 2

    In [149]: sf.loc[3.0]
    Out[149]: 2

The only positional indexing is via iloc.

    In [150]: sf.iloc[3]
    Out[150]: 3
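The contrast between label-based and positional scalar access can be condensed into a few lines. This is only a sketch that rebuilds sf so it runs on its own, and it sticks to .loc and .iloc to keep the intent explicit.

```python
import pandas as pd

sf = pd.Series(range(5), index=pd.Index([1.5, 2, 3, 4.5, 5]))

by_label_int = sf.loc[3]      # the integer 3 matches the float label 3.0
by_label_float = sf.loc[3.0]  # same row as above
by_position = sf.iloc[3]      # fourth element, whose label is 4.5
```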

A scalar index that is not found will raise a KeyError. Slicing is primarily on the values of the index when using [], ix, and loc, and always positional when using iloc. The exception is when the slice is boolean, in which case it will always be positional.

    In [151]: sf[2:4]
    Out[151]:
    2.0    1
    3.0    2
    dtype: int64

    In [152]: sf.loc[2:4]
    Out[152]:
    2.0    1
    3.0    2
    dtype: int64

    In [153]: sf.iloc[2:4]
    Out[153]:
    3.0    2
    4.5    3
    dtype: int64

In float indexes, slicing using floats is allowed.

    In [154]: sf[2.1:4.6]
    Out[154]:
    3.0    2
    4.5    3
    dtype: int64

    In [155]: sf.loc[2.1:4.6]
    Out[155]:
    3.0    2
    4.5    3
    dtype: int64
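The three slicing flavours shown so far can be compared side by side. The sketch below rebuilds sf and records which values each slice returns; it uses .loc for the label-based cases.

```python
import pandas as pd

sf = pd.Series(range(5), index=pd.Index([1.5, 2, 3, 4.5, 5]))

# Label-based: both endpoints are labels and both are included.
label_slice = sf.loc[2:4].tolist()

# Positional: positions 2 and 3, with the end position excluded.
position_slice = sf.iloc[2:4].tolist()

# Float bounds need not be labels themselves; they only bracket the labels.
float_slice = sf.loc[2.1:4.6].tolist()
```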

In non-float indexes, slicing using floats will raise a TypeError.

    In [1]: pd.Series(range(5))[3.5]
    TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)

    In [1]: pd.Series(range(5))[3.5:4.5]
    TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)

Warning

Using a scalar float indexer for .iloc has been removed in 0.18.0, so the following will raise a TypeError:

    In [3]: pd.Series(range(5)).iloc[3.0]
    TypeError: cannot do positional indexing on <class 'pandas.indexes.range.RangeIndex'> with these indexers [3.0] of <type 'float'>
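A small helper makes it easy to check these failures programmatically. The helper raises_type_error is hypothetical and exists only for this sketch; the exact error messages, and the exception raised for a scalar float label, may differ between pandas versions, so only the slice and .iloc cases are checked.

```python
import pandas as pd

s = pd.Series(range(5))  # default integer-based index


def raises_type_error(action):
    """Return True if calling `action` raises TypeError."""
    try:
        action()
    except TypeError:
        return True
    return False


# Float slice bounds on an integer index are rejected.
float_slice_fails = raises_type_error(lambda: s[3.5:4.5])

# A float position passed to .iloc is rejected as well.
float_iloc_fails = raises_type_error(lambda: s.iloc[3.0])
```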

Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat irregular timedelta-like indexing scheme, but the data is recorded as floats. This could for example be millisecond offsets.

    In [156]: dfir = pd.concat([pd.DataFrame(np.random.randn(5,2),
       .....:                                index=np.arange(5) * 250.0,
       .....:                                columns=list('AB')),
       .....:                   pd.DataFrame(np.random.randn(6,2),
       .....:                                index=np.arange(4,10) * 250.1,
       .....:                                columns=list('AB'))])
       .....:

    In [157]: dfir
    Out[157]:
                   A         B
    0.0     0.997289 -1.693316
    250.0  -0.179129 -1.598062
    500.0   0.936914  0.912560
    750.0  -1.003401  1.632781
    1000.0 -0.724626  0.178219
    1000.4  0.310610 -0.108002
    1250.5 -0.974226 -1.147708
    1500.6 -2.281374  0.760010
    1750.7 -0.742532  1.533318
    2000.8  2.495362 -0.432771
    2250.9 -0.068954  0.043520

Selection operations then will always work on a value basis, for all selection operators.

    In [158]: dfir[0:1000.4]
    Out[158]:
                   A         B
    0.0     0.997289 -1.693316
    250.0  -0.179129 -1.598062
    500.0   0.936914  0.912560
    750.0  -1.003401  1.632781
    1000.0 -0.724626  0.178219
    1000.4  0.310610 -0.108002

    In [159]: dfir.loc[0:1001,'A']
    Out[159]:
    0.0       0.997289
    250.0    -0.179129
    500.0     0.936914
    750.0    -1.003401
    1000.0   -0.724626
    1000.4    0.310610
    Name: A, dtype: float64

    In [160]: dfir.loc[1000.4]
    Out[160]:
    A    0.310610
    B   -0.108002
    Name: 1000.4, dtype: float64

You could retrieve the first 1 second (1000 ms) of data as such:

    In [161]: dfir[0:1000]
    Out[161]:
                   A         B
    0.0     0.997289 -1.693316
    250.0  -0.179129 -1.598062
    500.0   0.936914  0.912560
    750.0  -1.003401  1.632781
    1000.0 -0.724626  0.178219

If you need integer based selection, you should use iloc:

    In [162]: dfir.iloc[0:5]
    Out[162]:
                   A         B
    0.0     0.997289 -1.693316
    250.0  -0.179129 -1.598062
    500.0   0.936914  0.912560
    750.0  -1.003401  1.632781
    1000.0 -0.724626  0.178219
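For a reproducible version of this use case, the sketch below builds the same millisecond-offset index but fills the columns with deterministic placeholder values instead of random numbers (an assumption made purely so the results can be checked). It uses .loc for the value-based selections.

```python
import numpy as np
import pandas as pd

# Same irregular offsets as above: 0..1000 in steps of 250, then 1000.4 onward.
index = np.concatenate([np.arange(5) * 250.0, np.arange(4, 10) * 250.1])
dfir = pd.DataFrame({"A": np.arange(11), "B": np.arange(11) * 10}, index=index)

first_second = dfir.loc[0:1000]     # labels 0.0 through 1000.0, endpoints included
up_to_1001 = dfir.loc[0:1001, "A"]  # also picks up the 1000.4 label
first_five = dfir.iloc[0:5]         # positions 0 through 4, end excluded
```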

IntervalIndex

New in version 0.20.0.

IntervalIndex, together with its own dtype, interval, and the Interval scalar type, allows first-class support in pandas for interval notation.

The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut() and qcut().

Warning

These indexing behaviors are provisional and may change in a future version of pandas.

An IntervalIndex can be used in Series and in DataFrame as the index.

    In [163]: df = pd.DataFrame({'A': [1, 2, 3, 4]},
       .....:                   index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]))
       .....:

    In [164]: df
    Out[164]:
            A
    (0, 1]  1
    (1, 2]  2
    (2, 3]  3
    (3, 4]  4

Label based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.

    In [165]: df.loc[2]
    Out[165]:
    A    2
    Name: (1, 2], dtype: int64

    In [166]: df.loc[[2, 3]]
    Out[166]:
            A
    (1, 2]  2
    (2, 3]  3

If you select a label contained within an interval, this will also select the interval.

    In [167]: df.loc[2.5]
    Out[167]:
    A    3
    Name: (2, 3], dtype: int64

    In [168]: df.loc[[2.5, 3.5]]
    Out[168]:
            A
    (2, 3]  3
    (3, 4]  4
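The edge and interior lookups can be checked together in a short sketch. The final case, a point that falls outside every interval, is not covered by the text above; the assumption here is that it raises a KeyError, in line with other label lookups.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4]},
    index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]),
)

on_edge = df.loc[2]["A"]    # 2 is the closed right edge of (1, 2]
inside = df.loc[2.5]["A"]   # 2.5 falls inside (2, 3]

# 4.5 lies beyond the last interval (3, 4].
try:
    df.loc[4.5]
    outside_raises = False
except KeyError:
    outside_raises = True
```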

Interval and IntervalIndex are used by cut and qcut:

    In [169]: c = pd.cut(range(4), bins=2)

    In [170]: c
    Out[170]:
    [(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
    Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

    In [171]: c.categories
    Out[171]:
    IntervalIndex([(-0.003, 1.5], (1.5, 3.0]]
                  closed='right',
                  dtype='interval[float64]')

Furthermore, IntervalIndex allows one to bin other data with these same bins, with NaN representing a missing value similar to other dtypes.

    In [172]: pd.cut([0, 3, 5, 1], bins=c.categories)
    Out[172]:
    [(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
    Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]
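The reuse of bins can also be verified through the integer codes of the result, where -1 stands for the missing value. This is a minimal sketch; it passes np.arange(4) instead of range(4), which is an incidental choice for the example.

```python
import numpy as np
import pandas as pd

# Derive two equal-width bins from the first data set.
c = pd.cut(np.arange(4), bins=2)
bins = c.categories  # an IntervalIndex

# Bin a second data set with exactly the same intervals.
binned = pd.cut([0, 3, 5, 1], bins=bins)
codes = binned.codes.tolist()  # -1 marks a value outside every bin
```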

Generating Ranges of Intervals

If we need intervals on a regular frequency, we can use the interval_range() function to create an IntervalIndex using various combinations of start, end, and periods. The default frequency for interval_range is 1 for numeric intervals, and one calendar day for datetime-like intervals:

    In [173]: pd.interval_range(start=0, end=5)
    Out[173]:
    IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]]
                  closed='right',
                  dtype='interval[int64]')

    In [174]: pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4)
    Out[174]:
    IntervalIndex([(2017-01-01, 2017-01-02], (2017-01-02, 2017-01-03], (2017-01-03, 2017-01-04], (2017-01-04, 2017-01-05]]
                  closed='right',
                  dtype='interval[datetime64[ns]]')

    In [175]: pd.interval_range(end=pd.Timedelta('3 days'), periods=3)
    Out[175]:
    IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00], (2 days 00:00:00, 3 days 00:00:00]]
                  closed='right',
                  dtype='interval[timedelta64[ns]]')

The freq parameter can be used to specify non-default frequencies, and can utilize a variety of frequency aliases with datetime-like intervals:

    In [176]: pd.interval_range(start=0, periods=5, freq=1.5)
    Out[176]:
    IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]]
                  closed='right',
                  dtype='interval[float64]')

    In [177]: pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4, freq='W')
    Out[177]:
    IntervalIndex([(2017-01-01, 2017-01-08], (2017-01-08, 2017-01-15], (2017-01-15, 2017-01-22], (2017-01-22, 2017-01-29]]
                  closed='right',
                  dtype='interval[datetime64[ns]]')

    In [178]: pd.interval_range(start=pd.Timedelta('0 days'), periods=3, freq='9H')
    Out[178]:
    IntervalIndex([(0 days 00:00:00, 0 days 09:00:00], (0 days 09:00:00, 0 days 18:00:00], (0 days 18:00:00, 1 days 03:00:00]]
                  closed='right',
                  dtype='interval[timedelta64[ns]]')
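The effect of freq can be checked by looking at the outermost edges of the generated index. The sketch below does this for the numeric and the weekly example; the .left and .right attributes return the lower and upper bounds of the intervals.

```python
import pandas as pd

# Five numeric intervals, each 1.5 wide, starting at 0.
numeric = pd.interval_range(start=0, periods=5, freq=1.5)
numeric_span = (numeric.left[0], numeric.right[-1])

# Four one-week intervals starting on 2017-01-01.
weekly = pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
weekly_span = (weekly.left[0], weekly.right[-1])
```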

Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals are closed on the right side by default.

    In [179]: pd.interval_range(start=0, end=4, closed='both')
    Out[179]:
    IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]]
                  closed='both',
                  dtype='interval[int64]')

    In [180]: pd.interval_range(start=0, end=4, closed='neither')
    Out[180]:
    IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)]
                  closed='neither',
                  dtype='interval[int64]')
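What closed means in practice shows up when testing whether an endpoint belongs to an interval. The following sketch takes the first interval from each variant and checks both of its endpoints with the in operator; the dictionary layout is just for illustration.

```python
import pandas as pd

right = pd.interval_range(start=0, end=4)                      # (0, 1], (1, 2], ...
both = pd.interval_range(start=0, end=4, closed="both")        # [0, 1], [1, 2], ...
neither = pd.interval_range(start=0, end=4, closed="neither")  # (0, 1), (1, 2), ...

# For the first interval of each index, record whether its left endpoint (0)
# and its right endpoint (1) are members of the interval.
membership = {
    "right": (0 in right[0], 1 in right[0]),
    "both": (0 in both[0], 1 in both[0]),
    "neither": (0 in neither[0], 1 in neither[0]),
}
```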

New in version 0.23.0.

Specifying start, end, and periods will generate a range of evenly spaced intervals from start to end inclusively, with periods number of elements in the resulting IntervalIndex:

    In [181]: pd.interval_range(start=0, end=6, periods=4)
    Out[181]:
    IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]]
                  closed='right',
                  dtype='interval[float64]')

    In [182]: pd.interval_range(pd.Timestamp('2018-01-01'), pd.Timestamp('2018-02-28'), periods=3)
    Out[182]:
    IntervalIndex([(2018-01-01, 2018-01-20 08:00:00], (2018-01-20 08:00:00, 2018-02-08 16:00:00], (2018-02-08 16:00:00, 2018-02-28]]
                  closed='right',
                  dtype='interval[datetime64[ns]]')
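One way to think about the evenly spaced case is that periods intervals need periods + 1 break points. The sketch below builds the same numeric index by hand with np.linspace and IntervalIndex.from_breaks; this describes the observable result and is not a claim about how interval_range is implemented internally.

```python
import numpy as np
import pandas as pd

idx = pd.interval_range(start=0, end=6, periods=4)

# Four intervals need five evenly spaced break points: 0, 1.5, 3, 4.5, 6.
breaks = np.linspace(0, 6, num=5)
manual = pd.IntervalIndex.from_breaks(breaks)

lefts = idx.left.tolist()
rights = idx.right.tolist()
```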