Index objects

The pandas Index class and its subclasses can be viewed as implementing an ordered multiset. Duplicates are allowed. However, if you try to convert an Index object with duplicate entries into a set, an exception will be raised.
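As a minimal sketch of the "ordered multiset" behaviour, the following builds an Index with a repeated label and inspects it. The variable name labels and its contents are illustrative and not part of the original text.

```python
import pandas as pd

# Illustrative labels; 'a' is repeated on purpose.
labels = pd.Index(['b', 'a', 'c', 'a'])

# Insertion order is preserved and the duplicate is kept.
print(list(labels))

# is_unique reports whether every label occurs exactly once.
print(labels.is_unique)

# duplicated() flags repeats of an earlier label; unique() drops them
# while keeping first-occurrence order.
print(labels.duplicated())
print(labels.unique())
```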

Index also provides the infrastructure necessary for lookups, data alignment, and reindexing. The easiest way to create an Index directly is to pass a list or other sequence to Index:

    In [289]: index = pd.Index(['e', 'd', 'a', 'b'])

    In [290]: index
    Out[290]: Index(['e', 'd', 'a', 'b'], dtype='object')

    In [291]: 'd' in index
    Out[291]: True
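The lookup, alignment, and reindexing roles of an Index can be sketched roughly as below. The Series s1 and s2 and their values are made up for illustration. The sketch assumes that arithmetic between two Series aligns on the union of their labels and that reindex fills labels absent from the original with NaN.

```python
import pandas as pd

# Two hypothetical Series whose labels only partly overlap.
s1 = pd.Series([1, 2, 3], index=pd.Index(['a', 'b', 'c']))
s2 = pd.Series([10, 20, 30], index=pd.Index(['c', 'b', 'd']))

# Lookup: get_loc maps a label to its integer position.
pos = s1.index.get_loc('b')

# Alignment: values are matched by label, not by position.
# Labels present in only one operand produce NaN.
total = s1 + s2

# Reindexing: conform the data to a new set of labels.
reordered = s1.reindex(['c', 'a', 'z'])

print(pos)
print(total)
print(reordered)
```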

You can also pass a name to be stored in the index:

    In [292]: index = pd.Index(['e', 'd', 'a', 'b'], name='something')

    In [293]: index.name
    Out[293]: 'something'

The name, if set, will be shown in the console display:

    In [294]: index = pd.Index(list(range(5)), name='rows')

    In [295]: columns = pd.Index(['A', 'B', 'C'], name='cols')

    In [296]: df = pd.DataFrame(np.random.randn(5, 3), index=index, columns=columns)

    In [297]: df
    Out[297]:
    cols         A         B         C
    rows
    0     1.295989  0.185778  0.436259
    1     0.678101  0.311369 -0.528378
    2    -0.674808 -1.103529 -0.656157
    3     1.889957  2.076651 -1.102192
    4    -1.211795 -0.791746  0.634724

    In [298]: df['A']
    Out[298]:
    rows
    0    1.295989
    1    0.678101
    2   -0.674808
    3    1.889957
    4   -1.211795
    Name: A, dtype: float64
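Beyond the console display, the names travel with the data. The sketch below uses a small deterministic frame (the values from np.arange are placeholders, not the random data above) to show that a selected column keeps the row index name and that reset_index uses that name as the new column label.

```python
import numpy as np
import pandas as pd

index = pd.Index(list(range(3)), name='rows')
columns = pd.Index(['A', 'B'], name='cols')

# Placeholder values so the result is reproducible.
df = pd.DataFrame(np.arange(6).reshape(3, 2), index=index, columns=columns)

# Selecting a column returns a Series that shares the named row index.
col = df['A']
print(col.index.name)

# reset_index moves the row index into a column labelled with its name.
flat = df.reset_index()
print(list(flat.columns))
```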

Setting metadata

Indexes are “mostly immutable”, but it is possible to set and change their metadata, like the index name (or, for MultiIndex, levels and labels).

You can use the rename, set_names, set_levels, and set_labels methods to set these attributes directly. By default they return a copy; pass inplace=True to modify the object in place instead.

See Advanced Indexing for usage of MultiIndexes.

    In [299]: ind = pd.Index([1, 2, 3])

    In [300]: ind.rename("apple")
    Out[300]: Int64Index([1, 2, 3], dtype='int64', name='apple')

    In [301]: ind
    Out[301]: Int64Index([1, 2, 3], dtype='int64')

    In [302]: ind.set_names(["apple"], inplace=True)

    In [303]: ind.name = "bob"

    In [304]: ind
    Out[304]: Int64Index([1, 2, 3], dtype='int64', name='bob')
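To make "mostly immutable" concrete, here is a small sketch contrasting the two kinds of change. It assumes that item assignment on an Index raises TypeError, while the name attribute can be reassigned freely. The variable mutation_error is introduced only for this illustration.

```python
import pandas as pd

ind = pd.Index([1, 2, 3])

# rename returns a new Index; the original keeps its (empty) name.
renamed = ind.rename("apple")

# The values themselves cannot be modified in place.
mutation_error = None
try:
    ind[0] = 10
except TypeError as exc:
    mutation_error = exc

# Metadata, by contrast, can be set directly on the existing object.
ind.name = "bob"

print(renamed.name, ind.name)
print(type(mutation_error).__name__)
```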

set_names, set_levels, and set_labels also take an optional level argument:

    In [305]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])

    In [306]: index
    Out[306]:
    MultiIndex(levels=[[0, 1, 2], ['one', 'two']],
               labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
               names=['first', 'second'])

    In [307]: index.levels[1]
    Out[307]: Index(['one', 'two'], dtype='object', name='second')

    In [308]: index.set_levels(["a", "b"], level=1)
    Out[308]:
    MultiIndex(levels=[[0, 1, 2], ['a', 'b']],
               labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
               names=['first', 'second'])
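The level argument can be sketched with set_names and set_levels, which both accept either a level position or a level name. The new name 'position' is a made-up value for illustration. set_labels is left out of this sketch because later pandas releases renamed labels to codes, so that call depends on the installed version.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [range(3), ['one', 'two']], names=['first', 'second']
)

# Rename only the level currently called 'second'.
renamed = index.set_names('position', level='second')

# Replace the values of the level at position 1; the result is a copy.
relabeled = index.set_levels(['a', 'b'], level=1)

print(list(renamed.names))
print(relabeled.get_level_values(1).tolist())
print(list(index.names))  # the original is unchanged
```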

Set operations on Index objects

The two main operations are union (|) and intersection (&). These can be directly called as instance methods or used via overloaded operators. Difference is provided via the .difference() method.

    In [309]: a = pd.Index(['c', 'b', 'a'])

    In [310]: b = pd.Index(['c', 'e', 'd'])

    In [311]: a | b
    Out[311]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

    In [312]: a & b
    Out[312]: Index(['c'], dtype='object')

    In [313]: a.difference(b)
    Out[313]: Index(['a', 'b'], dtype='object')
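The same operations can be written with the instance methods, as in the sketch below. The method forms are used here on the assumption that they are the more portable spelling, since the | and & operators shown above reflect the pandas version this page was written against and do not perform set operations in later releases.

```python
import pandas as pd

a = pd.Index(['c', 'b', 'a'])
b = pd.Index(['c', 'e', 'd'])

# Labels found in either index.
either = a.union(b)

# Labels found in both indexes.
both = a.intersection(b)

# Labels found in a but not in b.
only_a = a.difference(b)

print(either.tolist())
print(both.tolist())
print(only_a.tolist())
```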

Also available is the symmetric_difference (^) operation, which returns elements that appear in either idx1 or idx2, but not in both. This is equivalent to the Index created by idx1.difference(idx2).union(idx2.difference(idx1)), with duplicates dropped.

    In [314]: idx1 = pd.Index([1, 2, 3, 4])

    In [315]: idx2 = pd.Index([2, 3, 4, 5])

    In [316]: idx1.symmetric_difference(idx2)
    Out[316]: Int64Index([1, 5], dtype='int64')

    In [317]: idx1 ^ idx2
    Out[317]: Int64Index([1, 5], dtype='int64')
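The stated equivalence can be checked directly. This sketch builds the result both ways and compares them; the names direct and composed are introduced only for illustration.

```python
import pandas as pd

idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([2, 3, 4, 5])

# One call to symmetric_difference ...
direct = idx1.symmetric_difference(idx2)

# ... versus the union of the two one-sided differences.
composed = idx1.difference(idx2).union(idx2.difference(idx1))

print(direct.tolist())
print(direct.equals(composed))
```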

Note: The resulting index from a set operation will be sorted in ascending order.

Missing values

Important: Although an Index can hold missing values (NaN), avoid them if you want to prevent unexpected results. For example, some operations exclude missing values implicitly.

Index.fillna fills missing values with a specified scalar value.

    In [318]: idx1 = pd.Index([1, np.nan, 3, 4])

    In [319]: idx1
    Out[319]: Float64Index([1.0, nan, 3.0, 4.0], dtype='float64')

    In [320]: idx1.fillna(2)
    Out[320]: Float64Index([1.0, 2.0, 3.0, 4.0], dtype='float64')

    In [321]: idx2 = pd.DatetimeIndex([pd.Timestamp('2011-01-01'), pd.NaT, pd.Timestamp('2011-01-03')])

    In [322]: idx2
    Out[322]: DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)

    In [323]: idx2.fillna(pd.Timestamp('2011-01-02'))
    Out[323]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)
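Before filling, it can help to detect whether an Index contains missing values at all. The following is a minimal sketch using hasnans, isna, and dropna alongside fillna. The fill value 2 is the same arbitrary choice as above.

```python
import numpy as np
import pandas as pd

idx = pd.Index([1, np.nan, 3, 4])

# hasnans is a quick check; isna() gives a boolean mask per element.
print(idx.hasnans)
print(idx.isna())

# dropna() removes the missing entries; fillna() replaces them.
dropped = idx.dropna()
filled = idx.fillna(2)

print(dropped.tolist())
print(filled.tolist())
```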