Hierarchical indexing (MultiIndex)

Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.

See the cookbook for some advanced strategies.

Creating a MultiIndex (hierarchical index) object

The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays), an array of tuples (using MultiIndex.from_tuples), or a crossed set of iterables (using MultiIndex.from_product). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes. They assume that NumPy and pandas have been imported as np and pd.

    In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
       ...:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
       ...:

    In [2]: tuples = list(zip(*arrays))

    In [3]: tuples
    Out[3]:
    [('bar', 'one'),
     ('bar', 'two'),
     ('baz', 'one'),
     ('baz', 'two'),
     ('foo', 'one'),
     ('foo', 'two'),
     ('qux', 'one'),
     ('qux', 'two')]

    In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

    In [5]: index
    Out[5]:
    MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
               labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
               names=['first', 'second'])

    In [6]: s = pd.Series(np.random.randn(8), index=index)

    In [7]: s
    Out[7]:
    first  second
    bar    one       0.469112
           two      -0.282863
    baz    one      -1.509059
           two      -1.135632
    foo    one       1.212112
           two      -0.173215
    qux    one       0.119209
           two      -1.044236
    dtype: float64
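
The session above uses from_tuples. As a minimal sketch of the other constructors named earlier (the shortened arrays and the variable names are illustrative assumptions), from_arrays and the plain Index constructor can be compared against it:

```python
import pandas as pd

# Shortened illustrative data; any equal-length label arrays would do.
arrays = [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]]
tuples = list(zip(*arrays))

# Three routes to the same hierarchical index.
from_arrays = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
from_tuples = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
from_index = pd.Index(tuples)  # a list of tuples is promoted to a MultiIndex

print(from_arrays.equals(from_tuples))
print(isinstance(from_index, pd.MultiIndex))
```

Both print statements should report True, since the constructors differ only in how the labels are supplied.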

When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product function:

    In [8]: iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]

    In [9]: pd.MultiIndex.from_product(iterables, names=['first', 'second'])
    Out[9]:
    MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
               labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
               names=['first', 'second'])
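
To make "every pairing" concrete, the following sketch compares from_product with an index built by hand from itertools.product. The variable names are illustrative assumptions, not part of the original example.

```python
import itertools

import pandas as pd

iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]
names = ["first", "second"]

# Cartesian product computed by pandas.
product_index = pd.MultiIndex.from_product(iterables, names=names)

# The same pairs generated explicitly, then wrapped with from_tuples.
pairs = list(itertools.product(*iterables))
manual_index = pd.MultiIndex.from_tuples(pairs, names=names)

print(len(product_index))                  # 4 * 2 = 8 entries
print(product_index.equals(manual_index))
```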

As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

    In [10]: arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
       ....:           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
       ....:

    In [11]: s = pd.Series(np.random.randn(8), index=arrays)

    In [12]: s
    Out[12]:
    bar  one   -0.861849
         two   -2.104569
    baz  one   -0.494929
         two    1.071804
    foo  one    0.721555
         two   -0.706771
    qux  one   -1.039575
         two    0.271860
    dtype: float64

    In [13]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

    In [14]: df
    Out[14]:
                    0         1         2         3
    bar one -0.424972  0.567020  0.276232 -1.087401
        two -0.673690  0.113648 -1.478427  0.524988
    baz one  0.404705  0.577046 -1.715002 -1.039268
        two -0.370647 -1.157892 -1.344312  0.844885
    foo one  1.075770 -0.109050  1.643563 -1.469388
        two  0.357021 -0.674600 -1.776904 -0.968914
    qux one -1.294524  0.413738  0.276662 -0.472035
        two -0.013960 -0.362543 -0.006154 -0.923061

All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

    In [15]: df.index.names
    Out[15]: FrozenList([None, None])
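
Level names can also be attached after construction. The sketch below is one possible approach using MultiIndex.set_names; the small integer frame and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

arrays = [np.array(["bar", "bar", "baz", "baz"]),
          np.array(["one", "two", "one", "two"])]

# Passing the arrays directly builds an unnamed two-level index.
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=arrays)
unnamed = list(df.index.names)

# set_names returns a new index carrying the given level names.
df.index = df.index.set_names(["first", "second"])
named = list(df.index.names)

print(unnamed)  # [None, None]
print(named)    # ['first', 'second']
```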

This index can back any axis of a pandas object, and the number of levels of the index is up to you:

    In [16]: df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)

    In [17]: df
    Out[17]:
    first        bar                 baz                 foo                 qux
    second       one       two       one       two       one       two       one       two
    A       0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309 -1.170299 -0.226169
    B       0.410835  0.813850  0.132003 -0.827317 -0.076467 -1.187678  1.130127 -1.436737
    C      -1.413681  1.607920  1.024180  0.569605  0.875906 -2.211372  0.974466 -2.006747

    In [18]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
    Out[18]:
    first              bar                 baz                 foo
    second             one       two       one       two       one       two
    first second
    bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
          two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
    baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
          two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
    foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
          two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441

We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes. How the index is displayed can be controlled with the display.multi_sparse option, set through pandas.set_option() or, temporarily, through pandas.option_context():

    In [19]: with pd.option_context('display.multi_sparse', False):
       ....:     df
       ....:
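
The effect of the option is easiest to see by rendering the same object twice. This sketch assumes a small frame with a hypothetical "value" column and uses to_string so that the output can be printed from a script:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame({"value": range(4)}, index=index)

# Sparse display: each outer label is printed once per group.
with pd.option_context("display.multi_sparse", True):
    sparse_text = df.to_string()

# Dense display: the outer label is repeated on every row.
with pd.option_context("display.multi_sparse", False):
    dense_text = df.to_string()

print(sparse_text)
print(dense_text)
```

Because option_context restores the previous setting on exit, the change does not leak into later output.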

It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:

    In [20]: pd.Series(np.random.randn(8), index=tuples)
    Out[20]:
    (bar, one)   -1.236269
    (bar, two)    0.896171
    (baz, one)   -0.487602
    (baz, two)   -0.082240
    (foo, one)   -2.182937
    (foo, two)    0.380396
    (qux, one)    0.084844
    (qux, two)    0.432390
    dtype: float64
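
The difference between tuples as atomic labels and a true MultiIndex can be checked directly. In this sketch the flat index is requested explicitly with tupleize_cols=False so that the behaviour does not depend on constructor defaults. The data and names are illustrative assumptions.

```python
import pandas as pd

tuples = [("bar", "one"), ("bar", "two"), ("baz", "one")]

# One level whose labels happen to be tuples.
flat_index = pd.Index(tuples, tupleize_cols=False)
flat = pd.Series([1.0, 2.0, 3.0], index=flat_index)

# Two levels built from the same tuples.
multi = pd.Series([1.0, 2.0, 3.0], index=pd.MultiIndex.from_tuples(tuples))

print(flat.index.nlevels, multi.index.nlevels)

# Partial selection by the outer label relies on the MultiIndex.
print(multi["bar"])
```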

The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.

Reconstructing the level labels

The method get_level_values will return a vector of the labels for each location at a particular level:

    In [21]: index.get_level_values(0)
    Out[21]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

    In [22]: index.get_level_values('second')
    Out[22]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
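
As a small self-contained sketch (the index here is an assumed, shortened one), a level can be addressed by position or by name, and the distinct labels follow from unique():

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)

by_position = index.get_level_values(0)   # one label per row
by_name = index.get_level_values("first")  # same level, addressed by name

print(list(by_position))       # ['bar', 'bar', 'baz', 'baz']
print(list(by_name.unique()))  # ['bar', 'baz']
```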

Basic indexing on axis with MultiIndex

One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

    In [23]: df['bar']
    Out[23]:
    second       one       two
    A       0.895717  0.805244
    B       0.410835  0.813850
    C      -1.413681  1.607920

    In [24]: df['bar', 'one']
    Out[24]:
    A    0.895717
    B    0.410835
    C   -1.413681
    Name: (bar, one), dtype: float64

    In [25]: df['bar']['one']
    Out[25]:
    A    0.895717
    B    0.410835
    C   -1.413681
    Name: one, dtype: float64

    In [26]: s['qux']
    Out[26]:
    one   -1.039575
    two    0.271860
    dtype: float64
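
The same selections can be reproduced with deterministic data, which makes the dropped levels easy to verify. The integer frame below is an assumption for illustration and is not the random frame used above.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=["A", "B", "C"], columns=columns)

sub = df["bar"]             # partial label: the 'first' level is dropped
col = df["bar", "one"]      # full label: a Series named by the tuple
chained = df["bar"]["one"]  # two steps: a Series named by the inner label

print(list(sub.columns))  # ['one', 'two']
print(col.name)           # ('bar', 'one')
print(chained.name)       # one
```

The two single-column results hold the same values and differ only in their name.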

See Cross-section with hierarchical index for how to select on a deeper level.

Defined Levels

The repr of a MultiIndex shows all the defined levels of an index, even if they are not actually used. You may notice this when slicing an index. For example:

    In [27]: df.columns  # original MultiIndex
    Out[27]:
    MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
               labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
               names=['first', 'second'])

    In [28]: df[['foo','qux']].columns  # sliced
    Out[28]:
    MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
               labels=[[2, 2, 3, 3], [0, 1, 0, 1]],
               names=['first', 'second'])

This is done to avoid recomputing the levels, which keeps slicing highly performant. If you want to see only the labels that are actually in use, you can inspect the values attribute or call the MultiIndex.get_level_values() method for a specific level:

    In [29]: df[['foo','qux']].columns.values
    Out[29]: array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')], dtype=object)

    # for a specific level
    In [30]: df[['foo','qux']].columns.get_level_values(0)
    Out[30]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

To reconstruct the MultiIndex with only the used levels, the remove_unused_levels method may be used.

New in version 0.20.0.

    In [31]: df[['foo','qux']].columns.remove_unused_levels()
    Out[31]:
    MultiIndex(levels=[['foo', 'qux'], ['one', 'two']],
               labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
               names=['first', 'second'])
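
A short sketch of the same idea through the levels attribute, assuming a zero-filled frame built only for illustration: the sliced index typically still carries all four outer labels, while the trimmed index keeps only the two in use.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(np.zeros((3, 8)), index=["A", "B", "C"], columns=columns)

sliced = df[["foo", "qux"]].columns      # levels are not recomputed on slicing
trimmed = sliced.remove_unused_levels()  # levels rebuilt from the labels in use

print(list(sliced.levels[0]))
print(list(trimmed.levels[0]))
print(trimmed.equals(sliced))  # the labels themselves are unchanged
```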

Data alignment and using reindex

Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:

    In [32]: s + s[:-2]
    Out[32]:
    bar  one   -1.723698
         two   -4.209138
    baz  one   -0.989859
         two    2.143608
    foo  one    1.443110
         two   -1.413542
    qux  one         NaN
         two         NaN
    dtype: float64

    In [33]: s + s[::2]
    Out[33]:
    bar  one   -1.723698
         two         NaN
    baz  one   -0.989859
         two         NaN
    foo  one    1.443110
         two         NaN
    qux  one   -2.079150
         two         NaN
    dtype: float64
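
With fixed values the alignment is easy to follow by hand. This sketch assumes a six-element Series and uses iloc for the positional slice. Labels missing from one operand come out as NaN.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz", "foo"], ["one", "two"]])
s = pd.Series(np.arange(6, dtype=float), index=index)

# The right-hand operand lacks the two 'foo' rows.
total = s + s.iloc[:-2]

print(total.loc[("baz", "two")])  # 3.0 + 3.0 = 6.0
print(total.loc[("foo", "one")])  # nan: no matching label on the right
```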

reindex can be called with another MultiIndex, or even a list or array of tuples:

    In [34]: s.reindex(index[:3])
    Out[34]:
    first  second
    bar    one      -0.861849
           two      -2.104569
    baz    one      -0.494929
    dtype: float64

    In [35]: s.reindex([('foo', 'two'), ('bar', 'one'), ('qux', 'one'), ('baz', 'one')])
    Out[35]:
    foo  two   -0.706771
    bar  one   -0.861849
    qux  one   -1.039575
    baz  one   -0.494929
    dtype: float64
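
A deterministic sketch of reindex with a list of tuples, using assumed data: existing labels are reordered, and a label that is absent from the original index yields NaN.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Reorder existing labels.
reordered = s.reindex([("baz", "two"), ("bar", "one")])

# ('qux', 'one') is not in the original index, so its value is missing.
with_missing = s.reindex([("bar", "one"), ("qux", "one")])

print(reordered.tolist())  # [4.0, 1.0]
print(with_missing.isna().tolist())  # [False, True]
```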