具有层次索引的高级索引

具有层次索引的高级索引

Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we’ve made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:

In [36]: df = df.T
In [37]: df
Out[37]: 
                     A         B         C
first second                              
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747
In [38]: df.loc[('bar', 'two'),]
Out[38]: 
A    0.805244
B    0.813850
C    1.607920
Name: (bar, two), dtype: float64

Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to ambiguity in general.

If you also want to index a specific column with .loc, you must use a tuple like this:

In [39]: df.loc[('bar', 'two'), 'A']
Out[39]: 0.80524402538637851

You don’t have to specify all levels of the MultiIndex by passing only the first elements of the tuple. For example, you can use “partial” indexing to get all elements with bar in the first level as follows:

df.loc[‘bar’]

This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).

“Partial” slicing also works quite nicely.

In [40]: df.loc['baz':'foo']
Out[40]: 
                     A         B         C
first second                              
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372

You can slice with a ‘range’ of values, by providing a slice of tuples.

In [41]: df.loc[('baz', 'two'):('qux', 'one')]
Out[41]: 
                     A         B         C
first second                              
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
In [42]: df.loc[('baz', 'two'):'foo']
Out[42]: 
                     A         B         C
first second                              
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372

Passing a list of labels or tuples works similar to reindexing:

In [43]: df.loc[[('bar', 'two'), ('qux', 'one')]]
Out[43]: 
                     A         B         C
first second                              
bar   two     0.805244  0.813850  1.607920
qux   one    -1.170299  1.130127  0.974466

Note: It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).

Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refer to several values within a level:

In [44]: s = pd.Series([1, 2, 3, 4, 5, 6],
   ....:               index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]))
   ....: 
In [45]: s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
Out[45]: 
A  c    1
B  d    5
dtype: int64
In [46]: s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
Out[46]: 
A  c    1
   d    2
B  c    4
   d    5
dtype: int64

Using slicers

You can slice a MultiIndex by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.

You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels, they will be implied as slice(None).

As usual, both sides of the slicers are included as this is label indexing.

警告

You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be mis-interpreted as indexing both axes, rather than into say the MultiIndex for the rows.

You should do this:


df.loc[(slice('A1','A3'),.....), :]

You should not do this:


df.loc[(slice('A1','A3'),.....)]

In [47]: def mklbl(prefix,n):
   ....:     return ["%s%s" % (prefix,i)  for i in range(n)]
   ....: 
In [48]: miindex = pd.MultiIndex.from_product([mklbl('A',4),
   ....:                                       mklbl('B',2),
   ....:                                       mklbl('C',4),
   ....:                                       mklbl('D',2)])
   ....: 
In [49]: micolumns = pd.MultiIndex.from_tuples([('a','foo'),('a','bar'),
   ....:                                        ('b','foo'),('b','bah')],
   ....:                                       names=['lvl0', 'lvl1'])
   ....: 
In [50]: dfmi = pd.DataFrame(np.arange(len(miindex)*len(micolumns)).reshape((len(miindex),len(micolumns))),
   ....:                     index=miindex,
   ....:                     columns=micolumns).sort_index().sort_index(axis=1)
   ....: 
In [51]: dfmi
Out[51]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0   25   24   27   26
...          ...  ...  ...  ...
A3 B1 C0 D1  229  228  231  230
      C1 D0  233  232  235  234
         D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254
[64 rows x 4 columns]

Basic multi-index slicing using slices, lists, and labels.

In [52]: dfmi.loc[(slice('A1','A3'), slice(None), ['C1', 'C3']), :]
Out[52]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
      C3 D0  121  120  123  122
...          ...  ...  ...  ...
A3 B0 C1 D1  205  204  207  206
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254
[24 rows x 4 columns]

You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).

In [53]: idx = pd.IndexSlice
In [54]: dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[54]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
         D1   44   46
      C3 D0   56   58
...          ...  ...
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254
[32 rows x 2 columns]

It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [55]: dfmi.loc['A1', (slice(None), 'foo')]
Out[55]: 
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
      D1   84   86
   C3 D0   88   90
...       ...  ...
B1 C0 D1  100  102
   C1 D0  104  106
      D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
      D1  124  126
[16 rows x 2 columns]
In [56]: dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[56]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
         D1   44   46
      C3 D0   56   58
...          ...  ...
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254
[32 rows x 2 columns]

Using a boolean indexer you can provide selection related to the values.

In [57]: mask = dfmi[('a', 'foo')] > 200
In [58]: dfmi.loc[idx[mask, :, ['C1', 'C3']], idx[:, 'foo']]
Out[58]: 
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

In [59]: dfmi.loc(axis=0)[:, :, ['C1', 'C3']]
Out[59]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
         D1   45   44   47   46
      C3 D0   57   56   59   58
...          ...  ...  ...  ...
A3 B0 C1 D1  205  204  207  206
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254
[32 rows x 4 columns]

Furthermore you can set the values using the following methods.

In [60]: df2 = dfmi.copy()
In [61]: df2.loc(axis=0)[:, :, ['C1', 'C3']] = -10
In [62]: df2
Out[62]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0  -10  -10  -10  -10
...          ...  ...  ...  ...
A3 B1 C0 D1  229  228  231  230
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
[64 rows x 4 columns]

You can use a right-hand-side of an alignable object as well.

In [63]: df2 = dfmi.copy()
In [64]: df2.loc[idx[:, :, ['C1', 'C3']], :] = df2 * 1000
In [65]: df2
Out[65]: 
lvl0              a               b        
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
         D1      21      20      23      22
      C3 D0   25000   24000   27000   26000
...             ...     ...     ...     ...
A3 B1 C0 D1     229     228     231     230
      C1 D0  233000  232000  235000  234000
         D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000
         D1  253000  252000  255000  254000
[64 rows x 4 columns]

Cross-section

The xs method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.

In [66]: df
Out[66]: 
                     A         B         C
first second                              
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747
In [67]: df.xs('one', level='second')
Out[67]: 
              A         B         C
first                              
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466

# using the slicers
In [68]: df.loc[(slice(None),'one'),:]
Out[68]: 
                     A         B         C
first second                              
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466

You can also select on the columns with xs(), by providing the axis argument.

In [69]: df = df.T
In [70]: df.xs('one', level='second', axis=1)
Out[70]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466

# using the slicers
In [71]: df.loc[:,(slice(None),'one')]
Out[71]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

xs() also allows selection with multiple keys.

In [72]: df.xs(('one', 'bar'), level=('second', 'first'), axis=1)
Out[72]: 
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681
# using the slicers
In [73]: df.loc[:,('bar','one')]
Out[73]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

You can pass drop_level=False to xs() to retain the level that was selected.

In [74]: df.xs('one', level='second', axis=1, drop_level=False)
Out[74]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

Compare the above with the result using drop_level=True (the default value).

In [75]: df.xs('one', level='second', axis=1, drop_level=True)
Out[75]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466

Advanced reindexing and alignment

The parameter level has been added to the reindex and align methods of pandas objects. This is useful to broadcast values across a level. For instance:

In [76]: midx = pd.MultiIndex(levels=[['zero', 'one'], ['x','y']],
   ....:                      labels=[[1,1,0,0],[1,0,1,0]])
   ....: 
In [77]: df = pd.DataFrame(np.random.randn(4,2), index=midx)
In [78]: df
Out[78]: 
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520
In [79]: df2 = df.mean(level=0)
In [80]: df2
Out[80]: 
             0         1
one   1.060074 -0.109716
zero  1.271532  0.713416
In [81]: df2.reindex(df.index, level=0)
Out[81]: 
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416
# aligning
In [82]: df_aligned, df2_aligned = df.align(df2, level=0)
In [83]: df_aligned
Out[83]: 
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520
In [84]: df2_aligned
Out[84]: 
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

Swapping levels with swaplevel()

The swaplevel function can switch the order of two levels:

In [85]: df[:5]
Out[85]: 
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520
In [86]: df[:5].swaplevel(0, 1, axis=0)
Out[86]: 
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520

Reordering levels with reorder_levels()

The reorder_levels function generalizes the swaplevel function, allowing you to permute the hierarchical index levels in one step:

In [87]: df[:5].reorder_levels([1,0], axis=0)
Out[87]: 
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520