MultiIndex / advanced indexing

This section covers indexing with a MultiIndex and other advanced indexing features.

See Indexing and Selecting Data for general indexing documentation.

Warning

Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.

See the cookbook for some advanced strategies.

Hierarchical indexing (MultiIndex)

Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.

See the cookbook for some advanced strategies.

Changed in version 0.24.0: MultiIndex.labels has been renamed to MultiIndex.codes and MultiIndex.set_labels to MultiIndex.set_codes.

Creating a MultiIndex (hierarchical index) object

The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.

In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ...:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ...:

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]:
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [5]: index
Out[5]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]:
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64

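The list-of-arrays and Index-constructor routes mentioned above are not shown in the session; the following is a minimal sketch of both, reusing the arrays and tuples variables defined there.

# Sketch (not part of the original session): the same MultiIndex can be built
# from the list of arrays, or by passing the list of tuples to the Index constructor.
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))

mi_from_arrays = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

# pd.Index returns a MultiIndex when handed a list of tuples (names default to None)
mi_from_index = pd.Index(tuples)

assert isinstance(mi_from_index, pd.MultiIndex)
assert mi_from_arrays.equals(mi_from_index)   # same tuples; only the names differ
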
When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:

In [8]: iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]

In [9]: pd.MultiIndex.from_product(iterables, names=['first', 'second'])
Out[9]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.from_frame(). This is a complementary method to MultiIndex.to_frame().

New in version 0.24.0.

In [10]: df = pd.DataFrame([['bar', 'one'], ['bar', 'two'],
   ....:                    ['foo', 'one'], ['foo', 'two']],
   ....:                   columns=['first', 'second'])
   ....:

In [11]: pd.MultiIndex.from_frame(df)
Out[11]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])

As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

  1. In [12]: arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
  2. ....: np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
  3. ....:
  4.  
  5. In [13]: s = pd.Series(np.random.randn(8), index=arrays)
  6.  
  7. In [14]: s
  8. Out[14]:
  9. bar one -0.861849
  10. two -2.104569
  11. baz one -0.494929
  12. two 1.071804
  13. foo one 0.721555
  14. two -0.706771
  15. qux one -1.039575
  16. two 0.271860
  17. dtype: float64
  18.  
  19. In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
  20.  
  21. In [16]: df
  22. Out[16]:
  23. 0 1 2 3
  24. bar one -0.424972 0.567020 0.276232 -1.087401
  25. two -0.673690 0.113648 -1.478427 0.524988
  26. baz one 0.404705 0.577046 -1.715002 -1.039268
  27. two -0.370647 -1.157892 -1.344312 0.844885
  28. foo one 1.075770 -0.109050 1.643563 -1.469388
  29. two 0.357021 -0.674600 -1.776904 -0.968914
  30. qux one -1.294524 0.413738 0.276662 -0.472035
  31. two -0.013960 -0.362543 -0.006154 -0.923061

All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

In [17]: df.index.names
Out[17]: FrozenList([None, None])

This index can back any axis of a pandas object, and the number of levels of the index is up to you:

  1. In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
  2.  
  3. In [19]: df
  4. Out[19]:
  5. first bar baz foo qux
  6. second one two one two one two one two
  7. A 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299 -0.226169
  8. B 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127 -1.436737
  9. C -1.413681 1.607920 1.024180 0.569605 0.875906 -2.211372 0.974466 -2.006747
  10.  
  11. In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
  12. Out[20]:
  13. first bar baz foo
  14. second one two one two one two
  15. first second
  16. bar one -0.410001 -0.078638 0.545952 -1.219217 -1.226825 0.769804
  17. two -1.281247 -0.727707 -0.121306 -0.097883 0.695775 0.341734
  18. baz one 0.959726 -1.110336 -0.619976 0.149748 -0.732339 0.687738
  19. two 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849
  20. foo one -0.954208 1.462696 -1.743161 -0.826591 -0.345352 1.314232
  21. two 0.690579 0.995761 2.396780 0.014871 3.357427 -0.317441

We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes. Note that how the index is displayed can be controlled using the multi_sparse option in pandas.set_option():

In [21]: with pd.option_context('display.multi_sparse', False):
   ....:     df
   ....:

It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:

In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]:
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64

The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.
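
As a rough sketch of that last point (not from the original examples): a MultiIndex is usually produced from ordinary columns after loading, either with DataFrame.set_index() on several columns or with the index_col argument of the readers. The column names and file name below are made up for illustration.

import pandas as pd

# rows as they might arrive from a CSV file or a database query
raw = pd.DataFrame({'first': ['bar', 'bar', 'baz', 'baz'],
                    'second': ['one', 'two', 'one', 'two'],
                    'value': [1.0, 2.0, 3.0, 4.0]})

# promote the two label columns to a hierarchical index
indexed = raw.set_index(['first', 'second'])

# the readers can do the same in one step, e.g.
# pd.read_csv('data.csv', index_col=['first', 'second'])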

Reconstructing the level labels

The method get_level_values() will return a vector of the labels for each location at a particular level:

In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [24]: index.get_level_values('second')
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')

Basic indexing on axis with MultiIndex

One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

  1. In [25]: df['bar']
  2. Out[25]:
  3. second one two
  4. A 0.895717 0.805244
  5. B 0.410835 0.813850
  6. C -1.413681 1.607920
  7.  
  8. In [26]: df['bar', 'one']
  9. Out[26]:
  10. A 0.895717
  11. B 0.410835
  12. C -1.413681
  13. Name: (bar, one), dtype: float64
  14.  
  15. In [27]: df['bar']['one']
  16. Out[27]:
  17. A 0.895717
  18. B 0.410835
  19. C -1.413681
  20. Name: one, dtype: float64
  21.  
  22. In [28]: s['qux']
  23. Out[28]:
  24. one -1.039575
  25. two 0.271860
  26. dtype: float64

See Cross-section with hierarchical index for how to select on a deeper level.

Defined levels

The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:

In [29]: df.columns.levels  # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [30]: df[['foo', 'qux']].columns.levels  # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.

In [31]: df[['foo', 'qux']].columns.to_numpy()
Out[31]:
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

# for a specific level
In [32]: df[['foo', 'qux']].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.

New in version 0.20.0.

In [33]: new_mi = df[['foo', 'qux']].columns.remove_unused_levels()

In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])

Data alignment and using reindex

Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:

  1. In [35]: s + s[:-2]
  2. Out[35]:
  3. bar one -1.723698
  4. two -4.209138
  5. baz one -0.989859
  6. two 2.143608
  7. foo one 1.443110
  8. two -1.413542
  9. qux one NaN
  10. two NaN
  11. dtype: float64
  12.  
  13. In [36]: s + s[::2]
  14. Out[36]:
  15. bar one -1.723698
  16. two NaN
  17. baz one -0.989859
  18. two NaN
  19. foo one 1.443110
  20. two NaN
  21. qux one -2.079150
  22. two NaN
  23. dtype: float64

The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array of tuples:

In [37]: s.reindex(index[:3])
Out[37]:
first  second
bar    one      -0.861849
       two      -2.104569
baz    one      -0.494929
dtype: float64

In [38]: s.reindex([('foo', 'two'), ('bar', 'one'), ('qux', 'one'), ('baz', 'one')])
Out[38]:
foo  two   -0.706771
bar  one   -0.861849
qux  one   -1.039575
baz  one   -0.494929
dtype: float64

Advanced indexing with hierarchical index

Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we’ve made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:

  1. In [39]: df = df.T
  2.  
  3. In [40]: df
  4. Out[40]:
  5. A B C
  6. first second
  7. bar one 0.895717 0.410835 -1.413681
  8. two 0.805244 0.813850 1.607920
  9. baz one -1.206412 0.132003 1.024180
  10. two 2.565646 -0.827317 0.569605
  11. foo one 1.431256 -0.076467 0.875906
  12. two 1.340309 -1.187678 -2.211372
  13. qux one -1.170299 1.130127 0.974466
  14. two -0.226169 -1.436737 -2.006747
  15.  
  16. In [41]: df.loc[('bar', 'two')]
  17. Out[41]:
  18. A 0.805244
  19. B 0.813850
  20. C 1.607920
  21. Name: (bar, two), dtype: float64

Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to ambiguity in general.

If you also want to index a specific column with .loc, you must use a tuple like this:

In [42]: df.loc[('bar', 'two'), 'A']
Out[42]: 0.8052440253863785

You don’t have to specify all levels of the MultiIndex; by passing only the first elements of the tuple, you can use “partial” indexing to get all elements with bar in the first level, as follows:

df.loc['bar']

This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).

“Partial” slicing also works quite nicely.

In [43]: df.loc['baz':'foo']
Out[43]:
                     A         B         C
first second
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372

You can slice with a ‘range’ of values, by providing a slice of tuples.

  1. In [44]: df.loc[('baz', 'two'):('qux', 'one')]
  2. Out[44]:
  3. A B C
  4. first second
  5. baz two 2.565646 -0.827317 0.569605
  6. foo one 1.431256 -0.076467 0.875906
  7. two 1.340309 -1.187678 -2.211372
  8. qux one -1.170299 1.130127 0.974466
  9.  
  10. In [45]: df.loc[('baz', 'two'):'foo']
  11. Out[45]:
  12. A B C
  13. first second
  14. baz two 2.565646 -0.827317 0.569605
  15. foo one 1.431256 -0.076467 0.875906
  16. two 1.340309 -1.187678 -2.211372

Passing a list of labels or tuples works similar to reindexing:

In [46]: df.loc[[('bar', 'two'), ('qux', 'one')]]
Out[46]:
                     A         B         C
first second
bar   two     0.805244  0.813850  1.607920
qux   one    -1.170299  1.130127  0.974466

Note

It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).

Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level:

  1. In [47]: s = pd.Series([1, 2, 3, 4, 5, 6],
  2. ....: index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]))
  3. ....:
  4.  
  5. In [48]: s.loc[[("A", "c"), ("B", "d")]] # list of tuples
  6. Out[48]:
  7. A c 1
  8. B d 5
  9. dtype: int64
  10.  
  11. In [49]: s.loc[(["A", "B"], ["c", "d"])] # tuple of lists
  12. Out[49]:
  13. A c 1
  14. d 2
  15. B c 4
  16. d 5
  17. dtype: int64

Using slicers

You can slice a MultiIndex by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.

You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None).

As usual, both sides of the slicers are included as this is label indexing.

Warning

You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be mis-interpreted as indexing both axes, rather than into say the MultiIndex for the rows.

You should do this:

df.loc[(slice('A1', 'A3'), ...), :]  # noqa: E999

You should not do this:

df.loc[(slice('A1', 'A3'), ...)]  # noqa: E999

  1. In [50]: def mklbl(prefix, n):
  2. ....: return ["%s%s" % (prefix, i) for i in range(n)]
  3. ....:
  4.  
  5. In [51]: miindex = pd.MultiIndex.from_product([mklbl('A', 4),
  6. ....: mklbl('B', 2),
  7. ....: mklbl('C', 4),
  8. ....: mklbl('D', 2)])
  9. ....:
  10.  
  11. In [52]: micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
  12. ....: ('b', 'foo'), ('b', 'bah')],
  13. ....: names=['lvl0', 'lvl1'])
  14. ....:
  15.  
  16. In [53]: dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
  17. ....: .reshape((len(miindex), len(micolumns))),
  18. ....: index=miindex,
  19. ....: columns=micolumns).sort_index().sort_index(axis=1)
  20. ....:
  21.  
  22. In [54]: dfmi
  23. Out[54]:
  24. lvl0 a b
  25. lvl1 bar foo bah foo
  26. A0 B0 C0 D0 1 0 3 2
  27. D1 5 4 7 6
  28. C1 D0 9 8 11 10
  29. D1 13 12 15 14
  30. C2 D0 17 16 19 18
  31. ... ... ... ... ...
  32. A3 B1 C1 D1 237 236 239 238
  33. C2 D0 241 240 243 242
  34. D1 245 244 247 246
  35. C3 D0 249 248 251 250
  36. D1 253 252 255 254
  37.  
  38. [64 rows x 4 columns]

Basic MultiIndex slicing using slices, lists, and labels.

  1. In [55]: dfmi.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :]
  2. Out[55]:
  3. lvl0 a b
  4. lvl1 bar foo bah foo
  5. A1 B0 C1 D0 73 72 75 74
  6. D1 77 76 79 78
  7. C3 D0 89 88 91 90
  8. D1 93 92 95 94
  9. B1 C1 D0 105 104 107 106
  10. ... ... ... ... ...
  11. A3 B0 C3 D1 221 220 223 222
  12. B1 C1 D0 233 232 235 234
  13. D1 237 236 239 238
  14. C3 D0 249 248 251 250
  15. D1 253 252 255 254
  16.  
  17. [24 rows x 4 columns]

You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).

  1. In [56]: idx = pd.IndexSlice
  2.  
  3. In [57]: dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
  4. Out[57]:
  5. lvl0 a b
  6. lvl1 foo foo
  7. A0 B0 C1 D0 8 10
  8. D1 12 14
  9. C3 D0 24 26
  10. D1 28 30
  11. B1 C1 D0 40 42
  12. ... ... ...
  13. A3 B0 C3 D1 220 222
  14. B1 C1 D0 232 234
  15. D1 236 238
  16. C3 D0 248 250
  17. D1 252 254
  18.  
  19. [32 rows x 2 columns]

It is possible to perform quite complicated selections using this method on multiple axes at the same time.

  1. In [58]: dfmi.loc['A1', (slice(None), 'foo')]
  2. Out[58]:
  3. lvl0 a b
  4. lvl1 foo foo
  5. B0 C0 D0 64 66
  6. D1 68 70
  7. C1 D0 72 74
  8. D1 76 78
  9. C2 D0 80 82
  10. ... ... ...
  11. B1 C1 D1 108 110
  12. C2 D0 112 114
  13. D1 116 118
  14. C3 D0 120 122
  15. D1 124 126
  16.  
  17. [16 rows x 2 columns]
  18.  
  19. In [59]: dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
  20. Out[59]:
  21. lvl0 a b
  22. lvl1 foo foo
  23. A0 B0 C1 D0 8 10
  24. D1 12 14
  25. C3 D0 24 26
  26. D1 28 30
  27. B1 C1 D0 40 42
  28. ... ... ...
  29. A3 B0 C3 D1 220 222
  30. B1 C1 D0 232 234
  31. D1 236 238
  32. C3 D0 248 250
  33. D1 252 254
  34.  
  35. [32 rows x 2 columns]

Using a boolean indexer you can provide selection related to the values.

In [60]: mask = dfmi[('a', 'foo')] > 200

In [61]: dfmi.loc[idx[mask, :, ['C1', 'C3']], idx[:, 'foo']]
Out[61]:
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

  1. In [62]: dfmi.loc(axis=0)[:, :, ['C1', 'C3']]
  2. Out[62]:
  3. lvl0 a b
  4. lvl1 bar foo bah foo
  5. A0 B0 C1 D0 9 8 11 10
  6. D1 13 12 15 14
  7. C3 D0 25 24 27 26
  8. D1 29 28 31 30
  9. B1 C1 D0 41 40 43 42
  10. ... ... ... ... ...
  11. A3 B0 C3 D1 221 220 223 222
  12. B1 C1 D0 233 232 235 234
  13. D1 237 236 239 238
  14. C3 D0 249 248 251 250
  15. D1 253 252 255 254
  16.  
  17. [32 rows x 4 columns]

Furthermore, you can set the values using the following methods.

  1. In [63]: df2 = dfmi.copy()
  2.  
  3. In [64]: df2.loc(axis=0)[:, :, ['C1', 'C3']] = -10
  4.  
  5. In [65]: df2
  6. Out[65]:
  7. lvl0 a b
  8. lvl1 bar foo bah foo
  9. A0 B0 C0 D0 1 0 3 2
  10. D1 5 4 7 6
  11. C1 D0 -10 -10 -10 -10
  12. D1 -10 -10 -10 -10
  13. C2 D0 17 16 19 18
  14. ... ... ... ... ...
  15. A3 B1 C1 D1 -10 -10 -10 -10
  16. C2 D0 241 240 243 242
  17. D1 245 244 247 246
  18. C3 D0 -10 -10 -10 -10
  19. D1 -10 -10 -10 -10
  20.  
  21. [64 rows x 4 columns]

You can use an alignable object as the right-hand side as well.

  1. In [66]: df2 = dfmi.copy()
  2.  
  3. In [67]: df2.loc[idx[:, :, ['C1', 'C3']], :] = df2 * 1000
  4.  
  5. In [68]: df2
  6. Out[68]:
  7. lvl0 a b
  8. lvl1 bar foo bah foo
  9. A0 B0 C0 D0 1 0 3 2
  10. D1 5 4 7 6
  11. C1 D0 9000 8000 11000 10000
  12. D1 13000 12000 15000 14000
  13. C2 D0 17 16 19 18
  14. ... ... ... ... ...
  15. A3 B1 C1 D1 237000 236000 239000 238000
  16. C2 D0 241 240 243 242
  17. D1 245 244 247 246
  18. C3 D0 249000 248000 251000 250000
  19. D1 253000 252000 255000 254000
  20.  
  21. [64 rows x 4 columns]

Cross-section

The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.

  1. In [69]: df
  2. Out[69]:
  3. A B C
  4. first second
  5. bar one 0.895717 0.410835 -1.413681
  6. two 0.805244 0.813850 1.607920
  7. baz one -1.206412 0.132003 1.024180
  8. two 2.565646 -0.827317 0.569605
  9. foo one 1.431256 -0.076467 0.875906
  10. two 1.340309 -1.187678 -2.211372
  11. qux one -1.170299 1.130127 0.974466
  12. two -0.226169 -1.436737 -2.006747
  13.  
  14. In [70]: df.xs('one', level='second')
  15. Out[70]:
  16. A B C
  17. first
  18. bar 0.895717 0.410835 -1.413681
  19. baz -1.206412 0.132003 1.024180
  20. foo 1.431256 -0.076467 0.875906
  21. qux -1.170299 1.130127 0.974466
  1. # using the slicers
  2. In [71]: df.loc[(slice(None), 'one'), :]
  3. Out[71]:
  4. A B C
  5. first second
  6. bar one 0.895717 0.410835 -1.413681
  7. baz one -1.206412 0.132003 1.024180
  8. foo one 1.431256 -0.076467 0.875906
  9. qux one -1.170299 1.130127 0.974466

You can also select on the columns with xs, by providing the axis argument.

  1. In [72]: df = df.T
  2.  
  3. In [73]: df.xs('one', level='second', axis=1)
  4. Out[73]:
  5. first bar baz foo qux
  6. A 0.895717 -1.206412 1.431256 -1.170299
  7. B 0.410835 0.132003 -0.076467 1.130127
  8. C -1.413681 1.024180 0.875906 0.974466
  1. # using the slicers
  2. In [74]: df.loc[:, (slice(None), 'one')]
  3. Out[74]:
  4. first bar baz foo qux
  5. second one one one one
  6. A 0.895717 -1.206412 1.431256 -1.170299
  7. B 0.410835 0.132003 -0.076467 1.130127
  8. C -1.413681 1.024180 0.875906 0.974466

xs also allows selection with multiple keys.

In [75]: df.xs(('one', 'bar'), level=('second', 'first'), axis=1)
Out[75]:
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681

# using the slicers
In [76]: df.loc[:, ('bar', 'one')]
Out[76]:
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

You can pass drop_level=False to xs to retain the level that was selected.

In [77]: df.xs('one', level='second', axis=1, drop_level=False)
Out[77]:
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

Compare the above with the result using drop_level=True (the default value).

In [78]: df.xs('one', level='second', axis=1, drop_level=True)
Out[78]:
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466

Advanced reindexing and alignment

Using the parameter level in the reindex() and align() methods of pandas objects is useful to broadcast values across a level. For instance:

  1. In [79]: midx = pd.MultiIndex(levels=[['zero', 'one'], ['x', 'y']],
  2. ....: codes=[[1, 1, 0, 0], [1, 0, 1, 0]])
  3. ....:
  4.  
  5. In [80]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)
  6.  
  7. In [81]: df
  8. Out[81]:
  9. 0 1
  10. one y 1.519970 -0.493662
  11. x 0.600178 0.274230
  12. zero y 0.132885 -0.023688
  13. x 2.410179 1.450520
  14.  
  15. In [82]: df2 = df.mean(level=0)
  16.  
  17. In [83]: df2
  18. Out[83]:
  19. 0 1
  20. one 1.060074 -0.109716
  21. zero 1.271532 0.713416
  22.  
  23. In [84]: df2.reindex(df.index, level=0)
  24. Out[84]:
  25. 0 1
  26. one y 1.060074 -0.109716
  27. x 1.060074 -0.109716
  28. zero y 1.271532 0.713416
  29. x 1.271532 0.713416
  30.  
  31. # aligning
  32. In [85]: df_aligned, df2_aligned = df.align(df2, level=0)
  33.  
  34. In [86]: df_aligned
  35. Out[86]:
  36. 0 1
  37. one y 1.519970 -0.493662
  38. x 0.600178 0.274230
  39. zero y 0.132885 -0.023688
  40. x 2.410179 1.450520
  41.  
  42. In [87]: df2_aligned
  43. Out[87]:
  44. 0 1
  45. one y 1.060074 -0.109716
  46. x 1.060074 -0.109716
  47. zero y 1.271532 0.713416
  48. x 1.271532 0.713416

Swapping levels with swaplevel

The swaplevel() method can switch the order of two levels:

In [88]: df[:5]
Out[88]:
                0         1
one  y   1.519970 -0.493662
     x   0.600178  0.274230
zero y   0.132885 -0.023688
     x   2.410179  1.450520

In [89]: df[:5].swaplevel(0, 1, axis=0)
Out[89]:
                0         1
y one    1.519970 -0.493662
x one    0.600178  0.274230
y zero   0.132885 -0.023688
x zero   2.410179  1.450520

Reordering levels with reorder_levels

The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical index levels in one step:

In [90]: df[:5].reorder_levels([1, 0], axis=0)
Out[90]:
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520

Renaming names of an Index or MultiIndex

The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the columns of a DataFrame. The columns argument of rename allows a dictionary to be specified that includes only the columns you wish to rename.

In [91]: df.rename(columns={0: "col0", 1: "col1"})
Out[91]:
            col0      col1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

This method can also be used to rename specific labels of the main index of the DataFrame.

In [92]: df.rename(index={"one": "two", "y": "z"})
Out[92]:
               0         1
two  z  1.519970 -0.493662
     x  0.600178  0.274230
zero z  0.132885 -0.023688
     x  2.410179  1.450520

The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, the names of the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the values from the MultiIndex to a column.

In [93]: df.rename_axis(index=['abc', 'def'])
Out[93]:
                 0         1
abc  def
one  y    1.519970 -0.493662
     x    0.600178  0.274230
zero y    0.132885 -0.023688
     x    2.410179  1.450520

Note that the columns of a DataFrame are an index, so that using rename_axis with the columns argument will change the name of that index.

In [94]: df.rename_axis(columns="Cols").columns
Out[94]: RangeIndex(start=0, stop=2, step=1, name='Cols')

Both rename and rename_axis support specifying a dictionary, Series or a mapping function to map labels/names to new values.
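
A minimal sketch of the function form (not part of the original session); the frame and the level names 'first'/'second' below are made up for illustration.

import pandas as pd

midx = pd.MultiIndex.from_product([['one', 'zero'], ['x', 'y']],
                                  names=['first', 'second'])
frame = pd.DataFrame({'val': range(4)}, index=midx)

# rename maps the labels themselves; a function is applied to every label
frame.rename(index=str.upper)        # 'one' -> 'ONE', 'x' -> 'X', ...

# rename_axis maps the level *names*: 'first'/'second' -> 'FIRST'/'SECOND'
frame.rename_axis(index=str.upper)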

Sorting a MultiIndex

For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use sort_index().

  1. In [95]: import random
  2.  
  3. In [96]: random.shuffle(tuples)
  4.  
  5. In [97]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))
  6.  
  7. In [98]: s
  8. Out[98]:
  9. bar two 0.206053
  10. baz one -0.251905
  11. foo one -2.213588
  12. two 1.063327
  13. qux one 1.266143
  14. two 0.299368
  15. bar one -0.863838
  16. baz two 0.408204
  17. dtype: float64
  18.  
  19. In [99]: s.sort_index()
  20. Out[99]:
  21. bar one -0.863838
  22. two 0.206053
  23. baz one -0.251905
  24. two 0.408204
  25. foo one -2.213588
  26. two 1.063327
  27. qux one 1.266143
  28. two 0.299368
  29. dtype: float64
  30.  
  31. In [100]: s.sort_index(level=0)
  32. Out[100]:
  33. bar one -0.863838
  34. two 0.206053
  35. baz one -0.251905
  36. two 0.408204
  37. foo one -2.213588
  38. two 1.063327
  39. qux one 1.266143
  40. two 0.299368
  41. dtype: float64
  42.  
  43. In [101]: s.sort_index(level=1)
  44. Out[101]:
  45. bar one -0.863838
  46. baz one -0.251905
  47. foo one -2.213588
  48. qux one 1.266143
  49. bar two 0.206053
  50. baz two 0.408204
  51. foo two 1.063327
  52. qux two 0.299368
  53. dtype: float64

You may also pass a level name to sort_index if the MultiIndex levels are named.

  1. In [102]: s.index.set_names(['L1', 'L2'], inplace=True)
  2.  
  3. In [103]: s.sort_index(level='L1')
  4. Out[103]:
  5. L1 L2
  6. bar one -0.863838
  7. two 0.206053
  8. baz one -0.251905
  9. two 0.408204
  10. foo one -2.213588
  11. two 1.063327
  12. qux one 1.266143
  13. two 0.299368
  14. dtype: float64
  15.  
  16. In [104]: s.sort_index(level='L2')
  17. Out[104]:
  18. L1 L2
  19. bar one -0.863838
  20. baz one -0.251905
  21. foo one -2.213588
  22. qux one 1.266143
  23. bar two 0.206053
  24. baz two 0.408204
  25. foo two 1.063327
  26. qux two 0.299368
  27. dtype: float64

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:

In [105]: df.T.sort_index(level=1, axis=1)
Out[105]:
        one      zero       one      zero
          x         x         y         y
0  0.600178  2.410179  1.519970  0.132885
1  0.274230  1.450520 -0.493662 -0.023688

Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:

In [106]: dfm = pd.DataFrame({'jim': [0, 0, 1, 1],
   .....:                     'joe': ['x', 'x', 'z', 'y'],
   .....:                     'jolie': np.random.rand(4)})
   .....:

In [107]: dfm = dfm.set_index(['jim', 'joe'])

In [108]: dfm
Out[108]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   z    0.537020
    y    0.110968

In [4]: dfm.loc[(1, 'z')]
PerformanceWarning: indexing past lexsort depth may impact performance.

Out[4]:
           jolie
jim joe
1   z    0.64094

Furthermore, if you try to index something that is not fully lexsorted, this can raise:

In [5]: dfm.loc[(0, 'y'):(1, 'z')]
UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'

The is_lexsorted() method on a MultiIndex shows if the index is sorted, and the lexsort_depth property returns the sort depth:

In [109]: dfm.index.is_lexsorted()
Out[109]: False

In [110]: dfm.index.lexsort_depth
Out[110]: 1

In [111]: dfm = dfm.sort_index()

In [112]: dfm
Out[112]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   y    0.110968
    z    0.537020

In [113]: dfm.index.is_lexsorted()
Out[113]: True

In [114]: dfm.index.lexsort_depth
Out[114]: 2

And now selection works as expected.

In [115]: dfm.loc[(0, 'y'):(1, 'z')]
Out[115]:
            jolie
jim joe
1   y    0.110968
    z    0.537020

Take methods

Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provide the take() method that retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as relative positions to the end of the object.

  1. In [116]: index = pd.Index(np.random.randint(0, 1000, 10))
  2.  
  3. In [117]: index
  4. Out[117]: Int64Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')
  5.  
  6. In [118]: positions = [0, 9, 3]
  7.  
  8. In [119]: index[positions]
  9. Out[119]: Int64Index([214, 329, 567], dtype='int64')
  10.  
  11. In [120]: index.take(positions)
  12. Out[120]: Int64Index([214, 329, 567], dtype='int64')
  13.  
  14. In [121]: ser = pd.Series(np.random.randn(10))
  15.  
  16. In [122]: ser.iloc[positions]
  17. Out[122]:
  18. 0 -0.179666
  19. 9 1.824375
  20. 3 0.392149
  21. dtype: float64
  22.  
  23. In [123]: ser.take(positions)
  24. Out[123]:
  25. 0 -0.179666
  26. 9 1.824375
  27. 3 0.392149
  28. dtype: float64

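The negative-position behaviour mentioned above is not shown in the session; here is a minimal, self-contained sketch of it.

import numpy as np
import pandas as pd

ser = pd.Series(np.arange(5.0))

# negative positions count from the end, exactly as with NumPy's take
ser.take([-1, -2])       # positions 4 and 3

idx = pd.Index([10, 20, 30, 40])
idx.take([-1, 0])        # selects 40 and 10
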
For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.

  1. In [124]: frm = pd.DataFrame(np.random.randn(5, 3))
  2.  
  3. In [125]: frm.take([1, 4, 3])
  4. Out[125]:
  5. 0 1 2
  6. 1 -1.237881 0.106854 -1.276829
  7. 4 0.629675 -1.425966 1.857704
  8. 3 0.979542 -1.633678 0.615855
  9.  
  10. In [126]: frm.take([0, 2], axis=1)
  11. Out[126]:
  12. 0 2
  13. 0 0.595974 0.601544
  14. 1 -1.237881 -1.276829
  15. 2 -0.767101 1.499591
  16. 3 0.979542 0.615855
  17. 4 0.629675 1.857704

It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results.

  1. In [127]: arr = np.random.randn(10)
  2.  
  3. In [128]: arr.take([False, False, True, True])
  4. Out[128]: array([-1.1935, -1.1935, 0.6775, 0.6775])
  5.  
  6. In [129]: arr[[0, 1]]
  7. Out[129]: array([-1.1935, 0.6775])
  8.  
  9. In [130]: ser = pd.Series(np.random.randn(10))
  10.  
  11. In [131]: ser.take([False, False, True, True])
  12. Out[131]:
  13. 0 0.233141
  14. 0 0.233141
  15. 1 -0.223540
  16. 1 -0.223540
  17. dtype: float64
  18.  
  19. In [132]: ser.iloc[[0, 1]]
  20. Out[132]:
  21. 0 0.233141
  22. 1 -0.223540
  23. dtype: float64

Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can offer performance that is a good deal faster than fancy indexing.

  1. In [133]: arr = np.random.randn(10000, 5)
  2.  
  3. In [134]: indexer = np.arange(10000)
  4.  
  5. In [135]: random.shuffle(indexer)
  6.  
  7. In [136]: %timeit arr[indexer]
  8. .....: %timeit arr.take(indexer, axis=0)
  9. .....:
  10. 155 us +- 7.75 us per loop (mean +- std. dev. of 7 runs, 10000 loops each)
  11. 41.5 us +- 530 ns per loop (mean +- std. dev. of 7 runs, 10000 loops each)
  1. In [137]: ser = pd.Series(arr[:, 0])
  2.  
  3. In [138]: %timeit ser.iloc[indexer]
  4. .....: %timeit ser.take(indexer)
  5. .....:
  6. 121 us +- 4.48 us per loop (mean +- std. dev. of 7 runs, 10000 loops each)
  7. 110 us +- 3 us per loop (mean +- std. dev. of 7 runs, 10000 loops each)

Index types

We have discussed MultiIndex in the previous sections pretty extensively. Documentation about DatetimeIndex and PeriodIndex is shown here, and documentation about TimedeltaIndex is found here.

In the following sub-sections we will highlight some other index types.

CategoricalIndex

CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.

  1. In [139]: from pandas.api.types import CategoricalDtype
  2.  
  3. In [140]: df = pd.DataFrame({'A': np.arange(6),
  4. .....: 'B': list('aabbca')})
  5. .....:
  6.  
  7. In [141]: df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
  8.  
  9. In [142]: df
  10. Out[142]:
  11. A B
  12. 0 0 a
  13. 1 1 a
  14. 2 2 b
  15. 3 3 b
  16. 4 4 c
  17. 5 5 a
  18.  
  19. In [143]: df.dtypes
  20. Out[143]:
  21. A int64
  22. B category
  23. dtype: object
  24.  
  25. In [144]: df.B.cat.categories
  26. Out[144]: Index(['c', 'a', 'b'], dtype='object')

Setting the index will create a CategoricalIndex.

In [145]: df2 = df.set_index('B')

In [146]: df2.index
Out[146]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Indexing with getitem/.iloc/.loc works similarly to an Index with duplicates. The indexers must be in the category or the operation will raise a KeyError.

In [147]: df2.loc['a']
Out[147]:
   A
B
a  0
a  1
a  5

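The KeyError case mentioned above is not shown in the session; a minimal sketch (with a made-up lookup label 'e') would be:

import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'A': np.arange(6), 'B': list('aabbca')})
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
df2 = df.set_index('B')

try:
    df2.loc['e']                 # 'e' is not one of the categories 'c', 'a', 'b'
except KeyError as err:
    print('raised KeyError:', err)
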
The CategoricalIndex is preserved after indexing:

In [148]: df2.loc['a'].index
Out[148]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Sorting the index will sort by the order of the categories (recall that we created the index with CategoricalDtype(list('cab')), so the sorted order is cab).

In [149]: df2.sort_index()
Out[149]:
   A
B
c  4
a  0
a  1
a  5
b  2
b  3

Groupby operations on the index will preserve the index nature as well.

In [150]: df2.groupby(level=0).sum()
Out[150]:
   A
B
c  4
a  6
b  5

In [151]: df2.groupby(level=0).sum().index
Out[151]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex any pandas index.

  1. In [152]: df2.reindex(['a', 'e'])
  2. Out[152]:
  3. A
  4. B
  5. a 0.0
  6. a 1.0
  7. a 5.0
  8. e NaN
  9.  
  10. In [153]: df2.reindex(['a', 'e']).index
  11. Out[153]: Index(['a', 'a', 'a', 'e'], dtype='object', name='B')
  12.  
  13. In [154]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde')))
  14. Out[154]:
  15. A
  16. B
  17. a 0.0
  18. a 1.0
  19. a 5.0
  20. e NaN
  21.  
  22. In [155]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde'))).index
  23. Out[155]: CategoricalIndex(['a', 'a', 'a', 'e'], categories=['a', 'b', 'c', 'd', 'e'], ordered=False, name='B', dtype='category')

Warning

Reshaping and comparison operations on a CategoricalIndex must have the same categories or a TypeError will be raised.

In [9]: df3 = pd.DataFrame({'A': np.arange(6), 'B': pd.Series(list('aabbca')).astype('category')})

In [11]: df3 = df3.set_index('B')

In [11]: df3.index
Out[11]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['a', 'b', 'c'], ordered=False, name='B', dtype='category')

In [12]: pd.concat([df2, df3])
TypeError: categories must match existing categories when appending

Int64Index and RangeIndex

Warning

Indexing on an integer-based Index with floats has been clarified in 0.18.0; for a summary of the changes, see here.

Int64Index is a fundamental basic index in pandas. This is an immutable array implementing an ordered, sliceable set. Prior to 0.18.0, the Int64Index would provide the default index for all NDFrame objects.

RangeIndex is a sub-class of Int64Index added in version 0.18.0, now providing the default index for all NDFrame objects. RangeIndex is an optimized version of Int64Index that can represent a monotonic ordered set. These are analogous to Python range types.
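
Neither class is demonstrated in this section; a minimal sketch of the default behaviour:

import pandas as pd

s = pd.Series(['a', 'b', 'c'])

# the default index is a RangeIndex, stored as start/stop/step rather than
# as a materialized array of integers
s.index                      # RangeIndex(start=0, stop=3, step=1)

# an explicit integer index is an Int64Index
pd.Index([10, 20, 30])       # Int64Index([10, 20, 30], dtype='int64')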

Float64Index

By default a Float64Index will be automatically created when passing floating, or mixed-integer-floating values in index creation. This enables a pure label-based slicing paradigm that makes [], ix, loc for scalar indexing and slicing work exactly the same.

In [156]: indexf = pd.Index([1.5, 2, 3, 4.5, 5])

In [157]: indexf
Out[157]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')

In [158]: sf = pd.Series(range(5), index=indexf)

In [159]: sf
Out[159]:
1.5    0
2.0    1
3.0    2
4.5    3
5.0    4
dtype: int64

Scalar selection for [], .loc will always be label based. An integer will match an equal float index (e.g. 3 is equivalent to 3.0).

In [160]: sf[3]
Out[160]: 2

In [161]: sf[3.0]
Out[161]: 2

In [162]: sf.loc[3]
Out[162]: 2

In [163]: sf.loc[3.0]
Out[163]: 2

The only positional indexing is via iloc.

In [164]: sf.iloc[3]
Out[164]: 3

A scalar index that is not found will raise a KeyError. Slicing is primarily on the values of the index when using [], ix, loc, and always positional when using iloc. The exception is when the slice is boolean, in which case it will always be positional.

  1. In [165]: sf[2:4]
  2. Out[165]:
  3. 2.0 1
  4. 3.0 2
  5. dtype: int64
  6.  
  7. In [166]: sf.loc[2:4]
  8. Out[166]:
  9. 2.0 1
  10. 3.0 2
  11. dtype: int64
  12.  
  13. In [167]: sf.iloc[2:4]
  14. Out[167]:
  15. 3.0 2
  16. 4.5 3
  17. dtype: int64

In float indexes, slicing using floats is allowed.

In [168]: sf[2.1:4.6]
Out[168]:
3.0    2
4.5    3
dtype: int64

In [169]: sf.loc[2.1:4.6]
Out[169]:
3.0    2
4.5    3
dtype: int64

In non-float indexes, slicing using floats will raise a TypeError.

In [1]: pd.Series(range(5))[3.5]
TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)

In [1]: pd.Series(range(5))[3.5:4.5]
TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)

Warning

Using a scalar float indexer for .iloc has been removed in 0.18.0, so the following will raise a TypeError:

In [3]: pd.Series(range(5)).iloc[3.0]
TypeError: cannot do positional indexing on <class 'pandas.indexes.range.RangeIndex'> with these indexers [3.0] of <type 'float'>

Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat irregular timedelta-like indexing scheme, but the data is recorded as floats. This could, for example, be millisecond offsets.

  1. In [170]: dfir = pd.concat([pd.DataFrame(np.random.randn(5, 2),
  2. .....: index=np.arange(5) * 250.0,
  3. .....: columns=list('AB')),
  4. .....: pd.DataFrame(np.random.randn(6, 2),
  5. .....: index=np.arange(4, 10) * 250.1,
  6. .....: columns=list('AB'))])
  7. .....:
  8.  
  9. In [171]: dfir
  10. Out[171]:
  11. A B
  12. 0.0 -0.435772 -1.188928
  13. 250.0 -0.808286 -0.284634
  14. 500.0 -1.815703 1.347213
  15. 750.0 -0.243487 0.514704
  16. 1000.0 1.162969 -0.287725
  17. 1000.4 -0.179734 0.993962
  18. 1250.5 -0.212673 0.909872
  19. 1500.6 -0.733333 -0.349893
  20. 1750.7 0.456434 -0.306735
  21. 2000.8 0.553396 0.166221
  22. 2250.9 -0.101684 -0.734907

Selection operations then will always work on a value basis, for all selection operators.

  1. In [172]: dfir[0:1000.4]
  2. Out[172]:
  3. A B
  4. 0.0 -0.435772 -1.188928
  5. 250.0 -0.808286 -0.284634
  6. 500.0 -1.815703 1.347213
  7. 750.0 -0.243487 0.514704
  8. 1000.0 1.162969 -0.287725
  9. 1000.4 -0.179734 0.993962
  10.  
  11. In [173]: dfir.loc[0:1001, 'A']
  12. Out[173]:
  13. 0.0 -0.435772
  14. 250.0 -0.808286
  15. 500.0 -1.815703
  16. 750.0 -0.243487
  17. 1000.0 1.162969
  18. 1000.4 -0.179734
  19. Name: A, dtype: float64
  20.  
  21. In [174]: dfir.loc[1000.4]
  22. Out[174]:
  23. A -0.179734
  24. B 0.993962
  25. Name: 1000.4, dtype: float64

You could retrieve the first 1 second (1000 ms) of data as such:

In [175]: dfir[0:1000]
Out[175]:
               A         B
0.0    -0.435772 -1.188928
250.0  -0.808286 -0.284634
500.0  -1.815703  1.347213
750.0  -0.243487  0.514704
1000.0  1.162969 -0.287725

If you need integer based selection, you should use iloc:

In [176]: dfir.iloc[0:5]
Out[176]:
               A         B
0.0    -0.435772 -1.188928
250.0  -0.808286 -0.284634
500.0  -1.815703  1.347213
750.0  -0.243487  0.514704
1000.0  1.162969 -0.287725

IntervalIndex

New in version 0.20.0.

IntervalIndex together with its own dtype, IntervalDtype, as well as the Interval scalar type, allow first-class support in pandas for interval notation.

The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut() and qcut().

Indexing with an IntervalIndex

An IntervalIndex can be used in Series and in DataFrame as the index.

In [177]: df = pd.DataFrame({'A': [1, 2, 3, 4]},
   .....:                   index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]))
   .....:

In [178]: df
Out[178]:
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4

Label based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.

In [179]: df.loc[2]
Out[179]:
A    2
Name: (1, 2], dtype: int64

In [180]: df.loc[[2, 3]]
Out[180]:
        A
(1, 2]  2
(2, 3]  3

If you select a label contained within an interval, this will also select the interval.

In [181]: df.loc[2.5]
Out[181]:
A    3
Name: (2, 3], dtype: int64

In [182]: df.loc[[2.5, 3.5]]
Out[182]:
        A
(2, 3]  3
(3, 4]  4

Selecting using an Interval will only return exact matches (starting from pandas 0.25.0).

In [183]: df.loc[pd.Interval(1, 2)]
Out[183]:
A    2
Name: (1, 2], dtype: int64

Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.

In [7]: df.loc[pd.Interval(0.5, 2.5)]
KeyError: Interval(0.5, 2.5, closed='right')

Selecting all Intervals that overlap a given Interval can be performed using the overlaps() method to create a boolean indexer.

In [184]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))

In [185]: idxr
Out[185]: array([ True,  True,  True, False])

In [186]: df[idxr]
Out[186]:
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3

Binning data with cut and qcut

cut() and qcut() both return a Categorical object, and the bins they create are stored as an IntervalIndex in its .categories attribute.

In [187]: c = pd.cut(range(4), bins=2)

In [188]: c
Out[188]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

In [189]: c.categories
Out[189]:
IntervalIndex([(-0.003, 1.5], (1.5, 3.0]],
              closed='right',
              dtype='interval[float64]')

cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First, we call cut() with some data and bins set to a fixed number, to generate the bins. Then, we pass the values of .categories as the bins argument in subsequent calls to cut(), supplying new data which will be binned into the same bins.

In [190]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[190]:
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

Any value which falls outside all bins will be assigned a NaN value.

Generating ranges of intervals

If we need intervals on a regular frequency, we can use the interval_range() function to create an IntervalIndex using various combinations of start, end, and periods. The default frequency for interval_range is 1 for numeric intervals, and calendar day for datetime-like intervals:

  1. In [191]: pd.interval_range(start=0, end=5)
  2. Out[191]:
  3. IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
  4. closed='right',
  5. dtype='interval[int64]')
  6.  
  7. In [192]: pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4)
  8. Out[192]:
  9. IntervalIndex([(2017-01-01, 2017-01-02], (2017-01-02, 2017-01-03], (2017-01-03, 2017-01-04], (2017-01-04, 2017-01-05]],
  10. closed='right',
  11. dtype='interval[datetime64[ns]]')
  12.  
  13. In [193]: pd.interval_range(end=pd.Timedelta('3 days'), periods=3)
  14. Out[193]:
  15. IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00], (2 days 00:00:00, 3 days 00:00:00]],
  16. closed='right',
  17. dtype='interval[timedelta64[ns]]')

The freq parameter can be used to specify non-default frequencies, and can utilize a variety of frequency aliases with datetime-like intervals:

  1. In [194]: pd.interval_range(start=0, periods=5, freq=1.5)
  2. Out[194]:
  3. IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]],
  4. closed='right',
  5. dtype='interval[float64]')
  6.  
  7. In [195]: pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4, freq='W')
  8. Out[195]:
  9. IntervalIndex([(2017-01-01, 2017-01-08], (2017-01-08, 2017-01-15], (2017-01-15, 2017-01-22], (2017-01-22, 2017-01-29]],
  10. closed='right',
  11. dtype='interval[datetime64[ns]]')
  12.  
  13. In [196]: pd.interval_range(start=pd.Timedelta('0 days'), periods=3, freq='9H')
  14. Out[196]:
  15. IntervalIndex([(0 days 00:00:00, 0 days 09:00:00], (0 days 09:00:00, 0 days 18:00:00], (0 days 18:00:00, 1 days 03:00:00]],
  16. closed='right',
  17. dtype='interval[timedelta64[ns]]')

Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals are closed on the right side by default.

In [197]: pd.interval_range(start=0, end=4, closed='both')
Out[197]:
IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]],
              closed='both',
              dtype='interval[int64]')

In [198]: pd.interval_range(start=0, end=4, closed='neither')
Out[198]:
IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)],
              closed='neither',
              dtype='interval[int64]')

New in version 0.23.0.

Specifying start, end, and periods will generate a range of evenly spaced intervals from start to end inclusively, with periods number of elements in the resulting IntervalIndex:

In [199]: pd.interval_range(start=0, end=6, periods=4)
Out[199]:
IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]],
              closed='right',
              dtype='interval[float64]')

In [200]: pd.interval_range(pd.Timestamp('2018-01-01'),
   .....:                   pd.Timestamp('2018-02-28'), periods=3)
   .....:
Out[200]:
IntervalIndex([(2018-01-01, 2018-01-20 08:00:00], (2018-01-20 08:00:00, 2018-02-08 16:00:00], (2018-02-08 16:00:00, 2018-02-28]],
              closed='right',
              dtype='interval[datetime64[ns]]')

Miscellaneous indexing FAQ

Integer indexing

Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .loc. The following code will generate exceptions:

  1. In [201]: s = pd.Series(range(5))
  2.  
  3. In [202]: s[-1]
  4. ---------------------------------------------------------------------------
  5. KeyError Traceback (most recent call last)
  6. <ipython-input-202-76c3dce40054> in <module>
  7. ----> 1 s[-1]
  8.  
  9. /pandas/pandas/core/series.py in __getitem__(self, key)
  10. 1069 key = com.apply_if_callable(key, self)
  11. 1070 try:
  12. -> 1071 result = self.index.get_value(self, key)
  13. 1072
  14. 1073 if not is_scalar(result):
  15.  
  16. /pandas/pandas/core/indexes/base.py in get_value(self, series, key)
  17. 4728 k = self._convert_scalar_indexer(k, kind="getitem")
  18. 4729 try:
  19. -> 4730 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
  20. 4731 except KeyError as e1:
  21. 4732 if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
  22.  
  23. /pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
  24.  
  25. /pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
  26.  
  27. /pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
  28.  
  29. /pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
  30.  
  31. /pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
  32.  
  33. KeyError: -1
  34.  
  35. In [203]: df = pd.DataFrame(np.random.randn(5, 4))
  36.  
  37. In [204]: df
  38. Out[204]:
  39. 0 1 2 3
  40. 0 -0.130121 -0.476046 0.759104 0.213379
  41. 1 -0.082641 0.448008 0.656420 -1.051443
  42. 2 0.594956 -0.151360 -0.069303 1.221431
  43. 3 -0.182832 0.791235 0.042745 2.069775
  44. 4 1.446552 0.019814 -1.389212 -0.702312
  45.  
  46. In [205]: df.loc[-2:]
  47. Out[205]:
  48. 0 1 2 3
  49. 0 -0.130121 -0.476046 0.759104 0.213379
  50. 1 -0.082641 0.448008 0.656420 -1.051443
  51. 2 0.594956 -0.151360 -0.069303 1.221431
  52. 3 -0.182832 0.791235 0.042745 2.069775
  53. 4 1.446552 0.019814 -1.389212 -0.702312

This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).

Non-monotonic indexes require exact matches

If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing() and is_monotonic_decreasing() attributes.

In [206]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=['data'], data=list(range(5)))

In [207]: df.index.is_monotonic_increasing
Out[207]: True

# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [208]: df.loc[0:4, :]
Out[208]:
   data
2     0
3     1
3     2
4     3

# slice bounds are outside the index, so an empty DataFrame is returned
In [209]: df.loc[13:15, :]
Out[209]:
Empty DataFrame
Columns: [data]
Index: []

On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.

In [210]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5],
   .....:                   columns=['data'], data=list(range(6)))
   .....:

In [211]: df.index.is_monotonic_increasing
Out[211]: False

# OK because 2 and 4 are in the index
In [212]: df.loc[2:4, :]
Out[212]:
   data
2     0
3     1
1     2
4     3

# 0 is not in the index
In [9]: df.loc[0:4, :]
KeyError: 0

# 3 is not a unique label
In [11]: df.loc[2:3, :]
KeyError: 'Cannot get right slice bound for non-unique label: 3'

Index.is_monotonic_increasing and Index.is_monotonic_decreasing only check that an index is weakly monotonic. To check for strict monotonicity, you can combine one of those with the is_unique() attribute.

In [213]: weakly_monotonic = pd.Index(['a', 'b', 'c', 'c'])

In [214]: weakly_monotonic
Out[214]: Index(['a', 'b', 'c', 'c'], dtype='object')

In [215]: weakly_monotonic.is_monotonic_increasing
Out[215]: True

In [216]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[216]: False

Endpoints are inclusive

Compared with standard Python sequence slicing in which the slice endpoint is not inclusive, label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the “successor” or next element after a particular label in an index. For example, consider the following Series:

In [217]: s = pd.Series(np.random.randn(6), index=list('abcdef'))

In [218]: s
Out[218]:
a    0.301379
b    1.240445
c   -0.846068
d   -0.043312
e   -1.658747
f   -0.819549
dtype: float64

Suppose we wished to slice from c to e, using integers this would be accomplished as such:

In [219]: s[2:5]
Out[219]:
c   -0.846068
d   -0.043312
e   -1.658747
dtype: float64

However, if you only had c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:

s.loc['c':'e' + 1]

A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice to make label-based slicing include both endpoints:

In [220]: s.loc['c':'e']
Out[220]:
c   -0.846068
d   -0.043312
e   -1.658747
dtype: float64

This is most definitely a “practicality beats purity” sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly in the way that standard Python integer slicing works.

Indexing potentially changes underlying Series dtype

The different indexing operations can potentially change the dtype of a Series.

  1. In [221]: series1 = pd.Series([1, 2, 3])
  2.  
  3. In [222]: series1.dtype
  4. Out[222]: dtype('int64')
  5.  
  6. In [223]: res = series1.reindex([0, 4])
  7.  
  8. In [224]: res.dtype
  9. Out[224]: dtype('float64')
  10.  
  11. In [225]: res
  12. Out[225]:
  13. 0 1.0
  14. 4 NaN
  15. dtype: float64
  1. In [226]: series2 = pd.Series([True])
  2.  
  3. In [227]: series2.dtype
  4. Out[227]: dtype('bool')
  5.  
  6. In [228]: res = series2.reindex_like(series1)
  7.  
  8. In [229]: res.dtype
  9. Out[229]: dtype('O')
  10.  
  11. In [230]: res
  12. Out[230]:
  13. 0 True
  14. 1 NaN
  15. 2 NaN
  16. dtype: object

This is because the (re)indexing operations above silently insert NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.

See this old issue for a more detailed discussion.
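
A minimal sketch of the kind of problem meant here (not from the original docs); the exact error, if any, depends on the NumPy version, so the ufunc call is wrapped in a try/except.

import numpy as np
import pandas as pd

series2 = pd.Series([True])
res = series2.reindex_like(pd.Series([1, 2, 3]))   # dtype becomes object, with NaNs inserted

try:
    out = np.logical_and(res, True)                # operating on object dtype, not bool
    print(out)                                     # may not be the boolean result you expect
except TypeError as err:
    print('ufunc failed on object dtype:', err)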