Reindexing and altering labels

reindex() is the fundamental data alignment method in pandas. It is used to implement nearly all other features relying on label-alignment functionality. To reindex means to conform the data to match a given set of labels along a particular axis. This accomplishes several things:

  • Reorders the existing data to match a new set of labels
  • Inserts missing value (NA) markers in label locations where no data for that label existed
  • If specified, fills data for missing labels using the chosen logic (highly relevant to working with time series data)

Here is a simple example:

  In [216]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

  In [217]: s
  Out[217]:
  a   -0.454087
  b   -0.360309
  c   -0.951631
  d   -0.535459
  e    0.835231
  dtype: float64

  In [218]: s.reindex(['e', 'b', 'f', 'd'])
  Out[218]:
  e    0.835231
  b   -0.360309
  f         NaN
  d   -0.535459
  dtype: float64

Here, the f label was not contained in the Series and hence appears as NaN in the result.

With a DataFrame, you can simultaneously reindex the index and columns:

  In [219]: df
  Out[219]:
          one       two     three
  a -1.101558  1.124472       NaN
  b -0.177289  2.487104 -0.634293
  c  0.462215 -0.486066  1.931194
  d       NaN -0.456288 -1.222918

  In [220]: df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])
  Out[220]:
        three       two       one
  c  1.931194 -0.486066  0.462215
  f       NaN       NaN       NaN
  b -0.634293  2.487104 -0.177289

You may also use reindex with an axis keyword:

  In [221]: df.reindex(['c', 'f', 'b'], axis='index')
  Out[221]:
          one       two     three
  c  0.462215 -0.486066  1.931194
  f       NaN       NaN       NaN
  b -0.177289  2.487104 -0.634293

Note that the Index objects containing the actual axis labels can be shared between objects. So if we have a Series and a DataFrame, the following can be done:

  In [222]: rs = s.reindex(df.index)

  In [223]: rs
  Out[223]:
  a   -0.454087
  b   -0.360309
  c   -0.951631
  d   -0.535459
  dtype: float64

  In [224]: rs.index is df.index
  Out[224]: True

This means that the reindexed Series’s index is the same Python object as the DataFrame’s index.

New in version 0.21.0.

DataFrame.reindex() also supports an “axis-style” calling convention, where you specify a single labels argument and the axis it applies to.

  In [225]: df.reindex(['c', 'f', 'b'], axis='index')
  Out[225]:
          one       two     three
  c  0.462215 -0.486066  1.931194
  f       NaN       NaN       NaN
  b -0.177289  2.487104 -0.634293

  In [226]: df.reindex(['three', 'two', 'one'], axis='columns')
  Out[226]:
        three       two       one
  a       NaN  1.124472 -1.101558
  b -0.634293  2.487104 -0.177289
  c  1.931194 -0.486066  0.462215
  d -1.222918 -0.456288       NaN
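The two calling conventions are meant to be interchangeable. As a small sketch (the frame below uses made-up values, not the df from the session above), the keyword form and the axis-style form can be compared directly:

```python
import pandas as pd

# Hypothetical frame with made-up values, for illustration only.
df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

# Keyword convention: name the axis through the parameter.
rows_kw = df.reindex(index=["c", "f", "b"])
cols_kw = df.reindex(columns=["two", "one"])

# Axis-style convention: one labels argument plus the axis it applies to.
rows_axis = df.reindex(["c", "f", "b"], axis="index")
cols_axis = df.reindex(["two", "one"], axis="columns")

print(rows_kw.equals(rows_axis), cols_kw.equals(cols_axis))
```

Which form to use is largely a matter of taste; the keyword form is the only one that can reindex rows and columns in a single call.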

See also: MultiIndex / Advanced Indexing offers an even more concise way of reindexing.

Note: When writing performance-sensitive code, there is a good reason to spend some time becoming a reindexing ninja: many operations are faster on pre-aligned data. Adding two unaligned DataFrames internally triggers a reindexing step. For exploratory analysis you will hardly notice the difference (because reindex has been heavily optimized), but when CPU cycles matter, sprinkling a few explicit reindex calls here and there can have an impact.
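A rough sketch of what an explicit reindex call looks like in this situation is given below. The frames left and right and their values are invented for the example, and no timing claim is made; the code only shows that aligning once up front gives the same result as letting the arithmetic align implicitly.

```python
import pandas as pd

# Two hypothetical frames whose row labels only partly overlap.
left = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
right = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=["b", "c", "d"])

# Implicit alignment: pandas aligns both operands inside the addition.
implicit = left + right

# Explicit alignment: conform both frames to the union of labels once,
# then reuse the aligned objects for as many operations as needed.
common = left.index.union(right.index)
left_al = left.reindex(common)
right_al = right.reindex(common)
explicit = left_al + right_al

print(implicit.equals(explicit))
```

Whether this pays off depends on how many operations reuse the aligned objects, so it is worth measuring before restructuring code around it.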

Reindexing to align with another object

You may wish to take an object and reindex its axes to be labeled the same as another object. While the syntax for this is straightforward albeit verbose, it is a common enough operation that the reindex_like() method is available to make this simpler:

  In [227]: df2
  Out[227]:
          one       two
  a -1.101558  1.124472
  b -0.177289  2.487104
  c  0.462215 -0.486066

  In [228]: df3
  Out[228]:
          one       two
  a -0.829347  0.082635
  b  0.094922  1.445267
  c  0.734426 -1.527903

  In [229]: df.reindex_like(df2)
  Out[229]:
          one       two
  a -1.101558  1.124472
  b -0.177289  2.487104
  c  0.462215 -0.486066
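To make the "straightforward albeit verbose" alternative concrete, the sketch below compares reindex_like() with the equivalent reindex() call. The frames df and template are hypothetical stand-ins with made-up values.

```python
import pandas as pd

# Hypothetical source frame and a smaller frame whose labels we want to match.
df = pd.DataFrame(
    {
        "one": [1.0, 2.0, 3.0, 4.0],
        "two": [5.0, 6.0, 7.0, 8.0],
        "three": [9.0, 10.0, 11.0, 12.0],
    },
    index=["a", "b", "c", "d"],
)
template = pd.DataFrame(0.0, index=["a", "b", "c"], columns=["one", "two"])

# Short form: take the labels of both axes from the other object.
short = df.reindex_like(template)

# Verbose form: spell out both axes by hand.
verbose = df.reindex(index=template.index, columns=template.columns)

print(short.equals(verbose))
```

Only the labels of template are used; its values play no role in the result.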

Aligning objects with each other with align

The align() method is the fastest way to simultaneously align two objects. It supports a join argument (related to joining and merging):

  • join='outer': take the union of the indexes (default)
  • join='left': use the calling object’s index
  • join='right': use the passed object’s index
  • join='inner': intersect the indexes

It returns a tuple with both of the reindexed Series:

  In [230]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

  In [231]: s1 = s[:4]

  In [232]: s2 = s[1:]

  In [233]: s1.align(s2)
  Out[233]:
  (a    0.505453
   b    1.788110
   c   -0.405908
   d   -0.801912
   e         NaN
   dtype: float64, a         NaN
   b    1.788110
   c   -0.405908
   d   -0.801912
   e    0.768460
   dtype: float64)

  In [234]: s1.align(s2, join='inner')
  Out[234]:
  (b    1.788110
   c   -0.405908
   d   -0.801912
   dtype: float64, b    1.788110
   c   -0.405908
   d   -0.801912
   dtype: float64)

  In [235]: s1.align(s2, join='left')
  Out[235]:
  (a    0.505453
   b    1.788110
   c   -0.405908
   d   -0.801912
   dtype: float64, a         NaN
   b    1.788110
   c   -0.405908
   d   -0.801912
   dtype: float64)
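The session above covers the outer, inner and left joins. For completeness, here is a small sketch of join='right' using fixed values instead of random ones; the numbers are arbitrary, and iloc is used for the positional slices.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=["a", "b", "c", "d", "e"])
s1 = s.iloc[:4]  # labels a..d
s2 = s.iloc[1:]  # labels b..e

# join='right' conforms both results to the index of the passed object (s2).
left, right = s1.align(s2, join="right")

print(left)
print(right)
```

Here left gains a missing value at 'e' and loses 'a', while right is returned with its own index unchanged.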

For DataFrames, the join method will be applied to both the index and the columns by default:

  In [236]: df.align(df2, join='inner')
  Out[236]:
  (        one       two
   a -1.101558  1.124472
   b -0.177289  2.487104
   c  0.462215 -0.486066,         one       two
   a -1.101558  1.124472
   b -0.177289  2.487104
   c  0.462215 -0.486066)

You can also pass an axis option to only align on the specified axis:

  In [237]: df.align(df2, join='inner', axis=0)
  Out[237]:
  (        one       two     three
   a -1.101558  1.124472       NaN
   b -0.177289  2.487104 -0.634293
   c  0.462215 -0.486066  1.931194,         one       two
   a -1.101558  1.124472
   b -0.177289  2.487104
   c  0.462215 -0.486066)

If you pass a Series to DataFrame.align(), you can choose to align both objects either on the DataFrame’s index or columns using the axis argument:

  In [238]: df.align(df2.iloc[0], axis=1)
  Out[238]:
  (        one     three       two
   a -1.101558       NaN  1.124472
   b -0.177289 -0.634293  2.487104
   c  0.462215  1.931194 -0.486066
   d       NaN -1.222918 -0.456288, one     -1.101558
   three         NaN
   two      1.124472
   Name: a, dtype: float64)
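The example above aligns a Series with the DataFrame's columns. The other choice, axis=0, can be sketched as follows; the frame df and the Series weights are hypothetical, with made-up values.

```python
import pandas as pd

# Hypothetical frame and a per-row Series that only partly overlaps its index.
df = pd.DataFrame({"one": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
weights = pd.Series([0.5, 2.0], index=["b", "d"])

# axis=0 aligns the Series with the DataFrame's row labels
# (join defaults to 'outer', so the result covers a, b, c and d).
df_al, w_al = df.align(weights, axis=0)

# Once aligned, row-wise arithmetic lines up label by label.
scaled = df_al.mul(w_al, axis=0)
print(scaled)
```

Only row 'b' has both a data value and a weight, so it is the only row of scaled that is not missing.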

Filling while reindexing

reindex() takes an optional parameter method which is a filling method chosen from the following table:

  Method            Action
  pad / ffill       Fill values forward
  bfill / backfill  Fill values backward
  nearest           Fill from the nearest index value

We illustrate these fill methods on a simple Series:

  In [239]: rng = pd.date_range('1/3/2000', periods=8)

  In [240]: ts = pd.Series(np.random.randn(8), index=rng)

  # Select by position explicitly; plain [] with integers is label-based
  # in newer pandas releases.
  In [241]: ts2 = ts.iloc[[0, 3, 6]]

  In [242]: ts
  Out[242]:
  2000-01-03    0.466284
  2000-01-04   -0.457411
  2000-01-05   -0.364060
  2000-01-06    0.785367
  2000-01-07   -1.463093
  2000-01-08    1.187315
  2000-01-09   -0.493153
  2000-01-10   -1.323445
  Freq: D, dtype: float64

  In [243]: ts2
  Out[243]:
  2000-01-03    0.466284
  2000-01-06    0.785367
  2000-01-09   -0.493153
  dtype: float64

  In [244]: ts2.reindex(ts.index)
  Out[244]:
  2000-01-03    0.466284
  2000-01-04         NaN
  2000-01-05         NaN
  2000-01-06    0.785367
  2000-01-07         NaN
  2000-01-08         NaN
  2000-01-09   -0.493153
  2000-01-10         NaN
  Freq: D, dtype: float64

  In [245]: ts2.reindex(ts.index, method='ffill')
  Out[245]:
  2000-01-03    0.466284
  2000-01-04    0.466284
  2000-01-05    0.466284
  2000-01-06    0.785367
  2000-01-07    0.785367
  2000-01-08    0.785367
  2000-01-09   -0.493153
  2000-01-10   -0.493153
  Freq: D, dtype: float64

  In [246]: ts2.reindex(ts.index, method='bfill')
  Out[246]:
  2000-01-03    0.466284
  2000-01-04    0.785367
  2000-01-05    0.785367
  2000-01-06    0.785367
  2000-01-07   -0.493153
  2000-01-08   -0.493153
  2000-01-09   -0.493153
  2000-01-10         NaN
  Freq: D, dtype: float64

  In [247]: ts2.reindex(ts.index, method='nearest')
  Out[247]:
  2000-01-03    0.466284
  2000-01-04    0.466284
  2000-01-05    0.785367
  2000-01-06    0.785367
  2000-01-07    0.785367
  2000-01-08   -0.493153
  2000-01-09   -0.493153
  2000-01-10   -0.493153
  Freq: D, dtype: float64

These methods require that the indexes are ordered increasing or decreasing.

Note that the same result could have been achieved using fillna (except for method='nearest') or interpolate:

  In [248]: ts2.reindex(ts.index).fillna(method='ffill')
  Out[248]:
  2000-01-03    0.466284
  2000-01-04    0.466284
  2000-01-05    0.466284
  2000-01-06    0.785367
  2000-01-07    0.785367
  2000-01-08    0.785367
  2000-01-09   -0.493153
  2000-01-10   -0.493153
  Freq: D, dtype: float64
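The equivalence holds when the source data has no missing values of its own. The sketch below, with made-up integer-labelled Series, shows both the matching case and one where the two routes differ; it uses Series.ffill(), the method spelling of fillna(method='ffill').

```python
import numpy as np
import pandas as pd

full = [0, 1, 2, 3, 4]

# No missing values in the source: both routes agree.
clean = pd.Series([1.0, 2.0, 3.0], index=[0, 2, 4])
one_step = clean.reindex(full, method="ffill")
two_step = clean.reindex(full).ffill()
agree = one_step.equals(two_step)

# A NaN already stored in the source: the routes differ, because filling
# while reindexing compares labels only and never inspects the values.
gappy = pd.Series([1.0, np.nan, 3.0], index=[0, 2, 4])
by_label = gappy.reindex(full, method="ffill")  # labels 2 and 3 stay NaN
by_value = gappy.reindex(full).ffill()          # labels 2 and 3 become 1.0

print(agree)
print(by_label)
print(by_value)
```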

When a fill method is given, reindex() raises a ValueError if the index is not monotonically increasing or decreasing. fillna() and interpolate() do not perform any checks on the order of the index.
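The ordering requirement can be seen on a small Series with a shuffled integer index. The data below is invented for the example.

```python
import pandas as pd

# Index labels are neither increasing nor decreasing.
unordered = pd.Series([1.0, 2.0, 3.0], index=[3, 1, 2])

# Plain reindexing needs no ordering, only unique labels.
plain = unordered.reindex([1, 2, 3, 4])

# Asking for a fill method on the unordered index raises ValueError.
try:
    unordered.reindex([1, 2, 3, 4], method="ffill")
    raised = False
except ValueError:
    raised = True

# Sorting the index first makes the forward fill well defined.
filled = unordered.sort_index().reindex([1, 2, 3, 4], method="ffill")

print(raised)
print(filled)
```

After sorting, label 4 is filled from label 3, the last label that precedes it.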

Limits on filling while reindexing

The limit and tolerance arguments provide additional control over filling while reindexing. limit specifies the maximum number of consecutive labels that may be filled:

  In [249]: ts2.reindex(ts.index, method='ffill', limit=1)
  Out[249]:
  2000-01-03    0.466284
  2000-01-04    0.466284
  2000-01-05         NaN
  2000-01-06    0.785367
  2000-01-07    0.785367
  2000-01-08         NaN
  2000-01-09   -0.493153
  2000-01-10   -0.493153
  Freq: D, dtype: float64

In contrast, tolerance specifies the maximum distance between the index and indexer values:

  In [250]: ts2.reindex(ts.index, method='ffill', tolerance='1 day')
  Out[250]:
  2000-01-03    0.466284
  2000-01-04    0.466284
  2000-01-05         NaN
  2000-01-06    0.785367
  2000-01-07    0.785367
  2000-01-08         NaN
  2000-01-09   -0.493153
  2000-01-10   -0.493153
  Freq: D, dtype: float64

Notice that when used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance will be coerced into a Timedelta if possible. This allows you to specify tolerance with appropriate strings.
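As a sketch of both points, the example below first uses a plain numeric tolerance on an integer index, then checks that a string tolerance and an explicit Timedelta give the same result on a DatetimeIndex. All data values are made up.

```python
import pandas as pd

# Numeric index: tolerance is expressed in the same units as the labels.
readings = pd.Series([10.0, 20.0, 30.0], index=[0, 5, 10])
near = readings.reindex([1, 4, 7, 10], method="nearest", tolerance=1)
# 1 -> label 0 (distance 1), 4 -> label 5 (distance 1),
# 7 -> closest label is 5 (distance 2 > 1, so NaN), 10 -> exact match.

# Datetime index: a string such as '1 day' is coerced to a Timedelta.
rng = pd.date_range("2000-01-03", periods=4)
ts = pd.Series([1.0, 4.0], index=rng[[0, 3]])
by_string = ts.reindex(rng, method="ffill", tolerance="1 day")
by_timedelta = ts.reindex(rng, method="ffill", tolerance=pd.Timedelta(days=1))

print(near)
print(by_string.equals(by_timedelta))
```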

Dropping labels from an axis

A method closely related to reindex is drop(). It removes a set of labels from an axis:

  In [251]: df
  Out[251]:
          one       two     three
  a -1.101558  1.124472       NaN
  b -0.177289  2.487104 -0.634293
  c  0.462215 -0.486066  1.931194
  d       NaN -0.456288 -1.222918

  In [252]: df.drop(['a', 'd'], axis=0)
  Out[252]:
          one       two     three
  b -0.177289  2.487104 -0.634293
  c  0.462215 -0.486066  1.931194

  In [253]: df.drop(['one'], axis=1)
  Out[253]:
          two     three
  a  1.124472       NaN
  b  2.487104 -0.634293
  c -0.486066  1.931194
  d -0.456288 -1.222918

Note that the following also works, but is a bit less obvious / clean:

  In [254]: df.reindex(df.index.difference(['a', 'd']))
  Out[254]:
          one       two     three
  b -0.177289  2.487104 -0.634293
  c  0.462215 -0.486066  1.931194
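The two spellings agree when every label to remove is present, but they treat absent labels differently. A short sketch with a made-up frame follows; the label 'z' is chosen only because it does not exist in the index.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0]}, index=["a", "b", "c", "d"])

# Both forms remove rows 'a' and 'd'.
dropped = df.drop(["a", "d"], axis=0)
via_reindex = df.reindex(df.index.difference(["a", "d"]))
same = dropped.equals(via_reindex)

# A label that is not in the index: drop() raises KeyError by default,
# while the difference-based form simply has nothing to remove.
try:
    df.drop(["z"], axis=0)
    drop_raised = False
except KeyError:
    drop_raised = True
quiet = df.reindex(df.index.difference(["z"]))

print(same, drop_raised)
```

Note also that Index.difference() returns its result sorted where possible, so the reindex form may reorder rows of a frame whose index is not already sorted.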

Renaming / mapping labels

The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

  In [255]: s
  Out[255]:
  a    0.505453
  b    1.788110
  c   -0.405908
  d   -0.801912
  e    0.768460
  dtype: float64

  In [256]: s.rename(str.upper)
  Out[256]:
  A    0.505453
  B    1.788110
  C   -0.405908
  D   -0.801912
  E    0.768460
  dtype: float64

If you pass a function, it must return a value when called with any of the labels (and must produce a set of unique values). A dict or Series can also be used:

  In [257]: df.rename(columns={'one': 'foo', 'two': 'bar'},
     .....:           index={'a': 'apple', 'b': 'banana', 'd': 'durian'})
     .....:
  Out[257]:
               foo       bar     three
  apple  -1.101558  1.124472       NaN
  banana -0.177289  2.487104 -0.634293
  c       0.462215 -0.486066  1.931194
  durian       NaN -0.456288 -1.222918

If the mapping doesn’t include a column/index label, it isn’t renamed. Note that extra labels in the mapping don’t throw an error.
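Both behaviours can be checked on a small made-up frame. The key 'missing' below is a deliberately non-existent column name.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0], "two": [3.0, 4.0]}, index=["a", "b"])

# 'two' is not in the mapping, so it keeps its name;
# 'missing' matches no column and is silently ignored.
renamed = df.rename(columns={"one": "foo", "missing": "bar"})

# A function is applied to every label on the chosen axis.
upper = df.rename(index=str.upper)

print(list(renamed.columns))
print(list(upper.index))
```

The original df is left untouched in both cases, since rename() returns a new object by default.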

New in version 0.21.0.

DataFrame.rename() also supports an “axis-style” calling convention, where you specify a single mapper and the axis to apply that mapping to.

  In [258]: df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')
  Out[258]:
          foo       bar     three
  a -1.101558  1.124472       NaN
  b -0.177289  2.487104 -0.634293
  c  0.462215 -0.486066  1.931194
  d       NaN -0.456288 -1.222918

  In [259]: df.rename({'a': 'apple', 'b': 'banana', 'd': 'durian'}, axis='index')
  Out[259]:
               one       two     three
  apple  -1.101558  1.124472       NaN
  banana -0.177289  2.487104 -0.634293
  c       0.462215 -0.486066  1.931194
  durian       NaN -0.456288 -1.222918

The rename() method also provides an inplace named parameter that is by default False and copies the underlying data. Pass inplace=True to rename the data in place.
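A brief sketch of the difference, using a made-up one-column frame:

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0]}, index=["a", "b"])

# Default (inplace=False): a new object is returned, df is unchanged.
copy = df.rename(columns={"one": "foo"})
before = list(df.columns)

# inplace=True: df itself is modified and the call returns None.
ret = df.rename(columns={"one": "foo"}, inplace=True)
after = list(df.columns)

print(before, after, ret)
```

Because the in-place call returns None, it cannot be chained with further method calls.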

New in version 0.18.0.

Finally, rename() also accepts a scalar or list-like for altering the Series.name attribute.

  In [260]: s.rename("scalar-name")
  Out[260]:
  a    0.505453
  b    1.788110
  c   -0.405908
  d   -0.801912
  e    0.768460
  Name: scalar-name, dtype: float64
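The argument type decides what gets renamed: a scalar sets the name, while a mapping or function relabels the index. A minimal sketch with made-up data:

```python
import pandas as pd

s = pd.Series([1.0, 2.0], index=["a", "b"])

# Scalar argument: only Series.name changes, the index is untouched.
named = s.rename("scalar-name")

# Mapping argument: index labels change, the name is untouched.
relabeled = s.rename({"a": "x"})

print(named.name, list(named.index))
print(relabeled.name, list(relabeled.index))
```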

The Panel class has a related rename_axis() method, which can rename any of its three axes.