排序

Pandas supports three kinds of sorting: sorting by index labels, sorting by column values, and sorting by a combination of both.

By Index

The Series.sort_index() and DataFrame.sort_index() methods are used to sort a pandas object by its index levels.

  1. In [307]: df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
  2. .....: 'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
  3. .....: 'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
  4. .....:
  5. In [308]: unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
  6. .....: columns=['three', 'two', 'one'])
  7. .....:
  8. In [309]: unsorted_df
  9. Out[309]:
  10. three two one
  11. a NaN 0.708543 0.036274
  12. d -0.540166 0.586626 NaN
  13. c 0.410238 1.121731 1.044630
  14. b -0.282532 -2.038777 -0.490032
  15. # DataFrame
  16. In [310]: unsorted_df.sort_index()
  17. Out[310]:
  18. three two one
  19. a NaN 0.708543 0.036274
  20. b -0.282532 -2.038777 -0.490032
  21. c 0.410238 1.121731 1.044630
  22. d -0.540166 0.586626 NaN
  23. In [311]: unsorted_df.sort_index(ascending=False)
  24. Out[311]:
  25. three two one
  26. d -0.540166 0.586626 NaN
  27. c 0.410238 1.121731 1.044630
  28. b -0.282532 -2.038777 -0.490032
  29. a NaN 0.708543 0.036274
  30. In [312]: unsorted_df.sort_index(axis=1)
  31. Out[312]:
  32. one three two
  33. a 0.036274 NaN 0.708543
  34. d NaN -0.540166 0.586626
  35. c 1.044630 0.410238 1.121731
  36. b -0.490032 -0.282532 -2.038777
  37. # Series
  38. In [313]: unsorted_df['three'].sort_index()
  39. Out[313]:
  40. a NaN
  41. b -0.282532
  42. c 0.410238
  43. d -0.540166
  44. Name: three, dtype: float64

By Values

The Series.sort_values() method is used to sort a Series by its values. The DataFrame.sort_values() method is used to sort a DataFrame by its column or row values. The optional by parameter to DataFrame.sort_values() may used to specify one or more columns to use to determine the sorted order.

  1. In [314]: df1 = pd.DataFrame({'one':[2,1,1,1],'two':[1,3,2,4],'three':[5,4,3,2]})
  2. In [315]: df1.sort_values(by='two')
  3. Out[315]:
  4. one two three
  5. 0 2 1 5
  6. 2 1 2 3
  7. 1 1 3 4
  8. 3 1 4 2

The by parameter can take a list of column names, e.g.:

  1. In [316]: df1[['one', 'two', 'three']].sort_values(by=['one','two'])
  2. Out[316]:
  3. one two three
  4. 2 1 2 3
  5. 1 1 3 4
  6. 3 1 4 2
  7. 0 2 1 5

These methods have special treatment of NA values via the na_position argument:

  1. In [317]: s[2] = np.nan
  2. In [318]: s.sort_values()
  3. Out[318]:
  4. 0 A
  5. 3 Aaba
  6. 1 B
  7. 4 Baca
  8. 6 CABA
  9. 8 cat
  10. 7 dog
  11. 2 NaN
  12. 5 NaN
  13. dtype: object
  14. In [319]: s.sort_values(na_position='first')
  15. Out[319]:
  16. 2 NaN
  17. 5 NaN
  18. 0 A
  19. 3 Aaba
  20. 1 B
  21. 4 Baca
  22. 6 CABA
  23. 8 cat
  24. 7 dog
  25. dtype: object

By Indexes and Values

New in version 0.23.0.

Strings passed as the by parameter to DataFrame.sort_values() may refer to either columns or index level names.

  1. # Build MultiIndex
  2. In [320]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
  3. .....: ('b', 2), ('b', 1), ('b', 1)])
  4. .....:
  5. In [321]: idx.names = ['first', 'second']
  6. # Build DataFrame
  7. In [322]: df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
  8. .....: index=idx)
  9. .....:
  10. In [323]: df_multi
  11. Out[323]:
  12. A
  13. first second
  14. a 1 6
  15. 2 5
  16. 2 4
  17. b 2 3
  18. 1 2
  19. 1 1

Sort by ‘second’ (index) and ‘A’ (column)

  1. In [324]: df_multi.sort_values(by=['second', 'A'])
  2. Out[324]:
  3. A
  4. first second
  5. b 1 1
  6. 1 2
  7. a 1 6
  8. b 2 3
  9. a 2 4
  10. 2 5

Note: If a string matches both a column name and an index level name then a warning is issued and the column takes precedence. This will result in an ambiguity error in a future version.

searchsorted

Series has the searchsorted() method, which works similarly to numpy.ndarray.searchsorted().

  1. In [325]: ser = pd.Series([1, 2, 3])
  2. In [326]: ser.searchsorted([0, 3])
  3. Out[326]: array([0, 2])
  4. In [327]: ser.searchsorted([0, 4])
  5. Out[327]: array([0, 3])
  6. In [328]: ser.searchsorted([1, 3], side='right')
  7. Out[328]: array([1, 3])
  8. In [329]: ser.searchsorted([1, 3], side='left')
  9. Out[329]: array([0, 2])
  10. In [330]: ser = pd.Series([3, 1, 2])
  11. In [331]: ser.searchsorted([0, 3], sorter=np.argsort(ser))
  12. Out[331]: array([0, 2])

smallest / largest values

Series has the nsmallest() and nlargest() methods which return the smallest or largest n values. For a large Series this can be much faster than sorting the entire Series and calling head(n) on the result.

  1. In [332]: s = pd.Series(np.random.permutation(10))
  2. In [333]: s
  3. Out[333]:
  4. 0 8
  5. 1 2
  6. 2 9
  7. 3 5
  8. 4 6
  9. 5 0
  10. 6 1
  11. 7 7
  12. 8 4
  13. 9 3
  14. dtype: int64
  15. In [334]: s.sort_values()
  16. Out[334]:
  17. 5 0
  18. 6 1
  19. 1 2
  20. 9 3
  21. 8 4
  22. 3 5
  23. 4 6
  24. 7 7
  25. 0 8
  26. 2 9
  27. dtype: int64
  28. In [335]: s.nsmallest(3)
  29. Out[335]:
  30. 5 0
  31. 6 1
  32. 1 2
  33. dtype: int64
  34. In [336]: s.nlargest(3)
  35. Out[336]:
  36. 2 9
  37. 0 8
  38. 7 7
  39. dtype: int64

DataFrame also has the nlargest and nsmallest methods.

  1. In [337]: df = pd.DataFrame({'a': [-2, -1, 1, 10, 8, 11, -1],
  2. .....: 'b': list('abdceff'),
  3. .....: 'c': [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})
  4. .....:
  5. In [338]: df.nlargest(3, 'a')
  6. Out[338]:
  7. a b c
  8. 5 11 f 3.0
  9. 3 10 c 3.2
  10. 4 8 e NaN
  11. In [339]: df.nlargest(5, ['a', 'c'])
  12. Out[339]:
  13. a b c
  14. 6 -1 f 4.0
  15. 5 11 f 3.0
  16. 3 10 c 3.2
  17. 4 8 e NaN
  18. 2 1 d 4.0
  19. In [340]: df.nsmallest(3, 'a')
  20. Out[340]:
  21. a b c
  22. 0 -2 a 1.0
  23. 1 -1 b 2.0
  24. 6 -1 f 4.0
  25. In [341]: df.nsmallest(5, ['a', 'c'])
  26. Out[341]:
  27. a b c
  28. 0 -2 a 1.0
  29. 2 1 d 4.0
  30. 4 8 e NaN
  31. 1 -1 b 2.0
  32. 6 -1 f 4.0

Sorting by a multi-index column

You must be explicit about sorting when the column is a multi-index, and fully specify all levels to by.

  1. In [342]: df1.columns = pd.MultiIndex.from_tuples([('a','one'),('a','two'),('b','three')])
  2. In [343]: df1.sort_values(by=('a','two'))
  3. Out[343]:
  4. a b
  5. one two three
  6. 0 2 1 5
  7. 2 1 2 3
  8. 1 1 3 4
  9. 3 1 4 2