Duplicate Data

If you want to identify and remove duplicate rows in a DataFrame, there are two methods that will help: duplicated and drop_duplicates. Each takes as an argument the columns to use when identifying duplicated rows.

  • duplicated returns a boolean vector whose length is the number of rows, and which indicates whether a row is duplicated.
  • drop_duplicates removes duplicate rows.

By default, the first observed row of a duplicate set is considered unique, but each method has a keep parameter to specify which occurrences to keep.

  • keep='first' (default): mark / drop duplicates except for the first occurrence.
  • keep='last': mark / drop duplicates except for the last occurrence.
  • keep=False: mark / drop all duplicates.
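As a minimal, self-contained sketch of the three keep options (the Series values here are illustrative, not from the examples below):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'a', 'b'])

# keep='first' (default): every occurrence after the first is flagged
first = s.duplicated()            # [False, False, True, True, True]

# keep='last': every occurrence before the last is flagged
last = s.duplicated(keep='last')  # [True, True, True, False, False]

# keep=False: every member of a duplicate set is flagged
none = s.duplicated(keep=False)   # [True, True, True, True, True]

print(first.tolist())
print(last.tolist())
print(none.tolist())
```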
  In [268]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
     .....:                     'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
     .....:                     'c': np.random.randn(7)})
     .....:

  In [269]: df2
  Out[269]:
         a  b         c
  0    one  x -1.067137
  1    one  y  0.309500
  2    two  x -0.211056
  3    two  y -1.842023
  4    two  x -0.390820
  5  three  x -1.964475
  6   four  x  1.298329

  In [270]: df2.duplicated('a')
  Out[270]:
  0    False
  1     True
  2    False
  3     True
  4     True
  5    False
  6    False
  dtype: bool

  In [271]: df2.duplicated('a', keep='last')
  Out[271]:
  0     True
  1    False
  2     True
  3     True
  4    False
  5    False
  6    False
  dtype: bool

  In [272]: df2.duplicated('a', keep=False)
  Out[272]:
  0     True
  1     True
  2     True
  3     True
  4     True
  5    False
  6    False
  dtype: bool

  In [273]: df2.drop_duplicates('a')
  Out[273]:
         a  b         c
  0    one  x -1.067137
  2    two  x -0.211056
  5  three  x -1.964475
  6   four  x  1.298329

  In [274]: df2.drop_duplicates('a', keep='last')
  Out[274]:
         a  b         c
  1    one  y  0.309500
  4    two  x -0.390820
  5  three  x -1.964475
  6   four  x  1.298329

  In [275]: df2.drop_duplicates('a', keep=False)
  Out[275]:
         a  b         c
  5  three  x -1.964475
  6   four  x  1.298329
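A useful way to see how the two methods relate, sketched here on a rebuilt df2 (the random values in column 'c' will differ from the output above): drop_duplicates is equivalent to boolean indexing with the inverse of the duplicated mask.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
                    'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
                    'c': np.random.randn(7)})

# Masking out the rows flagged by duplicated gives the same frame
# as drop_duplicates, for any value of keep
masked = df2[~df2.duplicated('a')]
dropped = df2.drop_duplicates('a')
print(masked.equals(dropped))  # True
```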

You can also pass a list of columns to identify duplicated rows.

  In [276]: df2.duplicated(['a', 'b'])
  Out[276]:
  0    False
  1    False
  2    False
  3    False
  4     True
  5    False
  6    False
  dtype: bool

  In [277]: df2.drop_duplicates(['a', 'b'])
  Out[277]:
         a  b         c
  0    one  x -1.067137
  1    one  y  0.309500
  2    two  x -0.211056
  3    two  y -1.842023
  5  three  x -1.964475
  6   four  x  1.298329
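To make the effect of the column list explicit, here is a small illustrative frame (not the df2 above): the fewer columns you pass, the more rows can collide, and with no argument at all every column participates in the comparison.

```python
import pandas as pd

df = pd.DataFrame({'a': ['one', 'one', 'two'],
                   'b': ['x', 'y', 'x'],
                   'c': [1, 1, 2]})

# Only column 'a': rows 0 and 1 collide on 'a' alone
print(df.duplicated(['a']).tolist())       # [False, True, False]

# Columns 'a' and 'b': no row repeats the full (a, b) pair
print(df.duplicated(['a', 'b']).tolist())  # [False, False, False]

# No argument: all columns are compared, so no row is a duplicate here
print(df.duplicated().tolist())            # [False, False, False]
```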

To drop duplicates by index value, use Index.duplicated and then perform slicing. The same set of options is available for the keep parameter.

  In [278]: df3 = pd.DataFrame({'a': np.arange(6),
     .....:                     'b': np.random.randn(6)},
     .....:                    index=['a', 'a', 'b', 'c', 'b', 'a'])
     .....:

  In [279]: df3
  Out[279]:
     a         b
  a  0  1.440455
  a  1  2.456086
  b  2  1.038402
  c  3 -0.894409
  b  4  0.683536
  a  5  3.082764

  In [280]: df3.index.duplicated()
  Out[280]: array([False,  True, False, False,  True,  True], dtype=bool)

  In [281]: df3[~df3.index.duplicated()]
  Out[281]:
     a         b
  a  0  1.440455
  b  2  1.038402
  c  3 -0.894409

  In [282]: df3[~df3.index.duplicated(keep='last')]
  Out[282]:
     a         b
  c  3 -0.894409
  b  4  0.683536
  a  5  3.082764

  In [283]: df3[~df3.index.duplicated(keep=False)]
  Out[283]:
     a         b
  c  3 -0.894409
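For the keep='first' case specifically, an alternative sketch (not part of the text above, and assuming no NA values in the frame) is a groupby on the index level: groupby(level=0, sort=False).first() also keeps the first row per index label, in first-appearance order, matching the Index.duplicated mask.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({'a': np.arange(6),
                    'b': np.random.randn(6)},
                   index=['a', 'a', 'b', 'c', 'b', 'a'])

# First row per index label, in order of first appearance
via_groupby = df3.groupby(level=0, sort=False).first()

# Same selection via the boolean index mask
via_mask = df3[~df3.index.duplicated()]

print(via_groupby.equals(via_mask))  # True
```

The mask form is usually preferable: it avoids the group aggregation machinery and generalizes directly to keep='last' and keep=False.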