Finding and removing duplicate records

Problem

You want to find and/or remove duplicate entries from a vector or data frame.

Solution

With vectors:

  # Generate a vector
  set.seed(158)
  x <- round(rnorm(20, 10, 5))
  x
  #>  [1] 14 11  8  4 12  5 10 10  3  3 11  6  0 16  8 10  8  5  6  6

  # For each element: is this one a duplicate? (The first instance of a
  # particular value is not counted as a duplicate.)
  duplicated(x)
  #>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE
  #> [15]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

  # The values of the duplicated entries
  # Note that 6, 8, and 10 each appear three times in the original vector,
  # so each of them has two entries here.
  x[duplicated(x)]
  #> [1] 10  3 11  8 10  8  5  6  6

  # Duplicated entries, without repeats
  unique(x[duplicated(x)])
  #> [1] 10  3 11  8  5  6

  # The original vector with all duplicates removed. These do the same:
  unique(x)
  #>  [1] 14 11  8  4 12  5 10  3  6  0 16
  x[!duplicated(x)]
  #>  [1] 14 11  8  4 12  5 10  3  6  0 16
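
Because duplicated() never flags the first instance of a value, it cannot by itself select every copy of a repeated value. One possible approach, sketched here with the fromLast argument of base R's duplicated(), is to combine a forward and a backward pass. The variable name all_dups is only illustrative, and the vector is typed in literally so the example does not depend on the random number generator.

```r
# Same values as the vector x used above, entered literally
x <- c(14, 11, 8, 4, 12, 5, 10, 10, 3, 3, 11, 6, 0, 16, 8, 10, 8, 5, 6, 6)

# TRUE for every element whose value occurs more than once,
# including the first occurrence (all_dups is an illustrative name)
all_dups <- duplicated(x) | duplicated(x, fromLast = TRUE)

# Every copy of every repeated value
x[all_dups]

# Values that occur exactly once in the vector
x[!all_dups]
#> [1] 14  4 12  0 16
```

The forward pass marks every copy except the first, and the backward pass marks every copy except the last, so their union marks all copies.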

With data frames:

  # A sample data frame:
  df <- read.table(header=TRUE, text='
  label value
  A 4
  B 3
  C 6
  B 3
  B 1
  A 2
  A 4
  A 4
  ')

  # Is each row a repeat?
  duplicated(df)
  #> [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE

  # Show the repeat entries
  df[duplicated(df),]
  #>   label value
  #> 4     B     3
  #> 7     A     4
  #> 8     A     4

  # Show unique repeat entries (row names may differ, but values are the same)
  unique(df[duplicated(df),])
  #>   label value
  #> 4     B     3
  #> 7     A     4

  # Original data with repeats removed. These do the same:
  unique(df)
  #>   label value
  #> 1     A     4
  #> 2     B     3
  #> 3     C     6
  #> 5     B     1
  #> 6     A     2
  df[!duplicated(df),]
  #>   label value
  #> 1     A     4
  #> 2     B     3
  #> 3     C     6
  #> 5     B     1
  #> 6     A     2
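
The data frame examples above treat a row as a repeat only when every column matches. As a rough sketch of two common variations, the code below checks for duplicates on a single column and also selects every copy of a repeated row, first occurrences included. It uses only base R's duplicated(); the names first_per_label and all_copies are illustrative and not part of the original recipe.

```r
# Same sample data as above
df <- read.table(header = TRUE, text = '
label value
A 4
B 3
C 6
B 3
B 1
A 2
A 4
A 4
')

# Keep only the first row seen for each label, ignoring the value column
first_per_label <- df[!duplicated(df$label), ]
first_per_label
#>   label value
#> 1     A     4
#> 2     B     3
#> 3     C     6

# All rows that have an exact copy elsewhere, including the first occurrence
all_copies <- df[duplicated(df) | duplicated(df, fromLast = TRUE), ]
all_copies
#>   label value
#> 1     A     4
#> 2     B     3
#> 4     B     3
#> 7     A     4
#> 8     A     4
```

To treat several columns together as the key, pass a data frame of just those columns to duplicated(), for example duplicated(df[, c("label", "value")]).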