Repeated DNA Sequences

Description

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:

["AAAAACCCCC", "CCCCCAAAAA"].

Analysis

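The first approach that comes to mind is straightforward: slide a window of length 10 across the string from left to right, put each substring into a HashMap, and increment its counter. At the end, output every string in the HashMap whose counter is greater than 1. The time complexity is O(n), treating the window length 10 as a constant. Because the HashMap stores every length-10 substring, the space complexity is O(10n).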

Because the string contains only four characters, A, C, G, and T, each character can be mapped to 2 bits:

  1. A -> 00
  2. C -> 01
  3. G -> 10
  4. T -> 11

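Each length-10 string therefore maps to 20 bits, which is fewer than 32, so it fits in a single int. The time complexity of this method is still O(n), but the space complexity drops from O(10n) to O(n), because each key is now one integer instead of a 10-character string.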
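
To make the mapping concrete, here is a minimal sketch of the encoding step on its own. The class name `DnaEncoding` and the methods `code` and `encode` are hypothetical names chosen for illustration, and the sketch assumes the input contains only A, C, G, and T.

```java
// Hypothetical helper class for illustration; not part of the original solutions.
final class DnaEncoding {
    static final int LEN = 10;

    // Map one nucleotide to its 2-bit code: A=00, C=01, G=10, T=11.
    static int code(char ch) {
        switch (ch) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + ch);
        }
    }

    // Pack a 10-letter sequence into the low 20 bits of an int,
    // with the first letter ending up in the highest 2 of those bits.
    static int encode(String seq) {
        if (seq.length() != LEN) {
            throw new IllegalArgumentException("expected length " + LEN);
        }
        int x = 0;
        for (int i = 0; i < LEN; ++i) {
            x = (x << 2) | code(seq.charAt(i));
        }
        return x;
    }
}
```

For instance, `DnaEncoding.encode("AAAAACCCCC")` yields 341 (binary 0101010101, since the five leading A's contribute only zero bits), and `DnaEncoding.encode("CCCCCAAAAA")` yields 341 << 10 = 349184. Distinct sequences always get distinct integers, which is why the mapping can serve as a collision-free hash.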

Solution 1: Brute force

    // Repeated DNA Sequences
    // Time Complexity: O(n), Space Complexity: O(10n)
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class Solution {
        public List<String> findRepeatedDnaSequences(String s) {
            final List<String> result = new ArrayList<>();
            if (s.length() < 10) return result;

            // Count every length-10 window.
            final Map<String, Integer> counter = new HashMap<>();
            for (int i = 0; i < s.length() - 9; ++i) {
                final String key = s.substring(i, i + 10);
                int value = counter.getOrDefault(key, 0);
                counter.put(key, value + 1);
            }

            // Collect the windows seen more than once (HashMap order is unspecified).
            for (Map.Entry<String, Integer> entry : counter.entrySet()) {
                if (entry.getValue() > 1) {
                    result.add(entry.getKey());
                }
            }
            return result;
        }
    }

Solution 2: Perfect hashing

    // Repeated DNA Sequences
    // Time Complexity: O(n), Space Complexity: O(n)
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class Solution {
        private static final int LEN = 10;

        public List<String> findRepeatedDnaSequences(String s) {
            final List<String> result = new ArrayList<>();
            if (s.length() < LEN) return result;

            // Nucleotide -> 2-bit code, and the inverse mapping.
            final Map<Character, Integer> charMap = new HashMap<>();
            charMap.put('A', 0);
            charMap.put('C', 1);
            charMap.put('G', 2);
            charMap.put('T', 3);
            final Map<Integer, Character> intMap = new HashMap<>();
            intMap.put(0, 'A');
            intMap.put(1, 'C');
            intMap.put(2, 'G');
            intMap.put(3, 'T');

            // Count every window by its 20-bit integer code.
            final Map<Integer, Integer> counter = new HashMap<>();
            for (int i = 0; i < s.length() - LEN + 1; ++i) {
                final String key = s.substring(i, i + LEN);
                final int hashValue = strToInt(key, charMap);
                counter.put(hashValue, counter.getOrDefault(hashValue, 0) + 1);
            }

            // Decode the codes seen more than once back into strings.
            for (Map.Entry<Integer, Integer> entry : counter.entrySet()) {
                if (entry.getValue() > 1) {
                    result.add(intToStr(entry.getKey(), intMap));
                }
            }
            return result;
        }

        // perfect hash, no collisions
        private static int strToInt(String s, Map<Character, Integer> charMap) {
            assert s.length() == LEN;
            int x = 0;
            for (int i = 0; i < LEN; ++i) {
                final char ch = s.charAt(i);
                x = (x << 2) + charMap.get(ch);
            }
            return x;
        }

        // Inverse of strToInt: emit 2 bits at a time, lowest bits first, then reverse.
        private static String intToStr(int x, Map<Integer, Character> intMap) {
            final StringBuilder sb = new StringBuilder();
            while (x > 0) {
                final char ch = intMap.get(x & 3);
                sb.append(ch);
                x >>= 2;
            }
            // Leading A's encode as zero bits, so pad them back in.
            while (sb.length() < LEN) sb.append(intMap.get(0));
            return sb.reverse().toString();
        }
    }
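
Solution 2 still builds a substring for every window before hashing it. As a possible variant, not the original author's method, the 20-bit code can be updated incrementally: shift in the 2 bits of the new character and mask off the bits of the character that left the window. The sketch below uses hypothetical names (`RollingDna`, `findRepeated`), replaces the counter map with two sets, and assumes the input contains only A, C, G, and T.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class for illustration; a rolling variant of the 2-bit encoding.
final class RollingDna {
    private static final int LEN = 10;
    private static final int MASK = (1 << (2 * LEN)) - 1; // keeps the low 20 bits

    static List<String> findRepeated(String s) {
        final List<String> result = new ArrayList<>();
        final Set<Integer> seen = new HashSet<>();     // codes seen at least once
        final Set<Integer> reported = new HashSet<>(); // codes already added to result
        int code = 0;
        for (int i = 0; i < s.length(); ++i) {
            // "ACGT".indexOf gives A=0, C=1, G=2, T=3 (assumes valid input).
            code = ((code << 2) | "ACGT".indexOf(s.charAt(i))) & MASK;
            if (i < LEN - 1) continue; // window not full yet
            // Set.add returns false if the element was already present.
            if (!seen.add(code) && reported.add(code)) {
                result.add(s.substring(i - LEN + 1, i + 1));
            }
        }
        return result;
    }
}
```

With the example input from the description, `RollingDna.findRepeated("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT")` returns `[AAAAACCCCC, CCCCCAAAAA]`, listed in the order in which each sequence is seen for the second time. A substring is created only when a repeat is reported, so no decoding step like intToStr is needed.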

Source: https://soulmachine.gitbooks.io/algorithm-essentials/content/java/bitwise-operations/repeated-dna-sequences.html