LeetCode: Repeated DNA Sequences


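This article describes an efficient algorithm that finds repeated substrings in a DNA sequence by combining a binary encoding with a hash table. Each base is encoded as a 2-bit number, so every 10-character substring becomes a small integer that can be looked up in a table in constant time. Compared with storing the substrings themselves, this reduces memory use while keeping the running time linear, which matters for long sequences. The article explains the encoding scheme, the hash table design, and how the string is traversed, and gives code for each approach.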

Reposted from: http://www.tuicool.com/articles/AnuQJjA

hash table plus bit manipulation method

(See the problem's Show Tags; runtime 10 ms.)

Algorithm analysis

First, encode A, C, G, and T in binary:

A -> 00

C -> 01

G -> 10

T -> 11

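Under this encoding, every 10-character substring corresponds to one number, and a 10-character substring occupies 20 bits. An int is typically 4 bytes (32 bits), so a single int can represent one 10-character substring. For example: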

ACGTACGTAC -> 00011011000110110001

AAAAAAAAAA -> 00000000000000000000
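The two examples above can be reproduced with a short sketch. The helper names `baseCode` and `encodeSequence` are hypothetical (they do not appear in the original post), and the sketch assumes the input contains only the letters A, C, G, and T.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: 2-bit code of a single base.
// Assumes the input is one of 'A', 'C', 'G', 'T'.
inline std::uint32_t baseCode(char base) {
    switch (base) {
        case 'A': return 0; // 00
        case 'C': return 1; // 01
        case 'G': return 2; // 10
        default:  return 3; // 'T' -> 11
    }
}

// Hypothetical helper: pack a sequence of at most 16 bases into one integer,
// with the first base in the most significant position.
inline std::uint32_t encodeSequence(const std::string& seq) {
    std::uint32_t value = 0;
    for (char base : seq) {
        value = (value << 2) | baseCode(base);
    }
    return value;
}
```

With this sketch, `encodeSequence("ACGTACGTAC")` yields 0x1B1B1, which is the 20-bit pattern 00011011000110110001 shown above, and `encodeSequence("AAAAAAAAAA")` yields 0.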

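A 20-bit binary number has at most 2^20 combinations, so the hash table needs 2^20 entries, i.e. 1024 * 1024, and can be declared as bool hashTable[1024 * 1024]; (Note: if this overflows the stack, allocate the table dynamically instead, but be sure to free it at the end; otherwise the solution may exceed the memory limit, MLE.)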
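One way to follow the note about dynamic memory without freeing anything by hand is to let a standard container own the table. The sketch below is an assumption, not the original author's choice: it uses `std::vector<bool>`, and the names `kTableSize` and `makeSeenTable` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// One flag per possible 20-bit code: 2^20 = 1024 * 1024 entries.
constexpr std::size_t kTableSize = std::size_t{1} << 20;

// Returns a heap-backed table with every flag initially false.
// The storage is released automatically when the vector is destroyed.
inline std::vector<bool> makeSeenTable() {
    return std::vector<bool>(kTableSize, false);
}
```

Because the vector releases its storage when it goes out of scope, there is no explicit free step to forget.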

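Traversing the string

Each time the window moves one character to the right, the corresponding int value is shifted left by 2 bits, its lowest 2 bits are set to the code of the new character, and finally the 2 bits that were pushed above the 20-bit window (bits 20 and 21) are cleared to 0.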
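The three steps can be written as one small function. This is a minimal sketch; `slideWindow` and `kWindowMask` are hypothetical names, and the new base is assumed to be passed in already converted to its 2-bit code.

```cpp
#include <cstdint>

// Keeps only the low 20 bits, i.e. the codes of the 10 most recent bases.
constexpr std::uint32_t kWindowMask = 0xFFFFF;

// Hypothetical helper: advance the 10-base window by one base.
// newCode is the 2-bit code (0..3) of the incoming base.
inline std::uint32_t slideWindow(std::uint32_t value, std::uint32_t newCode) {
    value <<= 2;            // step 1: shift left by 2 bits
    value |= newCode & 0x3; // step 2: put the new code in the lowest 2 bits
    value &= kWindowMask;   // step 3: clear the bits above the 20-bit window
    return value;
}
```

For example, sliding the window ACGTACGTAC (0x1B1B1) over an incoming G gives 0x6C6C6, the code of CGTACGTACG. Masking with 0xFFFFF has the same effect as clearing bits 20 and 21 with `& ~0x300000` as long as the previous value fit in 20 bits.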

Time complexity

Traversing the string is O(n) and each hash table lookup is O(1), so the total time complexity is O(n).

Implementation

#include <cstddef>
#include <cstring>
#include <string>
#include <unordered_set>
#include <vector>

// One flag per possible 20-bit code of a 10-character sequence.
bool hashMap[1024 * 1024];

class Solution {
public:
    std::vector<std::string> findRepeatedDnaSequences(std::string s);
};

std::vector<std::string> Solution::findRepeatedDnaSequences(std::string s) {
    std::vector<std::string> rel;
    // A repeat needs at least two windows, i.e. at least 11 characters.
    if (s.length() <= 10) {
        return rel;
    }

    // Map each base to its 2-bit code, indexed by (letter - 'A').
    unsigned char convert[26] = {0};
    convert['A' - 'A'] = 0; // 00
    convert['C' - 'A'] = 1; // 01
    convert['G' - 'A'] = 2; // 10
    convert['T' - 'A'] = 3; // 11

    // The table is global, so reset it on every call.
    std::memset(hashMap, false, sizeof(hashMap));

    // Encode the first 10 characters as the initial window.
    int hashValue = 0;
    for (int pos = 0; pos < 10; ++pos) {
        hashValue <<= 2;
        hashValue |= convert[s[pos] - 'A'];
    }
    hashMap[hashValue] = true;

    // Codes of the sequences already added to the result.
    std::unordered_set<int> strHashValue;

    // Slide the window one character at a time.
    for (std::size_t pos = 10; pos < s.length(); ++pos) {
        hashValue <<= 2;
        hashValue |= convert[s[pos] - 'A'];
        hashValue &= ~(0x300000); // clear bits 20 and 21

        if (hashMap[hashValue]) {
            // Seen before: report it once.
            if (strHashValue.find(hashValue) == strHashValue.end()) {
                rel.push_back(s.substr(pos - 9, 10));
                strHashValue.insert(hashValue);
            }
        } else {
            hashMap[hashValue] = true;
        }
    }

    return rel;
}

Another explanation: http://blog.youkuaiyun.com/coderhuhy/article/details/43647731

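Thought process:

-- This is a problem of matching substrings against the string itself;

-- My first reaction was that there are three approaches. The first is to use a hash set (C++11's unordered_set) to store every substring of length 10. Concretely: construct unordered_set<string> repeated and traverse the input string; for each substring formed by s[i] through s[i+9], if it is not yet in repeated, insert it into repeated; if it is already in repeated, the substring has occurred before and meets the requirement, so store it in the answer, vector<string> answer;

Of course, this code ends up with memory limit exceeded. Here is the code first; the analysis follows: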

#include <string>
#include <unordered_set>
#include <vector>
using namespace std;

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> answer;
        unordered_set<string> repeated;
        for (size_t i = 0; i + 9 < s.size(); ++i) {
            string t(s, i, 10);
            unordered_set<string>::iterator it = repeated.find(t);
            if (it != repeated.end())
                answer.push_back(t); // found in repeated: the substring has occurred before, so store it in answer
            else
                repeated.insert(t); // not found in repeated: first occurrence, so store it in repeated
        }
        return answer;
    }
}; // memory limit exceeded

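Analysis:

The time complexity is O(n), where n is the length of the input string;

Likely cause of the MLE: for a very long input string, unordered_set<string> consumes a large amount of memory. In addition, the code above does not handle duplicate answers: a substring is pushed into answer every time it is found in repeated, so a substring that occurs three or more times is reported more than once, which is clearly wrong;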

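Improvements:

-- Since there are only four bases, A, C, G, and T, the four states 00, 01, 10, and 11 can stand in for them; for example, AAGCT becomes 00 00 10 01 11. Any substring of length 10 can therefore be converted into a 20-bit int value, hint. This turns unordered_set<string> repeated into unordered_set<int> repeated, which reduces the storage needed to some extent;

-- For deduplication, one option is to collect all the answers first and then remove duplicates with sort and unique, although that is slow and cumbersome; another is to build a second unordered_set<int> check that stores the hint values of the repeated substrings already added to answer;

-- It is worth noting that each step from s[i]->s[i+9] to s[i+1]->s[i+10] uses the following update: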

 hint = ((hint & eraser) << 2) + ati[s[i]];  

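Here eraser is a macro with the value 0x3ffff, binary 00111111111111111111. It erases the highest 2 of hint's 20 bits, which hold the code of the outgoing base s[i]; the result is then shifted left by 2, and finally the code of the incoming base s[i+10] is added. (In the snippet itself, the index i denotes the incoming base, so the window moves from s[i-10]->s[i-1] to s[i-9]->s[i].)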
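This erase-then-shift update produces the same value as the shift-then-clear update used in the first solution, provided hint fits in 20 bits and the code is in the range 0..3. The sketch below puts the two variants side by side; both function names are hypothetical.

```cpp
#include <cstdint>

// Variant from the first solution: shift, add the new code,
// then clear bits 20 and 21.
inline std::uint32_t updateShiftThenClear(std::uint32_t hint, std::uint32_t code) {
    return ((hint << 2) | code) & ~std::uint32_t{0x300000};
}

// Variant described here: erase the top 2 of the 20 bits first,
// then shift left by 2 and add the new code.
inline std::uint32_t updateEraseThenShift(std::uint32_t hint, std::uint32_t code) {
    return ((hint & 0x3ffff) << 2) + code;
}
```

The only difference is the order of operations: erasing first means the value never grows beyond 20 bits, while clearing afterwards lets it reach 22 bits for a moment.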

The code is as follows:

#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
using namespace std;

class Solution {
public:
    #define eraser 0x3ffff
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> answer;
        unsigned int hint = 0; // int value of the current length-10 substring
        if (s.size() < 10)
            return answer;
        unordered_set<unsigned int> repeated, check; // repeated stores the substrings seen so far; check prevents duplicate answers
        unordered_map<char, unsigned int> ati{{'A', 0}, {'C', 1}, {'G', 2}, {'T', 3}}; // ati maps each base to its code 00 01 10 11 (C++11 initializer syntax)
        for (int i = 0; i != 10; ++i) {
            hint = (hint << 2) + ati[s[i]]; // build the initial hint from the first 10 bases of s
        }
        repeated.insert(hint);
        for (size_t i = 10; i != s.size(); ++i) {
            hint = ((hint & eraser) << 2) + ati[s[i]]; // update hint as the window moves
            unordered_set<unsigned int>::iterator it = repeated.find(hint);
            if (it != repeated.end()) {
                it = check.find(hint);
                if (it == check.end()) {
                    answer.push_back(string(s, i - 9, 10));
                    check.insert(hint);
                }
            }
            else
                repeated.insert(hint);
        }
        return answer;
    }
};
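For comparison, the other deduplication option mentioned earlier, collecting every answer and then applying sort and unique, can be sketched as follows. The function name `removeDuplicateAnswers` is hypothetical, and the sketch assumes the order of the answers does not matter, since sorting rearranges them.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: remove duplicate answers in place.
// std::unique only removes adjacent duplicates, so the vector is sorted first.
inline void removeDuplicateAnswers(std::vector<std::string>& answers) {
    std::sort(answers.begin(), answers.end());
    answers.erase(std::unique(answers.begin(), answers.end()), answers.end());
}
```

This costs an extra O(m log m) string comparisons for m collected answers, which is one reason the second set is the lighter choice here.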

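Analysis:

At first I overlooked the precedence of the shift operator relative to the other operators, and the code kept failing. Only later did I notice that the outer pair of parentheses in

 hint = ((hint & eraser) << 2) + ati[s[i]];

was missing. Because + binds more tightly than <<, the unparenthesized form shifts by 2 + ati[s[i]] bits instead. That cost quite a bit of time, which really should not have happened.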
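The effect of the missing parentheses can be seen in a small sketch. In C++, + has higher precedence than <<, so `(hint & eraser) << 2 + code` is parsed as `(hint & eraser) << (2 + code)`. The function names below are hypothetical, and the second function spells out that parse explicitly rather than relying on it.

```cpp
#include <cstdint>

// Intended update: shift by 2, then add the new code.
inline std::uint32_t withOuterParentheses(std::uint32_t hint, std::uint32_t code) {
    return ((hint & 0x3ffff) << 2) + code;
}

// What the compiler computes when the outer parentheses are omitted:
// the shift amount becomes 2 + code.
inline std::uint32_t withoutOuterParentheses(std::uint32_t hint, std::uint32_t code) {
    return (hint & 0x3ffff) << (2 + code);
}
```

For hint = 1 and code = 3, the intended form gives 7, while the unparenthesized form gives 32.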