All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT”,
Return:
[“AAAAACCCCC”, “CCCCCAAAAA”].
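Since the task is to find duplicated elements, a map is the simplest tool. Using the strings themselves as keys, however, exceeds the memory limit, so a more compact key is needed. Because only four characters occur (A, C, G, T), each 10-letter substring can be read as a base-4 integer (2 bits per letter, 20 bits in total), and that integer serves as the map key. The code is as follows: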
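#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> result;
        int val[26] = {0};           // 2-bit code per letter: A=0, C=1, G=2, T=3
        int max[26] = {0};           // val << 20: a letter's value once shifted out of the 20-bit window
        unordered_map<int, int> mp;  // encoded window -> number of occurrences
        int key = 0;
        val['C'-'A'] = 1;
        val['G'-'A'] = 2;
        val['T'-'A'] = 3;
        max['C'-'A'] = 1 << 20;
        max['G'-'A'] = 2 << 20;
        max['T'-'A'] = 3 << 20;
        if (s.length() < 10) return result;
        // Encode the first 10 letters as a base-4 integer.
        for (size_t i = 0; i < 10; ++i) {
            key <<= 2;
            key += val[s[i]-'A'];
        }
        mp[key]++;
        // Slide the window: shift by one letter, drop the outgoing letter, add the incoming one.
        for (size_t i = 10; i < s.length(); ++i) {
            key <<= 2;
            key -= max[s[i-10]-'A'] - val[s[i]-'A'];
            mp[key]++;
            // Report a sequence only once, when its count first reaches 2.
            if (mp[key] == 2) result.push_back(s.substr(i-9, 10));
        }
        return result;
    }
};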
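Time complexity: O(n), since each character is processed once and each hash-map update takes O(1) time on average.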
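The base-4 mapping described above can also be looked at in isolation. The following is a minimal sketch, not the solution's own code: the names `nucleotideCode`, `encodeBase4` and `findRepeatedWithMask` are hypothetical, and the input is assumed to contain only the letters A, C, G and T. It also swaps in a different way of dropping the outgoing letter: instead of subtracting its shifted value, it keeps only the low 20 bits with a bit mask.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: 2-bit code for one nucleotide (A=0, C=1, G=2, T=3).
// Assumption: the input contains only the letters A, C, G and T.
inline std::uint32_t nucleotideCode(char c) {
    switch (c) {
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 0;  // 'A'
    }
}

// Hypothetical helper: read a sequence of at most 16 letters as a base-4 number.
inline std::uint32_t encodeBase4(const std::string& seq) {
    assert(seq.size() <= 16);  // 16 letters * 2 bits = 32 bits
    std::uint32_t key = 0;
    for (char c : seq) key = (key << 2) | nucleotideCode(c);
    return key;
}

// Mask-based variant of the sliding window: after shifting in each new
// letter, keep only the low 20 bits, which discards the outgoing letter.
inline std::vector<std::string> findRepeatedWithMask(const std::string& s) {
    const std::size_t kLen = 10;
    const std::uint32_t kMask = (1u << (2 * kLen)) - 1;  // 20 one-bits
    std::vector<std::string> result;
    if (s.size() < kLen) return result;

    std::unordered_map<std::uint32_t, int> count;
    std::uint32_t key = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        key = ((key << 2) | nucleotideCode(s[i])) & kMask;
        if (i + 1 < kLen) continue;  // window not full yet
        if (++count[key] == 2)       // report on the second sighting only
            result.push_back(s.substr(i + 1 - kLen, kLen));
    }
    return result;
}
```

Under these assumptions, `encodeBase4("ACGT")` is the binary number 00 01 10 11, and calling `findRepeatedWithMask` on the example string from the problem statement yields the same two sequences in the same order. The mask removes the need for a second lookup table, at the cost of one extra AND per character.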