All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
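Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
Solution 1: Brute force (Memory Limit Exceeded). Scan the string from start to end and use a map to count how many times each 10-letter substring occurs. Any substring that occurs more than once belongs in the answer.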
vector<string> findRepeatedDnaSequences(string s) {
map<string,int> key;
vector<string> res;
if(s.size()<=10) return res;
for(int i=0;i<=s.size()-10;i++)
{
string tmp=s.substr(i,10);
key[tmp]++;
if(key[tmp]==2) res.push_back(tmp);
}
return res;
}
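Solution 2: Storing that many strings uses too much memory. Since there are only four distinct characters, each 10-letter string can be represented as a base-4 number instead. (72 ms)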
int ACGT2INT(char c)
{
switch(c)
{
case 'A': return 0;
case 'C': return 1;
case 'G': return 2;
case 'T': return 3;
}
return -1;
}
int DNA2INT(string& m)
{
const int MAX=10;
int res=0;
for(int i=0;i<MAX;i++)
{
res=res*4+ACGT2INT(m[i]);
}
return res;
}
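To make the base-4 encoding concrete, here is a small standalone sketch that mirrors ACGT2INT and DNA2INT above. The names baseToDigit and encodeWindow are hypothetical, chosen only for this illustration, and the sketch assumes the input holds at least 10 characters drawn from A, C, G, T.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper for illustration: maps A, C, G, T to the digits 0..3.
int baseToDigit(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    return -1; // not a valid nucleotide
}

// Reads the first 10 characters of seq as a 10-digit base-4 number.
// Assumption: seq has at least 10 characters, all from {A, C, G, T}.
int encodeWindow(const std::string& seq) {
    assert(seq.size() >= 10);
    int value = 0;
    for (int i = 0; i < 10; ++i) {
        value = value * 4 + baseToDigit(seq[i]); // append one base-4 digit
    }
    return value;
}
```

With this mapping, encodeWindow("AAAAACCCCC") is 341 (five leading zeros followed by 11111 in base 4), encodeWindow("CCCCCAAAAA") is 341 * 4^5 = 349184, and encodeWindow("TTTTTTTTTT") is 4^10 - 1 = 1048575, the largest possible code.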
vector<string> findRepeatedDnaSequences(string s) {
const int N=1048576;
int key[N];
memset(key,0,sizeof(key));
vector<string> res;
if(s.size()<=10) return res;
for(int i=0;i<=s.size()-10;i++)
{
string tmp=s.substr(i,10);
int idx=DNA2INT(tmp); // encode once and reuse the index
key[idx]++;
if(key[idx]==2) res.push_back(tmp);
}
return res;
}
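One possible refinement, which is not the version timed in this post, is to update the code incrementally instead of re-encoding every window: shift the previous code left by two bits, add the new digit, and mask off the digit that left the window. The sketch below uses a hypothetical function name, findRepeatedDnaRolling, assumes s contains only A, C, G, T, and keeps the counters in a heap-allocated vector rather than a stack array.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Rolling variant (hypothetical name). Assumes s contains only A, C, G, T;
// any other character is treated as digit 0 here.
std::vector<std::string> findRepeatedDnaRolling(const std::string& s) {
    const std::size_t LEN = 10;
    const unsigned MASK = (1u << (2 * LEN)) - 1; // low 20 bits = 10 base-4 digits
    std::vector<std::string> res;
    if (s.size() <= LEN) return res;

    // One small counter per possible code, saturating at 2.
    std::vector<unsigned char> seen(1u << (2 * LEN), 0);
    unsigned code = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned digit = 0;
        switch (s[i]) {
            case 'A': digit = 0; break;
            case 'C': digit = 1; break;
            case 'G': digit = 2; break;
            case 'T': digit = 3; break;
        }
        // Shift in the new digit and drop the one that left the window.
        code = ((code << 2) | digit) & MASK;
        if (i + 1 < LEN) continue; // window not full yet

        if (seen[code] < 2) {
            ++seen[code];
            // Report a sequence exactly once, on its second occurrence.
            if (seen[code] == 2) res.push_back(s.substr(i + 1 - LEN, LEN));
        }
    }
    return res;
}
```

For the example input "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" this returns ["AAAAACCCCC", "CCCCCAAAAA"], the same answer and order as the array version. A substring is only materialized when a repeat is found, so substr is no longer called for every window.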
Note that I used an array here to record the occurrence counts:
const int N=1048576;// a 10-digit base-4 number is always less than 4^10 = 1024^2
int key[N];
memset(key,0,sizeof(key));
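However, if the array declaration above is replaced with unordered_map<int,int> key; the running time rises to about 150 ms (measured on LeetCode's 30 test cases).
With map<int,int> key; the running time is 280 ms.
This shows the efficiency gap between an array, map, and unordered_map.
Whenever the latter two can be avoided, use an array to record the hash counts.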
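To repeat this comparison locally, a small harness along the following lines could be used. It is only a sketch: encodeAt, countRepeated, and timeMicros are hypothetical names, the input is assumed to contain only A, C, G, T, and absolute timings will depend on the machine and compiler, so they need not match the LeetCode numbers above.

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Encodes the 10 characters starting at pos as a base-4 number (A=0, C=1, G=2, T=3).
// Assumption: every character is one of A, C, G, T (anything else counts as T here).
int encodeAt(const std::string& s, std::size_t pos) {
    int value = 0;
    for (std::size_t i = pos; i < pos + 10; ++i) {
        int digit = (s[i] == 'A') ? 0 : (s[i] == 'C') ? 1 : (s[i] == 'G') ? 2 : 3;
        value = value * 4 + digit;
    }
    return value;
}

// Counts how many distinct 10-letter windows occur more than once.
// Works with any container that supports ++counts[key] for int keys:
// a pre-sized std::vector<int>, std::map<int,int>, or std::unordered_map<int,int>.
template <typename Counter>
int countRepeated(const std::string& s, Counter& counts) {
    int repeated = 0;
    for (std::size_t i = 0; i + 10 <= s.size(); ++i) {
        if (++counts[encodeAt(s, i)] == 2) ++repeated;
    }
    return repeated;
}

// Runs countRepeated once and returns the elapsed time in microseconds.
template <typename Counter>
long long timeMicros(const std::string& s, Counter& counts, int& repeatedOut) {
    auto start = std::chrono::steady_clock::now();
    repeatedOut = countRepeated(s, counts);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling timeMicros with a std::vector<int> of size 1048576, a std::unordered_map<int,int>, and a std::map<int,int> on the same long input gives three timings to compare, and all three calls should report the same repeat count.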