All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example, given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", return ["AAAAACCCCC", "CCCCCAAAAA"].
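A hash map is the obvious tool here, but it raises one question: how do you turn a 10-character string into a key?
Notice that in ASCII the low three bits of A, C, G, and T are all different (A = 001, C = 011, G = 111, T = 100), so those three bits can stand for each letter. A 10-character string then takes 30 bits in total, which fits in a single int.
You could also encode the four characters as 00, 01, 10, and 11, but that is more cumbersome.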
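As a minimal sketch of this encoding, the low three bits of each letter can be packed into one 30-bit key as shown below. The helper names `encodeWindow` and `checkLowBits` are hypothetical and are not part of the original solution; the input is assumed to contain only A, C, G, and T.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: pack the 10 letters starting at `pos` into a 30-bit key,
// using the low three bits of each letter's ASCII code.
// Assumes s contains only A, C, G, T and pos + 10 <= s.size().
std::uint32_t encodeWindow(const std::string& s, std::size_t pos) {
    std::uint32_t key = 0;
    for (std::size_t k = 0; k < 10; ++k) {
        key = (key << 3) | (static_cast<std::uint32_t>(s[pos + k]) & 7u);
    }
    return key;
}

// Low three bits of the ASCII codes: A = 0x41, C = 0x43, G = 0x47, T = 0x54.
void checkLowBits() {
    assert(('A' & 7) == 1);  // 001
    assert(('C' & 7) == 3);  // 011
    assert(('G' & 7) == 7);  // 111
    assert(('T' & 7) == 4);  // 100
}
```

Two equal 10-letter windows always produce the same key, and two different windows always produce different keys, because each letter maps to its own 3-bit pattern.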
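A few things to watch while coding:
1. + has higher precedence than <<, & and |
2. The difference between pre-increment (++x) and post-increment (x++)
3. How to use the STL unordered_map
4. Keep the result vector free of duplicates by testing the count with == 2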
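The first two points can be illustrated with a small sketch. The function names and the values used here are made up for the demonstration and do not come from the original solution.

```cpp
#include <cassert>
#include <unordered_map>

// Intended encoding step: shift the old code left by 3 bits, then add the new 3-bit value.
int stepWithParens(int code, int letter) {
    return (code << 3) + (letter & 7);
}

// Same tokens without parentheses. Because + binds tighter than << and &,
// this parses as (code << (3 + letter)) & 7. A compiler may warn about it.
// Call it with small values only, so that the shift count stays below the width of int.
int stepWithoutParens(int code, int letter) {
    return code << 3 + letter & 7;
}

void incrementDemo() {
    std::unordered_map<int, int> counts;  // operator[] value-initializes a missing key to 0
    int post = counts[42]++;              // yields the old count 0, stored count becomes 1
    int pre = ++counts[42];               // yields the new count 2
    assert(post == 0);
    assert(pre == 2);
    assert(counts[42] == 2);
}
```

With code = 1 and letter = 2, the parenthesized version gives 8 + 2 = 10, while the unparenthesized one evaluates (1 << 5) & 7, which is 0.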
The full source code follows:
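#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        unordered_map<unsigned int, int> m;  // 30-bit window code -> number of occurrences
        vector<string> result;
        if (s.size() < 10) return result;    // no 10-letter window exists
        size_t i = 0;
        unsigned int tmp = 0;                // unsigned, so the left shift below cannot overflow
        // Encode the first 10 letters, 3 bits each.
        while (i < 10) {
            tmp = (tmp << 3) + (s[i++] & 7); // i++ yields the old i, then increments; the parentheses are required
        }
        m[tmp]++;
        // Slide the window: shift in the next letter and keep only the low 30 bits.
        while (i < s.size()) {
            // + binds tighter than << and &, so both operands need parentheses; using | instead of + lets you drop them
            tmp = (tmp << 3 & 0x3FFFFFFF) + (s[i++] & 7);
            // Pre-increment, so the updated count is compared. Testing == 2 adds a sequence
            // only on its second occurrence, so later repeats are not added again.
            if (++m[tmp] == 2)
                result.push_back(s.substr(i - 10, 10));
        }
        return result;
    }
};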
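For comparison, here is a rough sketch of the 2-bit alternative mentioned earlier (00, 01, 10, 11), written as a standalone function so it can be tried on the example input. The names `code2` and `findRepeatedDna2Bit` are hypothetical, and the input is assumed to contain only A, C, G, and T.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: map a nucleotide to a 2-bit code (A=00, C=01, G=10, T=11).
// Assumes the input contains only A, C, G, T; anything else is treated as T.
static std::uint32_t code2(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;
    }
}

// Sketch of the same sliding-window idea with 20-bit keys instead of 30-bit ones.
std::vector<std::string> findRepeatedDna2Bit(const std::string& s) {
    std::vector<std::string> result;
    if (s.size() < 10) return result;           // no 10-letter window exists
    std::unordered_map<std::uint32_t, int> seen;
    const std::uint32_t mask = (1u << 20) - 1;  // keep the low 20 bits (10 letters x 2 bits)
    std::uint32_t key = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        key = ((key << 2) | code2(s[i])) & mask;
        // A full window ends at index i once i >= 9; report it on its second occurrence only.
        if (i >= 9 && ++seen[key] == 2) {
            result.push_back(s.substr(i - 9, 10));
        }
    }
    return result;
}
```

Calling `findRepeatedDna2Bit("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT")` yields "AAAAACCCCC" followed by "CCCCCAAAAA", matching the example in the problem statement. The extra `switch` is the price of the denser key, which is why the 3-bit trick is shorter to write.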