All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",Return:["AAAAACCCCC", "CCCCCAAAAA"].
[分析]
HASHMAP方法会EXCEED SPACE LIMIT.
因为只有4个字母,所以可以创建自己的hashkey, 每两个BITS, 对应一个 incoming character. 超过20bit 即10个字符时, 只保留20bits.
[注意]
1. (hash<<2) + map.get(c) 符号优先级, << 一定要括起来.
public class Solution {
public List<String> findRepeatedDnaSequences(String s) {
List<String> res = new ArrayList<String>();
if(s==null || s.length() < 11) return res;
int hash = 0;
Map<Character, Integer> map = new HashMap<Character, Integer>();
map.put('A', 0);
map.put('C', 1);
map.put('G', 2);
map.put('T', 3);
Set<Integer> set = new HashSet<Integer>();
Set<Integer> unique = new HashSet<Integer>();
for(int i=0; i<s.length(); i++) {
char c = s.charAt(i);
if(i<9) {
hash = (hash<<2) + map.get(c);
} else {
hash = (hash<<2) + map.get(c);
hash &= (1<<20) - 1;
if( set.contains(hash) && !unique.contains(hash)) {
res.add(s.substring(i-9, i+1));
unique.add(hash);
} else {
set.add(hash);
}
}
}
return res;
}
}