All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
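Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
Storing the substrings themselves in a dictionary leads to Memory Limit Exceeded, so we use bit operations to convert each substring to an integer, which reduces the memory overhead.
The input string is made up of only four characters: A, C, G, and T. Here are their ASCII codes in binary: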
65 A: 0100 0001
67 C: 0100 0011
71 G: 0100 0111
84 T: 0101 0100
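Since the goal is to tell the characters apart by their bits, the fewer bits the better. Looking at the table, the last three bits of each character are all different (001, 011, 111, 100), so the last three bits are enough to distinguish the four characters. The problem asks for strings of 10 characters; at three bits per character, 10 characters need 30 bits, which fits in a 32-bit int.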
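As a quick check of the observation above, here is a minimal sketch that prints the low three bits of each nucleotide's ASCII code. The class name LowBitsDemo and the helper code are made up for illustration and are not part of the solution.

```java
public class LowBitsDemo {
    // Hypothetical helper: returns the low three bits of a character's ASCII code.
    static int code(char c) {
        return c & 7;
    }

    public static void main(String[] args) {
        // Prints one line per nucleotide, e.g. "A -> 1".
        for (char c : "ACGT".toCharArray()) {
            System.out.println(c + " -> " + code(c));
        }
    }
}
```

The four values come out as 1, 3, 7, and 4, so no two nucleotides share a 3-bit code.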
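The algorithm: the problem calls for windows of 10 characters, so first load the first 9 characters, then scan forward one character at a time. Nine characters occupy 27 bits, so at each step we apply the mask 0x7ffffff to keep the low 27 bits, shift left by 3 bits, and OR in the last three bits of the next character. A HashSet records which encoded values have already been seen; if the current value has been seen before, the corresponding substring is added to the result.
The operation for taking the last three bits is: s.charAt(i) & 7
A second HashSet removes duplicates from the result, and it is converted to a list at the end: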
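import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        List<String> list = new ArrayList<String>();
        // A string of 10 or fewer characters cannot contain a repeated 10-letter window.
        if (s == null || s.length() <= 10) return list;
        HashSet<Integer> hashset = new HashSet<Integer>(); // encoded windows seen so far
        HashSet<String> resset = new HashSet<String>();    // repeated windows, deduplicated
        int mask = 0x7FFFFFF; // low 27 bits = the 9 most recent characters
        int cur = 0;
        int i = 0;
        // Load the first 9 characters, 3 bits each.
        while (i < 9) {
            cur = (cur << 3) | (s.charAt(i) & 7);
            i++;
        }
        // Slide the window: keep 9 characters, append the next one.
        while (i < s.length()) {
            cur = ((cur & mask) << 3) | (s.charAt(i) & 7);
            i++;
            if (hashset.contains(cur)) {
                // i was already advanced, so the window is s[i-10, i).
                resset.add(s.substring(i - 10, i));
            } else {
                hashset.add(cur);
            }
        }
        return new ArrayList<String>(resset);
    }
}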
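To see that the sliding update yields the same key as encoding each window from scratch, here is a small sketch under the same 3-bits-per-character scheme. The class RollingKeyDemo and the helpers encode and roll are hypothetical names introduced only for this check.

```java
public class RollingKeyDemo {
    // Packs the 10 characters starting at index start into 30 bits, 3 bits per character.
    static int encode(String s, int start) {
        int key = 0;
        for (int i = start; i < start + 10; i++) {
            key = (key << 3) | (s.charAt(i) & 7);
        }
        return key;
    }

    // Slides the window one character to the right: keep the low 27 bits
    // (the 9 most recent characters), shift left by 3, append the new character.
    static int roll(int key, char next) {
        return ((key & 0x7FFFFFF) << 3) | (next & 7);
    }

    public static void main(String[] args) {
        String s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT";
        int key = encode(s, 0);
        for (int i = 10; i < s.length(); i++) {
            key = roll(key, s.charAt(i));
            // Compare the rolled key with a fresh encoding of the window ending at i.
            System.out.println(key == encode(s, i - 9));
        }
    }
}
```

Because the 3-bit codes are distinct, two windows get the same key only when they are the same 10-letter string, which is why comparing integers is enough to detect repeats.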