Leetcode | Repeated DNA Sequences

Article link: https://blog.youkuaiyun.com/mike_learns_to_rock/article/details/46646373

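This article presents an algorithm for finding every 10-letter sequence that occurs more than once in a DNA molecule. It is implemented in two ways: the first works with the substrings directly, and the second converts each sequence to a base-4 number to reduce memory use.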


All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

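Solution 1: brute force (Memory Limit Exceeded). Scan the string from start to end and use a map to count how often each 10-letter substring occurs. Any substring with a count greater than 1 is part of the answer.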

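#include <map>
#include <string>
#include <vector>
using namespace std;

vector<string> findRepeatedDnaSequences(string s) {
    map<string,int> key;           // substring -> occurrences seen so far
    vector<string> res;
    if(s.size()<=10) return res;   // at most one window, so nothing can repeat
    for(size_t i=0;i<=s.size()-10;i++)
    {
        string tmp=s.substr(i,10);
        // report each substring once, when its count first reaches 2
        if(++key[tmp]==2) res.push_back(tmp);
    }
    return res;
}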

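Solution 2: storing that many strings uses too much memory. Because there are only four possible characters, the fix is simple: represent each 10-letter string as a base-4 number. (72ms)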

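#include <cstring>
#include <string>
#include <vector>
using namespace std;

// Maps one nucleotide to its base-4 digit; returns -1 for any other character.
int ACGT2INT(char c)
{
    switch(c)
    {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    return -1;
}

// Reads the first 10 characters of m as a base-4 number (first character is
// the most significant digit). Assumes m holds at least 10 valid nucleotides.
int DNA2INT(const string& m)
{
    const int MAX=10;
    int res=0;
    for(int i=0;i<MAX;i++)
    {
        res=res*4+ACGT2INT(m[i]);
    }
    return res;
}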

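vector<string> findRepeatedDnaSequences(string s) {
    const int N=1048576;   // 4^10 possible 10-letter sequences
    // Note: about 4 MB on the stack, assuming a 4-byte int. On platforms with
    // a small stack limit this may need to be static or heap-allocated.
    int key[N];
    memset(key,0,sizeof(key));
    vector<string> res;
    if(s.size()<=10) return res;
    for(size_t i=0;i<=s.size()-10;i++)
    {
        string tmp=s.substr(i,10);
        int code=DNA2INT(tmp);   // encode once and reuse the value
        if(++key[code]==2) res.push_back(tmp);
    }
    return res;
}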
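The base-4 value does not have to be recomputed from scratch for every window. The following is a minimal sketch of a sliding-window variant, not the code from the original post: it shifts the previous value left by two bits, adds the new nucleotide, and masks to 20 bits. The names nucleotideCode, findRepeatedRolling, kLen and kMask are hypothetical, and the sketch assumes the input contains only A, C, G and T.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: 2-bit code for one nucleotide.
// Assumes the input contains only 'A', 'C', 'G', 'T'.
static std::uint32_t nucleotideCode(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // treated as 'T'
    }
}

// Sliding-window variant: the 10-digit base-4 value is updated in O(1)
// per character instead of being rebuilt from a fresh substring.
std::vector<std::string> findRepeatedRolling(const std::string& s) {
    const std::size_t kLen = 10;
    const std::uint32_t kMask = (1u << 20) - 1;  // low 20 bits = 10 base-4 digits
    std::vector<std::string> res;
    if (s.size() <= kLen) return res;

    // Heap-allocated counters; one byte each is enough because we stop at 2.
    std::vector<unsigned char> count(1u << 20, 0);
    std::uint32_t code = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Push the new digit in and drop the digit that left the window.
        code = ((code << 2) | nucleotideCode(s[i])) & kMask;
        if (i + 1 < kLen) continue;  // window not full yet
        if (count[code] < 2 && ++count[code] == 2) {
            res.push_back(s.substr(i + 1 - kLen, kLen));
        }
    }
    return res;
}
```

With this variant a substring is only materialized when it is actually added to the result, and the counter table lives on the heap rather than the stack.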

Note that I used an array here to record the occurrence counts:

const int N=1048576; // a 10-digit base-4 number is at most 4^10-1, which is below 1024^2 = 1048576
    int key[N];
    memset(key,0,sizeof(key));
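The array size can be checked at compile time. The sketch below assumes C++14 or later (for the loop inside a constexpr function) and uses hypothetical helpers encodeNucleotide and encode10 that mirror the base-4 encoding described above; they are for illustration only and are not part of the original solution.

```cpp
// Hypothetical constexpr mirror of the base-4 encoding, for checking only.
// Assumes the input contains only 'A', 'C', 'G', 'T'.
constexpr int encodeNucleotide(char c) {
    return c == 'A' ? 0 : c == 'C' ? 1 : c == 'G' ? 2 : 3;
}

// Encodes the first 10 characters of m, most significant digit first.
constexpr int encode10(const char* m) {
    int res = 0;
    for (int i = 0; i < 10; ++i) res = res * 4 + encodeNucleotide(m[i]);
    return res;
}

// The smallest and largest codes bracket the valid array indices.
static_assert(encode10("AAAAAAAAAA") == 0, "smallest code is 0");
static_assert(encode10("TTTTTTTTTT") == 1048575, "largest code is 4^10 - 1");
static_assert(1048576 == 1024 * 1024, "4^10 equals 1024^2");

// The first repeated sequence from the example: 11111 in base 4.
static_assert(encode10("AAAAACCCCC") == 341, "1 + 4 + 16 + 64 + 256");
```

Every code therefore falls in the range 0 to 1048575, so an array of 1048576 counters covers all possible 10-letter sequences.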

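However, if these are replaced with unordered_map<int,int> key; the running time is about 150ms. (Measured on LeetCode's 30 test cases.)

With map<int,int> key; the measured time is 280ms.

This shows the efficiency gap between a plain array, map, and unordered_map.

Whenever the latter two can be avoided, use an array to record the hash counts.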
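To reproduce this kind of comparison locally, one rough approach is to run the same counting loop against each container and time it. The sketch below is an assumption-laden illustration, not the measurement setup used for the numbers above: encodeAt, countRepeats and timeMicros are hypothetical names, std::vector<int> stands in for the raw array, and the absolute timings will depend on the compiler, optimization flags, and machine.

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Base-4 code of the 10 characters starting at s[pos].
// Assumes the input contains only 'A', 'C', 'G', 'T'.
static int encodeAt(const std::string& s, std::size_t pos) {
    int res = 0;
    for (std::size_t i = 0; i < 10; ++i) {
        char c = s[pos + i];
        res = res * 4 + (c == 'A' ? 0 : c == 'C' ? 1 : c == 'G' ? 2 : 3);
    }
    return res;
}

// Counts how many distinct 10-letter windows occur more than once.
// Counter can be any type with operator[] taking an int key, for example
// std::vector<int> of size 1 << 20, std::unordered_map<int,int>, or
// std::map<int,int>.
template <typename Counter>
std::size_t countRepeats(const std::string& s, Counter& key) {
    std::size_t repeats = 0;
    for (std::size_t i = 0; i + 10 <= s.size(); ++i) {
        if (++key[encodeAt(s, i)] == 2) ++repeats;
    }
    return repeats;
}

// Times one run of countRepeats in microseconds and reports its result
// through repeatsOut so the containers can be cross-checked.
template <typename Counter>
long long timeMicros(const std::string& s, Counter& key, std::size_t& repeatsOut) {
    auto start = std::chrono::steady_clock::now();
    repeatsOut = countRepeats(s, key);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling timeMicros once with a std::vector<int>(1 << 20), once with an empty std::unordered_map<int,int>, and once with an empty std::map<int,int> on the same long input gives three timings to compare. The repeat counts should agree across all three.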