DNA Sequence
Description
It is well known that a DNA sequence consists only of the characters A, C, T and G, and it is often useful to analyze a segment of a DNA sequence. For example, if an animal's DNA sequence contains the segment ATC, the animal may have a genetic disease. Scientists have found several such segments so far; the problem is to count how many DNA sequences of a species do not contain any of those segments.
Suppose the DNA sequences of a species consist of A, C, T and G, and the length of the sequences is a given integer n.
Input
The first line contains two integers m (0 <= m <= 10) and n (1 <= n <= 2000000000). Here, m is the number of genetic disease segments, and n is the length of the sequences.
Each of the next m lines contains a genetic disease segment; the length of each segment is at most 10.
Output
An integer: the number of DNA sequences, mod 100000.
Sample Input
4 3
AT
AC
AG
AA
Sample Output
36
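Problem: given m forbidden strings made up only of A, C, G and T, count the strings of length n over A, C, G and T that contain none of the forbidden strings as a substring, modulo 100000.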
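For small n the definition can be checked directly. The sketch below is not part of the original solution: `brute_force_count` is a hypothetical helper name, and it assumes n is at most 12 so that all 4^n candidates can be enumerated. It may be useful for cross-checking the automaton-based program on small inputs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not in the original program): counts the strings of
// length n over {A, C, G, T} that contain none of the forbidden segments,
// by enumerating every candidate. Only feasible for small n.
long long brute_force_count(const std::vector<std::string>& forbidden, int n) {
    assert(n >= 0 && n <= 12);  // assumption: 4^12 candidates is the practical limit
    const char alphabet[4] = {'A', 'C', 'G', 'T'};

    long long total = 1;
    for (int i = 0; i < n; ++i) total *= 4;  // total = 4^n

    long long good = 0;
    std::string s(n, 'A');
    for (long long code = 0; code < total; ++code) {
        // Decode `code` as a base-4 number into the candidate string.
        long long c = code;
        for (int i = 0; i < n; ++i) {
            s[i] = alphabet[c % 4];
            c /= 4;
        }
        // Reject the candidate if any forbidden segment occurs in it.
        bool ok = true;
        for (const std::string& f : forbidden) {
            if (s.find(f) != std::string::npos) {
                ok = false;
                break;
            }
        }
        if (ok) ++good;
    }
    return good % 100000;
}
```

On the sample (segments AT, AC, AG, AA with n = 3), A may appear only in the last position, so the count is 3 * 3 * 4 = 36, which is the value this function returns.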
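Approach: build an Aho-Corasick automaton from the m forbidden strings and derive the initial matrix from it. A.matrix[i][j] is the number of ways to move from automaton node i to node j in one step without reaching the end of a forbidden string. Then raise A to the L-th power by binary (fast) exponentiation, where L is the sequence length; afterwards A.matrix[i][j] is the number of ways to get from i to j in L steps without reaching the end of a forbidden string. When building the initial matrix, note that a node i with cnt != 0 (it ends a forbidden string, or one of its suffixes does) can never be visited, so it serves no purpose. We therefore renumber only the nodes with cnt = 0 and build the matrix over them. The answer is sum(A.matrix[0][k], 0 <= k < sz'), where sz' is the number of nodes with cnt = 0. The matrix must use long long, because the product of two entries can reach about 10^10 before the modulo is taken. Note that the code reads the number of forbidden strings into n and the sequence length into l, and reuses m for the matrix size. See the program: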
#include<cstdio>
#include<cstring>
#include<algorithm>
using namespace std;
typedef long long ll;
const int MAXN=100+50;
const int mod=100000;
const int sigma_size=4;
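int n,m,head,tail,sz;   // n: number of forbidden strings, m: matrix dimension, head/tail: BFS queue indices, sz: trie node count
ll l;                   // length of the sequences to count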
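struct node
{
    int cnt,id;                     // cnt != 0: this state ends (or has a suffix that ends) a forbidden string; id: index in trie[]
    node *next[sigma_size],*fail;   // goto transitions and failure link
}trie[MAXN],*root,*que[MAXN];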
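struct AC
{
    // Allocate a fresh node from the static pool.
    node *createnode()
    {
        for(int k=0;k<sigma_size;k++)
            trie[sz].next[k]=NULL;
        trie[sz].fail=NULL;
        trie[sz].cnt=0,trie[sz].id=sz;
        return &trie[sz++];
    }
    // Reset the pool and the BFS queue for a new test case.
    void init()
    {
        sz=0;
        head=tail=0;
        root=createnode();
    }
    // Map a nucleotide to an index in [0, sigma_size).
    int idx(char c)
    {
        if(c=='A') return 0;
        if(c=='T') return 1;
        if(c=='G') return 2;
        return 3;   // 'C'
    }
    // Insert one forbidden string into the trie.
    void insert(char *str)
    {
        node *p=root;
        int len=strlen(str);
        for(int i=0;i<len;i++)
        {
            int k=idx(str[i]);
            if(p->next[k]==NULL)
                p->next[k]=createnode();
            p=p->next[k];
        }
        p->cnt++;   // mark the end of a forbidden string
    }
    // BFS over the trie: set failure links and fill in missing transitions,
    // so that next[k] is defined for every node and every character.
    void get_fail()
    {
        que[tail++]=root;
        while(head<tail)
        {
            node *p=que[head++];
            for(int k=0;k<sigma_size;k++)
                if(p->next[k])
                {
                    if(p==root)
                        p->next[k]->fail=root;
                    else
                        p->next[k]->fail=p->fail->next[k];
                    // a state is also forbidden if its failure (longest proper suffix) state is
                    p->next[k]->cnt|=p->next[k]->fail->cnt;
                    que[tail++]=p->next[k];
                }
                else
                {
                    if(p==root)
                        p->next[k]=root;
                    else
                        p->next[k]=p->fail->next[k];
                }
        }
    }
}ac;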
struct Matrix
{
    ll matrix[MAXN][MAXN];
}E;   // E: identity matrix, filled in at the start of main()
// Multiply the top-left m x m blocks of a and b modulo mod.
Matrix matrix_mul(const Matrix &a,const Matrix &b)
{
    Matrix c;
    for(int i=0;i<m;i++)
        for(int j=0;j<m;j++)
        {
            c.matrix[i][j]=0;
            for(int k=0;k<m;k++)
                if(a.matrix[i][k] && b.matrix[k][j])
                {
                    // each factor is below mod, so the product fits in long long
                    c.matrix[i][j]+=a.matrix[i][k]*b.matrix[k][j];
                    if(c.matrix[i][j]>=mod)
                        c.matrix[i][j]%=mod;
                }
        }
    return c;
}
// Binary exponentiation: returns a^k (top-left m x m block), starting from the identity E.
Matrix matrix_pow(Matrix a,ll k)
{
    Matrix c=E;
    while(k)
    {
        if(k&1)
            c=matrix_mul(c,a);
        a=matrix_mul(a,a);
        k>>=1;
    }
    return c;
}
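int main()
{
    //freopen("text.txt","r",stdin);
    for(int i=0;i<MAXN;i++)
        E.matrix[i][i]=1;
    // n: number of forbidden strings, l: sequence length.
    // %I64d is the MSVC/MinGW format for long long; use %lld where %I64d is not accepted.
    while(scanf("%d%I64d",&n,&l)==2)
    {
        ac.init();
        char str[15];
        for(int i=0;i<n;i++)
        {
            scanf("%s",str);
            ac.insert(str);
        }
        ac.get_fail();
        Matrix A;
        memset(A.matrix,0,sizeof(A.matrix));
        // Build the transition matrix over the safe states (cnt == 0) only,
        // renumbering them as u (row) and v (column).
        int u=0,v,num;
        for(int i=0;i<sz;i++)
            if(!trie[i].cnt)
            {
                v=0;
                for(int j=0;j<sz;j++)
                    if(!trie[j].cnt)
                    {
                        // count the characters that lead from state i to state j
                        num=0;
                        for(int k=0;k<sigma_size;k++)
                            if(trie[i].next[k]->id==trie[j].id)
                                num++;
                        A.matrix[u][v]=num;
                        v++;
                    }
                u++;
            }
        m=u;
        A=matrix_pow(A,l);
        // Sum row 0: all safe paths of length l that start at the root.
        ll ans=0;
        for(int i=0;i<m;i++)
        {
            ans+=A.matrix[0][i];
            if(ans>=mod)
                ans%=mod;
        }
        printf("%I64d\n",ans);
    }
    return 0;
}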
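As a sanity check of the matrix step, the sample can be worked by hand. With the forbidden strings AT, AC, AG and AA, only two automaton states are safe: the root (state 0) and "A" (state 1). From the root, A leads to state 1 and the other three letters lead back to the root; from state 1 every letter completes a forbidden string. The sketch below hard-codes that 2 x 2 matrix and raises it to the n-th power. The names `Mat`, `multiply`, `power`, `count_from_root` and `sample_matrix` are illustrative and do not appear in the program above, and `std::vector` is used here in place of the fixed-size array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<long long>>;
const long long MOD = 100000;

// Matrix product modulo MOD.
Mat multiply(const Mat& a, const Mat& b) {
    assert(!a.empty() && !b.empty() && a[0].size() == b.size());
    std::size_t rows = a.size(), inner = b.size(), cols = b[0].size();
    Mat c(rows, std::vector<long long>(cols, 0));
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t k = 0; k < inner; ++k)
            for (std::size_t j = 0; j < cols; ++j)
                c[i][j] = (c[i][j] + a[i][k] * b[k][j]) % MOD;
    return c;
}

// Binary exponentiation of a square matrix.
Mat power(Mat base, long long exp) {
    std::size_t n = base.size();
    Mat result(n, std::vector<long long>(n, 0));
    for (std::size_t i = 0; i < n; ++i) result[i][i] = 1;  // identity
    while (exp > 0) {
        if (exp & 1) result = multiply(result, base);
        base = multiply(base, base);
        exp >>= 1;
    }
    return result;
}

// Number of safe walks of the given length that start at state 0 (the root).
long long count_from_root(const Mat& transitions, long long length) {
    Mat p = power(transitions, length);
    long long sum = 0;
    for (long long v : p[0]) sum = (sum + v) % MOD;
    return sum;
}

// Hand-derived transition matrix for the sample (AT, AC, AG, AA).
Mat sample_matrix() {
    Mat t(2, std::vector<long long>(2, 0));
    t[0][0] = 3;  // root -> root on three of the four letters
    t[0][1] = 1;  // root -> "A" on the letter A
    return t;     // row 1 stays zero: no safe move out of "A"
}
```

Calling `count_from_root(sample_matrix(), 3)` gives 27 + 9 = 36, matching the sample output.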