trie -- suffix tree -- suffix automa 有这么一些应用场景:
即时响应用户输入的AJAX搜索框时, 显示候选列表。
搜索引擎的关键字个数统计。
后缀树(Suffix Tree): 从根到叶子表示一个后缀。
仅仅从这一个简单的描述,我们可以概念上解决下面的几个问题:
P:查找字符串o是否在字符串S中
A:若o在S中,则o必然是S的某个后缀的前缀。 用S构造后缀树,按在trie中搜索字串的方法搜索o即可。
P: 指定字符串T在字符串S中的重复次数。
A: 如果T在S中重复了两次,则S应有两个后缀以T为前缀,搜索T节点下的叶节点数目即为重复次数。
P: 字符串S中的最长重复子串。
A: 同上,找到最深的非叶节点T。
P: 两个字符串S1,S2的最长公共子串。
A: 广义后缀树(Generalized Suffix Tree)存储_多个_字符串各自的所有后缀。把两个字符串S1#,S2$加入到广义后缀树中,然后同上。
(A longest substring common to s1 and s2 will be the path-label of an internal node with the
greatest string depth in the suffix tree which has leaves labelled with suffixes from both the
strings.)
Suffix Automa: 识别文本所有子串的辅助索引结构。
下面的代码是直接翻译[1]中算法A:
/*Directed Acyclic Word Graph
*/
#include <stdlib.h>
#include <string.h>
typedef struct State{
struct State *first[26], *second[26];
struct State *suffix;
}State;
State *sink, *source;
State *new_state(void)
{
State *s = malloc(sizeof *s);
if(s){
memset(s, 0, sizeof *s);
}
return s;
}
/*state:
parent -- [x] with xa = tail(wa)
child -- [tail(wa)]
new child -- [tail(wa)]_{wa}
*/
State *split(State *parent, int a)
{
int i;
/*current state, child, new child*/
State *cs = parent, *c = parent->second[a], *nc = new_state(); //S1
parent->first[a] = parent->second[a] = nc; //S2
for(i = 0; i < 26; ++i){
nc->second[i] = c->second[i]; //S3
}
nc->suffix = c->suffix; //S4
c->suffix = nc; //S5
for(cs = parent; cs != source; ){//S6,7
cs = cs->suffix; //S7.a
for(i = 0; i < 26; ++i){
if(cs->second[i] == c)cs->second[i] = nc; //S7.b
else goto _out; //S7.c
}
}
_out:
return nc; //S8
}
/*state:
new sink -- [wa]
*/
void update(int a)
{
/*suffix state, current state, new sink*/
State *ss = NULL, *cs = sink, *ns = new_state(); //U1,2
sink->first[a] = ns;
while(cs != source && ss == NULL){//U3
cs = cs->suffix; //U3.a
if(!cs->first[a] && !cs->second[a]){
cs->second[a] = ns; //U3.b.1
}else if(cs->first[a]){
ss = cs->first[a]; //U3.b.2
}else if(cs->second[a]){
ss = split(cs, a); //U3.b.3
}
}
if(ss == NULL){ss = source;} //U4
ns->suffix = ss; sink = ns; //U5
}
int build_dawg(char *w)
{
sink = source = new_state();
for(; *w; ++w){update(*w-'a');}
}
我还在努力理解中,没有测试。
[1] the smallest automation recognizing the subwords of a text
https://cbse.soe.ucsc.edu/sites/default/files/smallest_automaton1985.pdf