Aho-Corasick Automaton and Suffix Arrays
PA6.1
G -> output: []
A -> output: []
A -> output: [‘AA’]
T -> output: [‘GAAT’]
C -> output: [‘C’]
T -> output: []
C -> output: [‘C’]
C -> output: [‘C’]
T -> output: []
C -> output: [‘C’]
A -> output: []
G -> output: [‘TCAG’]
T -> output: []
A -> output: []
A -> output: [‘AA’]
T -> output: []
T -> output: [‘TT’]
A -> output: [‘ATTA’]
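The scan above can be reproduced with a small matcher. What follows is a minimal sketch, not the course's reference implementation: the class name AcMatcher and all of its members are made up for illustration, it assumes the alphabet is {A, C, G, T} and that the patterns are distinct, and it collects matches by walking the failure chain at every step instead of following precomputed dictionary links.

```cpp
#include <array>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Minimal Aho-Corasick matcher over {A, C, G, T}; assumes distinct patterns.
class AcMatcher {
public:
    explicit AcMatcher(const std::vector<std::string>& patterns)
        : patterns_(patterns) {
        new_node();  // node 0 is the root
        for (std::size_t id = 0; id < patterns_.size(); ++id) {
            int cur = 0;
            for (char c : patterns_[id]) {
                int k = index_of(c);
                if (next_[cur][k] == -1) {
                    int child = new_node();  // create first, then store the edge
                    next_[cur][k] = child;
                }
                cur = next_[cur][k];
            }
            ends_[cur] = static_cast<int>(id);  // this node ends pattern `id`
        }
        build_failure_links();
    }

    // For each text position, the patterns that end there, longest first.
    std::vector<std::vector<std::string>> scan(const std::string& text) const {
        std::vector<std::vector<std::string>> out;
        int state = 0;
        for (char c : text) {
            int k = index_of(c);
            while (state != 0 && next_[state][k] == -1) state = fail_[state];
            if (next_[state][k] != -1) state = next_[state][k];
            std::vector<std::string> hits;
            for (int v = state; v != 0; v = fail_[v])  // walk the failure chain
                if (ends_[v] != -1) hits.push_back(patterns_[ends_[v]]);
            out.push_back(hits);
        }
        return out;
    }

private:
    static int index_of(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            default:  return 3;  // 'T'
        }
    }

    int new_node() {
        std::array<int, 4> row;
        row.fill(-1);  // -1 marks a missing edge
        next_.push_back(row);
        ends_.push_back(-1);
        return static_cast<int>(next_.size()) - 1;
    }

    // Breadth-first construction; in this sketch the root fails to itself.
    void build_failure_links() {
        fail_.assign(next_.size(), 0);
        std::queue<int> q;
        for (int k = 0; k < 4; ++k)
            if (next_[0][k] != -1) q.push(next_[0][k]);
        while (!q.empty()) {
            int u = q.front();
            q.pop();
            for (int k = 0; k < 4; ++k) {
                int v = next_[u][k];
                if (v == -1) continue;
                int f = fail_[u];
                while (f != 0 && next_[f][k] == -1) f = fail_[f];
                fail_[v] = (next_[f][k] != -1) ? next_[f][k] : 0;
                q.push(v);
            }
        }
    }

    std::vector<std::string> patterns_;
    std::vector<std::array<int, 4>> next_;  // trie edges
    std::vector<int> ends_;                 // pattern id ending here, or -1
    std::vector<int> fail_;                 // failure links
};
```

Calling AcMatcher(patterns).scan("GAATCTCCTCAGTAATTA") with the seven patterns returns one list per character of the text, in the same order as the trace above.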
1. How many nodes does the resulting automaton have?
Root node (1 node)
Pattern “GAAT” adds 4 nodes: G → A → A → T
Pattern “C” adds 1 node: C
Pattern “GCTC” adds 3 new nodes: G → C → T → C (reuses the initial “G” from “GAAT”)
Pattern “TCAG” adds 4 new nodes: T → C → A → G
Pattern “AA” adds 2 new nodes: A → A
Pattern “ATTA” adds 3 new nodes: A → T → T → A (reuses the initial “A” from “AA”)
Pattern “TT” adds 1 new node: T → T (reuses the initial “T” from “TCAG”)
Total nodes: 1 (root) + 4 + 1 + 3 + 4 + 2 + 3 + 1 = 19 nodes
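The total can be cross-checked without building the trie at all: every node corresponds to exactly one distinct prefix of some pattern, with the empty prefix playing the role of the root. A minimal sketch of that idea (the function names are illustrative, not part of the assignment):

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Collects every distinct prefix of every pattern; "" stands for the root.
std::set<std::string> trie_prefixes(const std::vector<std::string>& patterns) {
    std::set<std::string> prefixes;
    prefixes.insert("");
    for (const std::string& p : patterns)
        for (std::size_t len = 1; len <= p.size(); ++len)
            prefixes.insert(p.substr(0, len));
    return prefixes;
}

// Number of trie nodes, root included.
std::size_t count_trie_nodes(const std::vector<std::string>& patterns) {
    return trie_prefixes(patterns).size();
}
```

For the seven patterns above, count_trie_nodes returns 19.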
2. How many failure links does the resulting automaton have?
Root node: Has no failure link (it’s the starting state).
Direct children of the root (G, C, T, A): Each links back to the root (4 failure links).
Remaining nodes (14 nodes): Each has a failure link pointing to another node in the trie.
Total failure links: 0 (root) + 4 (direct children) + 14 (remaining nodes) = 18 failure links.
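To see where each of those links points, the definition can be applied directly: the failure link of a node goes to the longest proper suffix of the node's string that is itself a node of the trie (the empty string being the root). The sketch below is a brute-force check written from that definition, not the breadth-first construction a real implementation would use; the function name is an assumption made for this example.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps each non-root node (named by its prefix string) to its failure target.
std::map<std::string, std::string> failure_links(
        const std::vector<std::string>& patterns) {
    std::set<std::string> nodes;
    nodes.insert("");  // root
    for (const std::string& p : patterns)
        for (std::size_t len = 1; len <= p.size(); ++len)
            nodes.insert(p.substr(0, len));

    std::map<std::string, std::string> links;
    for (const std::string& node : nodes) {
        if (node.empty()) continue;  // the root has no failure link
        // Try proper suffixes from longest to shortest; "" always matches.
        for (std::size_t start = 1; start <= node.size(); ++start) {
            std::string suffix = node.substr(start);
            if (nodes.count(suffix) != 0) {
                links[node] = suffix;
                break;
            }
        }
    }
    return links;
}
```

For the seven patterns this yields 18 entries, for example GAA → AA, GAAT → AT, GCTC → TC and ATT → TT, while G, C, T and A map to the root.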
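3. How many dictionary links does the resulting automaton have?
Counting one link per pattern-ending node, the resulting Aho-Corasick automaton has 7 dictionary links (also known as output links). Here’s why:
Under this count, a dictionary link is attached to each node that represents the end of a complete pattern (i.e., a node with a non-empty output list).
Each of the 7 input patterns (“GAAT”, “C”, “GCTC”, “TCAG”, “AA”, “ATTA”, “TT”) ends at a unique node in the trie.
Therefore, there are 7 nodes with output values, each contributing one dictionary link.
Under the more common definition, where a dictionary link runs from a node to the nearest pattern-ending node on its failure chain, the links are GAA → AA, GC → C, GCTC → C, TC → C and ATT → TT, which gives 5.
Total dictionary links: 7 by the per-pattern count (5 by the failure-chain definition).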
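The number depends on how a dictionary link is defined. The sketch below assumes the common definition in which a dictionary link goes from a node to the nearest pattern-ending node on its failure chain. Since the failure chain visits the proper suffixes of a node's string in decreasing length, this is the same as the longest proper suffix of that string that is a complete pattern. It is a brute-force check from that definition, and the function name is made up for illustration.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps each node that has a dictionary link to the pattern it links to.
std::map<std::string, std::string> dictionary_links(
        const std::vector<std::string>& patterns) {
    std::set<std::string> words(patterns.begin(), patterns.end());
    std::set<std::string> nodes;
    for (const std::string& p : patterns)
        for (std::size_t len = 1; len <= p.size(); ++len)
            nodes.insert(p.substr(0, len));

    std::map<std::string, std::string> links;
    for (const std::string& node : nodes) {
        // Non-empty proper suffixes, longest first.
        for (std::size_t start = 1; start < node.size(); ++start) {
            std::string suffix = node.substr(start);
            if (words.count(suffix) != 0) {
                links[node] = suffix;
                break;
            }
        }
    }
    return links;
}
```

Under that definition the seven patterns give 5 links: ATT → TT, GAA → AA, GC → C, GCTC → C and TC → C.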
PA6.2
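Number of nodes: 121 (one node for every distinct prefix)
Number of failure links: 120 (every non-root node has exactly one failure link)
Number of dictionary links: 11 (counted as one per input pattern; there are 11 patterns)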
PA6.3
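Suffix array: list every suffix together with its starting position.
0 TCCAAT
1 CCAAT
2 CAAT
3 AAT
4 AT
5 T
Sorting the suffixes lexicographically gives the starting positions
3,4,2,1,5,0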
PA6.4
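Definition: the suffix array of a string lists the starting positions of all of its suffixes, ordered by the lexicographic order of those suffixes.
List all suffixes of the string “CGGAAGGTCTA”:
The suffix starting at position 0 is “CGGAAGGTCTA”;
the suffix starting at position 1 is “GGAAGGTCTA”;
the suffix starting at position 2 is “GAAGGTCTA”;
the suffix starting at position 3 is “AAGGTCTA”;
the suffix starting at position 4 is “AGGTCTA”;
the suffix starting at position 5 is “GGTCTA”;
the suffix starting at position 6 is “GTCTA”;
the suffix starting at position 7 is “TCTA”;
the suffix starting at position 8 is “CTA”;
the suffix starting at position 9 is “TA”;
the suffix starting at position 10 is “A”.
Sort the suffixes lexicographically:
“A” (position 10) is the smallest;
next is “AAGGTCTA” (position 3);
then “AGGTCTA” (position 4);
then “CGGAAGGTCTA” (position 0), because “CG” sorts before “CT”;
then “CTA” (position 8);
then “GAAGGTCTA” (position 2);
then “GGAAGGTCTA” (position 1), because “GGA” sorts before “GGT”;
then “GGTCTA” (position 5);
then “GTCTA” (position 6);
then “TA” (position 9);
and the largest is “TCTA” (position 7).
Read off the suffix array: the starting positions, taken in sorted order, form the suffix array, i.e.
[10,3,4,0,8,2,1,5,6,9,7]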
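A hand-derived suffix array is easy to get wrong, so it can help to check it mechanically. The sketch below is one possible checker (the name is_suffix_array is an assumption for this example): it verifies that the array is a permutation of 0..n-1 and that each suffix is strictly smaller than the next one.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if `sa` lists every start position of `s` exactly once and the
// corresponding suffixes appear in strictly increasing lexicographic order.
bool is_suffix_array(const std::string& s, const std::vector<unsigned int>& sa) {
    if (sa.size() != s.size()) return false;

    // Permutation check: each position in range, none repeated.
    std::vector<bool> seen(s.size(), false);
    for (unsigned int pos : sa) {
        if (pos >= s.size() || seen[pos]) return false;
        seen[pos] = true;
    }

    // Order check: compare neighbouring suffixes in place, without copying.
    for (std::size_t r = 1; r < sa.size(); ++r) {
        int cmp = s.compare(sa[r - 1], std::string::npos,
                            s, sa[r], std::string::npos);
        if (cmp >= 0) return false;
    }
    return true;
}
```

Passing the string and a candidate array returns false as soon as two neighbouring suffixes are out of order, which is the usual mistake when sorting by hand.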
PA6.5 Suffix array implementation
Core code:
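#include <string>
#include <vector>
#include <algorithm>
#include "SuffixArray.h"
using namespace std;
// Function object whose operator() defines the ordering used by sort
struct SuffixComparator {
    const string& s; // reference to the string whose suffixes are compared
    // Constructor: bind the reference to the input string
    explicit SuffixComparator(const string& str) : s(str) {}
    // Returns true if the suffix starting at i sorts before the suffix starting at j
    bool operator()(unsigned int i, unsigned int j) const {
        return s.substr(i) < s.substr(j); // lexicographic comparison of the two suffixes (copies both)
    }
};
vector<unsigned int> suffix_array(const string & s) {
    vector<unsigned int> indices;
    indices.reserve(s.length());
    // Start with every suffix position 0..n-1
    for (unsigned int i = 0; i < s.length(); ++i) {
        indices.push_back(i);
    }
    // Sort the positions, using the function object as the comparison rule
    sort(indices.begin(), indices.end(), SuffixComparator(s));
    return indices;
}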
The main thing to know here is how to use the sort function with a custom comparator; the code itself is fairly simple.
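The comparator above builds two temporary strings on every comparison. A possible variant, sketched here under the assumption that only the standard library is available, compares the suffixes in place with std::string::compare and fills the index vector with std::iota. The name suffix_array_no_copy is made up for this example; the result is the same as with the substr version, and sorting still takes O(n log n) comparisons of up to O(n) characters each.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: sort the start positions by comparing suffixes in place.
std::vector<unsigned int> suffix_array_no_copy(const std::string& s) {
    std::vector<unsigned int> indices(s.size());
    std::iota(indices.begin(), indices.end(), 0u);  // 0, 1, ..., n-1
    std::sort(indices.begin(), indices.end(),
              [&s](unsigned int i, unsigned int j) {
                  // Negative result: suffix at i sorts before suffix at j.
                  return s.compare(i, std::string::npos,
                                   s, j, std::string::npos) < 0;
              });
    return indices;
}
```

For example, suffix_array_no_copy("TCCAAT") returns 3, 4, 2, 1, 5, 0, matching the result worked out by hand in PA6.3.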