Aho-Corasick Automaton and Suffix Arrays

Original article, last modified 2025-05-12 10:39:30

License: CC 4.0 BY-SA

Tags: #data-structures

First published 2025-05-12 10:20:32

PA6.1

G -> output: []
A -> output: []
A -> output: ['AA']
T -> output: ['GAAT']
C -> output: ['C']
T -> output: []
C -> output: ['GCTC', 'C']
C -> output: ['C']
T -> output: []
C -> output: ['C']
A -> output: []
G -> output: ['TCAG']
T -> output: ['TT']
A -> output: []
A -> output: ['AA']
T -> output: []
T -> output: ['TT']
A -> output: ['ATTA']

1. How many nodes does the resulting automaton have?

Root node (1 node)
Pattern “GAAT” adds 4 nodes: G → A → A → T
Pattern “C” adds 1 node: C
Pattern “GCTC” adds 3 new nodes: G → C → T → C (reuses the initial “G” from “GAAT”)
Pattern “TCAG” adds 4 new nodes: T → C → A → G
Pattern “AA” adds 2 new nodes: A → A
Pattern “ATTA” adds 3 new nodes: A → T → T → A (reuses the initial “A” from “AA”)
Pattern “TT” adds 1 new node: T → T (reuses the initial “T” from “TCAG”)
Total nodes: 1 (root) + 4 + 1 + 3 + 4 + 2 + 3 + 1 = 19 nodes
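
The count can be cross-checked by inserting the patterns into a trie and counting how many nodes get created. The following is a minimal sketch, not the assignment's reference code; `TrieNode` and `count_trie_nodes` are hypothetical names introduced only for this illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One trie node: outgoing edges keyed by character, stored as indices
// into a shared node pool. (Hypothetical helper type for this sketch.)
struct TrieNode {
    std::map<char, std::size_t> next;
};

// Inserts every pattern into a trie and returns the total number of nodes,
// root included. A node is created only the first time a prefix is seen,
// so shared prefixes are counted once.
std::size_t count_trie_nodes(const std::vector<std::string>& patterns) {
    std::vector<TrieNode> nodes(1);  // index 0 is the root
    for (const std::string& p : patterns) {
        std::size_t cur = 0;
        for (char c : p) {
            auto it = nodes[cur].next.find(c);
            if (it == nodes[cur].next.end()) {
                // New prefix: allocate a node and link it from the parent.
                std::size_t child = nodes.size();
                nodes.emplace_back();
                nodes[cur].next[c] = child;
                cur = child;
            } else {
                // Prefix already present: reuse the existing node.
                cur = it->second;
            }
        }
    }
    return nodes.size();
}
```

Called with the seven patterns “GAAT”, “C”, “GCTC”, “TCAG”, “AA”, “ATTA”, “TT”, this returns 19.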

2. How many failure links does the resulting automaton have?

Root node: Has no failure link (it’s the starting state).
Direct children of the root (G, C, T, A): Each links back to the root (4 failure links).
Remaining nodes (14 nodes): Each has a failure link pointing to another node in the trie.
Total failure links: 0 (root) + 4 (direct children) + 14 (remaining nodes) = 18 failure links.

3. How many dictionary links does the resulting automaton have?

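Under the usual definition, a dictionary link (also known as an output link) goes from a node to the nearest node on its failure-link chain at which a complete pattern ends. A node has one only if such a node exists, so the count is not simply the number of patterns.
Each of the 7 input patterns (“GAAT”, “C”, “GCTC”, “TCAG”, “AA”, “ATTA”, “TT”) ends at a unique node in the trie, which gives 7 pattern-ending nodes.
The nodes with a dictionary link are: GAA → AA, GC → C, GCTC → C (through its failure target TC), TC → C, and ATT → TT. These are exactly the nodes in the listing above whose output contains a pattern other than the node's own string.
Total dictionary links: 5.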
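
The failure-link and dictionary-link counts can be checked with a breadth-first construction. The sketch below assumes the common definitions: a failure link points to the longest proper suffix of a node's string that is also a trie node, and a dictionary link points to the nearest pattern-ending node on the failure chain. `AcNode`, `LinkCounts` and `count_links` are hypothetical names, and this is an illustration rather than the course's implementation.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct AcNode {
    std::map<char, int> next;  // trie edges
    int fail = -1;             // failure link (-1 means none; only the root)
    int dict = -1;             // dictionary link (-1 means none)
    bool is_end = false;       // true if some pattern ends at this node
};

struct LinkCounts {
    std::size_t nodes;
    std::size_t failure_links;
    std::size_t dictionary_links;
};

LinkCounts count_links(const std::vector<std::string>& patterns) {
    std::vector<AcNode> t(1);  // index 0 is the root
    // Step 1: build the trie.
    for (const std::string& p : patterns) {
        int cur = 0;
        for (char c : p) {
            auto it = t[cur].next.find(c);
            if (it == t[cur].next.end()) {
                int child = static_cast<int>(t.size());
                t.emplace_back();
                t[cur].next[c] = child;
                cur = child;
            } else {
                cur = it->second;
            }
        }
        t[cur].is_end = true;
    }
    // Step 2: breadth-first pass, so links of shallower nodes are ready first.
    std::queue<int> q;
    q.push(0);
    while (!q.empty()) {
        int u = q.front();
        q.pop();
        for (const auto& edge : t[u].next) {
            char c = edge.first;
            int v = edge.second;
            // Follow u's failure chain until a node with a c-edge is found.
            int f = t[u].fail;
            while (f != -1 && t[f].next.count(c) == 0) f = t[f].fail;
            t[v].fail = (f == -1) ? 0 : t[f].next.at(c);
            // Dictionary link: the failure target if a pattern ends there,
            // otherwise whatever the failure target's dictionary link is.
            int w = t[v].fail;
            t[v].dict = t[w].is_end ? w : t[w].dict;
            q.push(v);
        }
    }
    // Step 3: count the links that exist.
    LinkCounts r{t.size(), 0, 0};
    for (const AcNode& n : t) {
        if (n.fail != -1) ++r.failure_links;
        if (n.dict != -1) ++r.dictionary_links;
    }
    return r;
}
```

For the seven patterns of this exercise the function reports 19 nodes and 18 failure links. Nodes whose failure chain contains no pattern-ending node get no dictionary link, which is why that count depends on the patterns' overlaps rather than on how many patterns there are.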

PA6.2

Nodes: 121 (includes all unique prefix nodes)
Failure links: 120 (one per non-root node)
Dictionary links: 11 (corresponding to the 11 input patterns)

PA6.3

Suffix array: all suffixes with their starting positions
0 TCCAAT
1 CCAAT
2 CAAT
3 AAT
4 AT
5 T

Sorted in lexicographic order, the starting positions are:
3,4,2,1,5,0
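
A hand-derived ordering like the one above can be checked mechanically: the array must be a permutation of the positions, and each suffix must be strictly smaller than the next one. The following is a small sketch of such a check; `is_suffix_array` is a hypothetical helper name, not part of the assignment.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true if `sa` lists every start position of `s` exactly once and
// the suffixes appear in strictly increasing lexicographic order.
bool is_suffix_array(const std::string& s, const std::vector<std::size_t>& sa) {
    const std::size_t n = s.size();
    if (sa.size() != n) return false;

    // Every position 0..n-1 must occur exactly once.
    std::vector<bool> seen(n, false);
    for (std::size_t p : sa) {
        if (p >= n || seen[p]) return false;
        seen[p] = true;
    }

    // Adjacent suffixes must be in strictly increasing order.
    // compare() works on the ranges in place, so no suffix is copied.
    for (std::size_t k = 1; k < n; ++k) {
        if (s.compare(sa[k - 1], std::string::npos,
                      s, sa[k], std::string::npos) >= 0) {
            return false;
        }
    }
    return true;
}
```

For example, `is_suffix_array("TCCAAT", {3, 4, 2, 1, 5, 0})` returns true, while swapping any two entries makes it return false.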

PA6.4

Definition: the suffix array of a string is the array of starting positions of all its suffixes, listed in the lexicographic order of those suffixes.
List all suffixes of the string “CGGAAGGTCTA”:
position 0: “CGGAAGGTCTA”;
position 1: “GGAAGGTCTA”;
position 2: “GAAGGTCTA”;
position 3: “AAGGTCTA”;
position 4: “AGGTCTA”;
position 5: “GGTCTA”;
position 6: “GTCTA”;
position 7: “TCTA”;
position 8: “CTA”;
position 9: “TA”;
position 10: “A”.
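Sort the suffixes lexicographically:
“A” (position 10) is the smallest;
next is “AAGGTCTA” (position 3);
then “AGGTCTA” (position 4);
then “CGGAAGGTCTA” (position 0), which precedes “CTA” because “CG” < “CT”;
then “CTA” (position 8);
then “GAAGGTCTA” (position 2);
then “GGAAGGTCTA” (position 1), which precedes “GGTCTA” because “GGA” < “GGT”;
then “GGTCTA” (position 5);
then “GTCTA” (position 6);
then “TA” (position 9);
and the largest is “TCTA” (position 7).
Read off the suffix array: the starting positions in sorted order give
[10,3,4,0,8,2,1,5,6,9,7]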

PA6.5 Implementing the suffix array in code

Core code:

#include <string>
#include <vector>
#include <algorithm>
#include "SuffixArray.h"
using namespace std;

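// Comparator functor: orders two start positions by the suffixes they denote
struct SuffixComparator {
    const string& s; // reference to the input string (must outlive the comparator)
    // Constructor: binds the string reference
    explicit SuffixComparator(const string& str) : s(str) {}
    // Returns true if the suffix starting at i is lexicographically smaller than the one at j
    bool operator()(unsigned int i, unsigned int j) const {
        // Compare in place; s.substr(i) < s.substr(j) would copy both suffixes on every call
        return s.compare(i, string::npos, s, j, string::npos) < 0;
    }
};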

vector<unsigned int> suffix_array(const string & s) {
    vector<unsigned int> indices;
    for (unsigned int i = 0; i < s.length(); ++i) {
        indices.push_back(i);
    }
    // Sort the start positions, using the functor as the comparison rule
    sort(indices.begin(), indices.end(), SuffixComparator(s)); 
    return indices;
}
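
Sorting with a comparator that compares whole suffixes costs up to O(n) per comparison, so the approach above is roughly O(n² log n) in the worst case. For longer inputs, one commonly used alternative is prefix doubling, which sorts by the first k characters and doubles k each round, for about O(n log² n) overall. The following is a sketch of that different technique, not the method used in the code above; `suffix_array_doubling` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Prefix-doubling suffix array construction (illustrative sketch).
// After the round with offset k, rank[i] orders suffixes by their first 2k characters.
std::vector<unsigned int> suffix_array_doubling(const std::string& s) {
    const std::size_t n = s.size();
    std::vector<unsigned int> sa(n);
    std::iota(sa.begin(), sa.end(), 0u);  // 0, 1, ..., n-1
    if (n <= 1) return sa;

    // Initial ranks: the character codes themselves.
    std::vector<int> rank(n), tmp(n);
    for (std::size_t i = 0; i < n; ++i) rank[i] = static_cast<unsigned char>(s[i]);

    for (std::size_t k = 1;; k *= 2) {
        // Order by (rank[i], rank[i + k]); a suffix shorter than k gets -1
        // as its second key, so a shorter suffix sorts before a longer one.
        auto key_less = [&](unsigned int a, unsigned int b) {
            if (rank[a] != rank[b]) return rank[a] < rank[b];
            int ra = (a + k < n) ? rank[a + k] : -1;
            int rb = (b + k < n) ? rank[b + k] : -1;
            return ra < rb;
        };
        std::sort(sa.begin(), sa.end(), key_less);

        // Re-rank: equal keys share a rank, a larger key gets the next rank.
        tmp[sa[0]] = 0;
        for (std::size_t i = 1; i < n; ++i) {
            tmp[sa[i]] = tmp[sa[i - 1]] + (key_less(sa[i - 1], sa[i]) ? 1 : 0);
        }
        rank = tmp;

        // Stop once every suffix has a distinct rank.
        if (static_cast<std::size_t>(rank[sa[n - 1]]) == n - 1) break;
    }
    return sa;
}
```

It returns the same array as the comparator-based version, for instance 3,4,2,1,5,0 for “TCCAAT”. For short strings like the ones in these exercises the simpler comparator version is perfectly adequate.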