Suffix Tree (Concepts)

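This post introduces the basic concepts of suffix trees and lists the problems they solve efficiently, such as string search and finding repeated substrings.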


Basic concepts

Precise definitions of suffix, suffix tree, generalised suffix tree, suffix link and so on can all be found on the page below, so I will not repeat them here.

http://www.answers.com/topics/suffix-tree (in practice this is just the Wikipedia article)
 
Two examples first

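Suffix links are hard to show in a character diagram, so they are omitted. Each string is terminated with $, and the number in brackets at each leaf is the 0-based start index of the suffix that ends there.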
 
apple: |--apple$--[0]
       |--e$--[4]
       |--le$--[3]
       |--p--|--le$--[2]
             |--ple$--[1]
 
banana: |--a--|--$--[5]
        |     |--na--|--$--[3]
        |     |      |--na$--[1]
        |--banana$--[0]
        |--na--|--$--[4]
        |      |--na$--[2]
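
To make the two diagrams concrete, here is a minimal sketch that builds the same compressed tree by inserting every suffix one at a time. It is a naive O(n^2) construction, not Ukkonen's linear-time method, and the names `Node`, `build_suffix_tree`, `leaves` and `root_labels` are invented for this illustration. It assumes the text does not contain `$`, and, like the diagrams, it skips the suffix that consists of `$` alone.

```python
class Node:
    def __init__(self, start=None):
        # first character of an edge label -> (edge label, child node)
        self.children = {}
        # start index of the suffix for a leaf, None for an internal node
        self.start = start


def build_suffix_tree(text):
    """Naive O(n^2) build: insert each suffix of text + '$' in turn."""
    s = text + "$"
    root = Node()
    for i in range(len(s) - 1):  # skip the suffix that is only "$"
        node, rest = root, s[i:]
        while True:
            first = rest[0]
            if first not in node.children:
                # no edge starts with this character: hang a new leaf here
                node.children[first] = (rest, Node(start=i))
                break
            label, child = node.children[first]
            # length of the common prefix of the edge label and the rest
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # whole edge matched: continue from the child node
                node, rest = child, rest[k:]
                continue
            # mismatch inside the edge: split it at position k
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            mid.children[rest[k]] = (rest[k:], Node(start=i))
            node.children[first] = (label[:k], mid)
            break
    return root


def leaves(node, prefix=""):
    """Map each full suffix (root-to-leaf path) to its start index."""
    if not node.children:
        return {prefix: node.start}
    out = {}
    for label, child in node.children.values():
        out.update(leaves(child, prefix + label))
    return out


def root_labels(root):
    """Sorted edge labels leaving the root, as in the diagrams."""
    return sorted(label for label, _ in root.children.values())


banana = build_suffix_tree("banana")
print(root_labels(banana))
print(leaves(banana))
```

The terminator `$` guarantees that no suffix is a prefix of another, which is why every suffix ends at its own leaf and the split step never runs off the end of either string.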
 
What suffix trees are good for
 
The same page has a detailed description under its Functionality heading. The entries there are the roadmap for this series of posts, so they are reproduced here.

A suffix tree for a string S of length n can be built in Θ(n) time, if the alphabet is constant or integer. Otherwise, the construction time depends on the implementation. The costs below are given under the assumption that the alphabet is constant. If it is not, the cost depends on the implementation (see below).
 
(The original post followed the paragraph above with a Chinese translation. I found translating to be extremely painful and time-consuming, and decided not to translate any more, so the rest is quoted as-is.)
 
Assume that a suffix tree has been built for the string S of length n, or that a generalised suffix tree has been built for the set of strings D = {S1,S2,...,SK} of total length n = n1 + n2 + ... + nK. You can:

  • Search for strings:
    • Check if a string P of length m is a substring in O(m) time.
    • Find all z occurrences of the patterns P1,...,Pq of total length m as substrings in O(m + z) time.
    • Search for a regular expression P in time expected sublinear on n.
    • Find for each suffix of a pattern P, the length of the longest match between a prefix of P[i...m] and a substring in D in Θ(m) time. This is termed the matching statistics for P.
  • Find properties of the strings:
    • Find the longest common substrings of the string Si and Sj in Θ(ni + nj) time.
    • Find all maximal pairs, maximal repeats or supermaximal repeats in Θ(n + z) time.
    • Find the Lempel-Ziv decomposition in Θ(n) time.
    • Find the longest repeated substrings in Θ(n) time.
    • Find the most frequently occurring substrings of a minimum length in Θ(n) time.
    • Find the shortest strings from Σ that do not occur in D, in O(n + z) time, if there are z such strings.
    • Find the shortest substrings occurring only once in Θ(n) time.
    • Find, for each i, the shortest substrings of Si not occurring elsewhere in D in Θ(n) time.
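
As a rough illustration of two entries in the list above (substring search and the longest repeated substring), the sketch below uses an uncompressed suffix trie rather than a real suffix tree. That is an assumption made for brevity: the trie takes O(n^2) time and space to build, so only the query side reflects the bounds quoted above. The function names and the dictionary layout are invented for this example.

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie.

    Each node is {"next": {char: node}, "starts": [...]}, where "starts"
    lists the start index of every suffix that passes through the node.
    """
    root = {"next": {}, "starts": []}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node["next"].setdefault(ch, {"next": {}, "starts": []})
            node["starts"].append(i)
    return root


def walk(root, pattern):
    """Follow pattern from the root in O(m) steps; None if it falls off."""
    node = root
    for ch in pattern:
        node = node["next"].get(ch)
        if node is None:
            return None
    return node


def contains(root, pattern):
    """True if pattern is a substring of the indexed text."""
    return walk(root, pattern) is not None


def find_occurrences(root, pattern):
    """Start positions of a non-empty pattern, in increasing order."""
    node = walk(root, pattern)
    return list(node["starts"]) if node is not None else []


def longest_repeated_substring(root):
    """Deepest path that at least two suffixes pass through."""
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["next"].items():
            if len(child["starts"]) >= 2:
                if len(path) + 1 > len(best):
                    best = path + ch
                stack.append((child, path + ch))
    return best


trie = build_suffix_trie("banana")
print(contains(trie, "nan"))
print(find_occurrences(trie, "ana"))
print(longest_repeated_substring(trie))
```

Occurrences may overlap: in "banana" the two copies of "ana" share a character. In a real suffix tree the same idea applies, with the repeated substring being the deepest internal node and the occurrences being the leaves below the node where the pattern ends.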

The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in Θ(n) time. You can then also:

  • Find the longest common prefix between the suffixes Si[p..ni] and Sj[q..nj] in Θ(1) time.
  • Search for a pattern P of length m with at most k mismatches in O(kn + z) time, where z is the number of hits.
  • Find all z maximal palindromes in Θ(n), or Θ(gn) time if gaps of length g are allowed, or Θ(kn) if k mismatches are allowed.
  • Find all z tandem repeats in O(n log n + z), and k-mismatch tandem repeats in O(k n log(n/k) + z).
  • Find the longest substrings common to at least k strings in D for k = 2..K in Θ(n) time.
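
The last entry above can be sketched with a generalised structure that records which input strings pass through each node. This is again an uncompressed trie built in quadratic time, chosen only to keep the example short, so it does not achieve the Θ(n) bound and does not use the lowest common ancestor preprocessing. The function name, the `owners` field and the sample strings are all made up for illustration.

```python
def longest_common_substring(strings, k=None):
    """Longest substring that occurs in at least k of the strings.

    k defaults to len(strings), i.e. common to all of them.
    Returns "" when no substring qualifies.
    """
    if k is None:
        k = len(strings)

    # Generalised suffix trie: "owners" is the set of indices of the
    # input strings that have a suffix passing through the node.
    root = {"next": {}, "owners": set()}
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["next"].setdefault(
                    ch, {"next": {}, "owners": set()}
                )
                node["owners"].add(idx)

    # Depth-first search over nodes owned by at least k strings.
    # Pruning is safe: if a string is in >= k inputs, so is every prefix.
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["next"].items():
            if len(child["owners"]) >= k:
                if len(path) + 1 > len(best):
                    best = path + ch
                stack.append((child, path + ch))
    return best


words = ["banana", "ananas", "bandana"]
print(longest_common_substring(words))       # common to all three
print(longest_common_substring(words, k=2))  # common to at least two
```

When several substrings tie for the longest length, this sketch returns whichever one the search happens to reach first.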

Next steps
 
I plan to implement Ukkonen's algorithm for building suffix trees in C++ or Python.
 