Suffix tree (from Wikipedia)

http://en.wikipedia.org/wiki/Suffix_tree 

 

Suffix tree

From Wikipedia, the free encyclopedia

Figure: Suffix tree for the string BANANA padded with $. The six paths from the root to a leaf (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the boxes give the start position of the corresponding suffix. Suffix links are drawn dashed.

In computer science, a suffix tree (also called suffix trie, PAT tree or, in an earlier form, position tree) is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many important string operations.

The suffix tree for a string S is a tree whose edges are labeled with strings, and such that each suffix of S corresponds to exactly one path from the tree's root to a leaf. It is thus a radix tree (more specifically, a Patricia trie) for the suffixes of S.

Constructing such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself.

History

The concept was first introduced as a position tree by Weiner in 1973 [1], in a paper which Donald Knuth subsequently characterized as "Algorithm of the Year 1973". The construction was greatly simplified by McCreight in 1976 [2], and also by Ukkonen in 1995 [3][4]. Ukkonen provided the first linear-time online construction of suffix trees, now known as Ukkonen's algorithm.

Definition

The suffix tree for the string S of length n is defined as a tree such that ([5] page 90):

  • the paths from the root to the leaves have a one-to-one relationship with the suffixes of S,
  • edges spell non-empty strings,
  • and all internal nodes (except perhaps the root) have at least two children.

Since such a tree does not exist for all strings, S is padded with a terminal symbol not seen in the string (usually denoted $). This ensures that no suffix is a prefix of another, and that there will be n leaf nodes, one for each of the n suffixes of S. Since all internal non-root nodes are branching, there can be at most n − 1 such nodes, and therefore at most n + (n − 1) + 1 = 2n nodes in total (n leaves, n − 1 internal non-root nodes, and the root).
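The definition above can be checked on a small input. The following is a minimal sketch, not one of the linear-time algorithms mentioned in this article: it inserts the suffixes one at a time and splits edges where needed, which takes quadratic time. The names Node, build_suffix_tree, leaf_paths and count_nodes are made up for this illustration.

```python
class Node:
    """Suffix tree node. children maps the first character of an edge
    label to a (label, child) pair; leaves record their suffix start."""

    def __init__(self):
        self.children = {}
        self.start = None


def build_suffix_tree(text, terminal="$"):
    """Naive O(n^2) construction: insert each suffix of text + terminal."""
    if terminal in text:
        raise ValueError("terminal symbol must not occur in the text")
    s = text + terminal
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            entry = node.children.get(rest[0])
            if entry is None:
                # No edge starts with this character: hang a new leaf here.
                leaf = Node()
                leaf.start = i
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = entry
            k = 0
            while k < min(len(label), len(rest)) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # Whole edge matched: continue below the child.
                node, rest = child, rest[k:]
            else:
                # Mismatch inside the edge: split it at offset k.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[label[0]] = (label[:k], mid)
                node, rest = mid, rest[k:]
    return root


def leaf_paths(node, prefix=""):
    """Yield (path label, suffix start) for every leaf below node."""
    if not node.children:
        yield prefix, node.start
    for label, child in node.children.values():
        yield from leaf_paths(child, prefix + label)


def count_nodes(node):
    """Return (leaves, internal nodes including the root) below node."""
    if not node.children:
        return 1, 0
    leaves, internal = 0, 1
    for _, child in node.children.values():
        sub_leaves, sub_internal = count_nodes(child)
        leaves += sub_leaves
        internal += sub_internal
    return leaves, internal


tree = build_suffix_tree("BANANA")
print(sorted(leaf_paths(tree)))  # one (suffix, start position) pair per leaf
print(count_nodes(tree))         # (7, 4)
```

Here the padded string BANANA$ has length 7 (the suffix $ is counted as well, and positions are 0-based), so the tree has 7 leaves and 4 internal nodes including the root, within the 2n bound.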

Suffix links are a key feature for linear-time construction of the tree. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string χα, where χ is a single character and α is a string (possibly empty), it has a suffix link to the internal node representing α. See for example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used in some algorithms running on the tree.
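To illustrate which nodes the suffix links connect, the sketch below avoids building the tree at all. It relies on the fact that a non-empty string labels an internal node exactly when it occurs in the padded string followed by at least two different characters. The function names are hypothetical and the enumeration is brute force, meant only for small examples.

```python
def internal_node_labels(text, terminal="$"):
    """Path labels of the internal nodes of the suffix tree of text + terminal.

    A non-empty substring labels an internal node exactly when it is
    followed by at least two distinct characters in the padded string.
    The empty string stands for the root.
    """
    s = text + terminal
    followers = {}
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            # s[i:j] is a non-empty substring followed by the character s[j].
            followers.setdefault(s[i:j], set()).add(s[j])
    labels = {w for w, chars in followers.items() if len(chars) >= 2}
    labels.add("")
    return labels


def suffix_links(text, terminal="$"):
    """Map each internal non-root label to the label its suffix link targets."""
    labels = internal_node_labels(text, terminal)
    links = {}
    for w in labels:
        if w:
            target = w[1:]  # drop the first character
            assert target in labels, "suffix link target must be a node"
            links[w] = target
    return links


print(suffix_links("BANANA"))
```

For BANANA this yields the links ANA to NA, NA to A, and A to the root (the empty label), matching the link from ANA to NA described above.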

Functionality

A suffix tree for a string S of length n can be built in Θ(n) time if the alphabet is of constant size or consists of integers [6]. Otherwise, the construction time depends on the implementation (see below). The costs below are given under the assumption that the alphabet is of constant size.

Assume that a suffix tree has been built for the string S of length n, or that a generalised suffix tree has been built for the set of strings D = {S_1, S_2, ..., S_K} of total length n = n_1 + n_2 + ... + n_K, where n_i is the length of S_i. You can:

  • Search for strings:
    • Check if a string P of length m is a substring in O(m) time ([5] page 92).
    • Find the first occurrence of the patterns P_1, ..., P_q of total length m as substrings in O(m) time, when the suffix tree is built using Ukkonen's algorithm.
    • Find all z occurrences of the patterns P_1, ..., P_q of total length m as substrings in O(m + z) time ([5] page 123).
    • Search for a regular expression P in time expected sublinear in n ([7]).
    • For each suffix P[i..m] of a pattern P, find the length of the longest match between a prefix of P[i..m] and a substring in D, in Θ(m) time overall ([5] page 132). This is termed the matching statistics for P.
  • Find properties of the strings:
    • Find the longest common substrings of the strings S_i and S_j in Θ(n_i + n_j) time ([5] page 125).
    • Find all maximal pairs, maximal repeats or supermaximal repeats in Θ(n + z) time ([5] page 144).
    • Find the Lempel-Ziv decomposition in Θ(n) time ([5] page 166).
    • Find the longest repeated substrings in Θ(n) time.
    • Find the most frequently occurring substrings of a minimum length in Θ(n) time.
    • Find the shortest strings over the alphabet Σ that do not occur in D, in O(n + z) time, if there are z such strings.
    • Find the shortest substrings occurring only once in Θ(n) time.
    • Find, for each i, the shortest substrings of S_i not occurring elsewhere in D in Θ(n) time.
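A few of the queries in the list above can be sketched with very little code. As a simplifying assumption, the example uses an uncompressed suffix trie built from nested dictionaries instead of a real suffix tree. It needs quadratic space, but the traversal logic is the same: follow the pattern from the root, then look at the leaves below the node that was reached. All function names are invented for this illustration.

```python
def build_suffix_trie(text, terminal="$"):
    """Nested-dict trie of all suffixes of text + terminal.

    Uncompressed stand-in for a suffix tree (quadratic size). The key
    None marks a leaf and stores the start position of its suffix.
    """
    s = text + terminal
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
        node[None] = i
    return root


def descend(trie, pattern):
    """Follow pattern from the root; return the node reached or None."""
    node = trie
    for ch in pattern:
        node = node.get(ch)
        if node is None:
            return None
    return node


def contains(trie, pattern):
    """Substring check: one step per pattern character."""
    return descend(trie, pattern) is not None


def occurrences(trie, pattern):
    """Sorted start positions of pattern: the leaves below the reached node."""
    node = descend(trie, pattern)
    if node is None:
        return []
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


def longest_repeated_substring(trie):
    """Path label of a deepest branching node (ties broken arbitrarily)."""
    best = ""
    stack = [(trie, "")]
    while stack:
        node, path = stack.pop()
        kids = [(k, c) for k, c in node.items() if k is not None]
        if len(kids) >= 2 and len(path) > len(best):
            best = path
        for key, child in kids:
            stack.append((child, path + key))
    return best


trie = build_suffix_trie("BANANA")
print(contains(trie, "NAN"))             # True
print(occurrences(trie, "ANA"))          # [1, 3]
print(longest_repeated_substring(trie))  # ANA
```

Positions are 0-based. In a real suffix tree the same descent walks compressed edges, and the number of leaves visited is what gives the O(m + z) bound for reporting z occurrences.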

The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in Θ(n) time ([5] chapter 8). You can then also:

  • Find the longest common prefix between the suffixes S_i[p..n_i] and S_j[q..n_j] in Θ(1) time ([5] page 196).
  • Search for a pattern P of length m with at most k mismatches in O(kn + z) time, where z is the number of hits ([5] page 200).
  • Find all z maximal palindromes in Θ(n) time ([5] page 198), or Θ(gn) time if gaps of length g are allowed, or Θ(kn) time if k mismatches are allowed ([5] page 201).
  • Find all z tandem repeats in O(n log n + z) time, and k-mismatch tandem repeats in O(kn log(n/k) + z) time ([5] page 204).
  • Find the longest substrings common to at least k strings in D for k = 2..K in Θ(n) time ([5] page 205).
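The k-mismatch bound rests on answering each longest-common-prefix query in constant time. The sketch below shows only the control flow of that idea: at every alignment it jumps over the matching stretch with one query, skips the mismatching character, and gives up after more than k mismatches, so each alignment costs at most k + 1 queries. The lcp function here is a naive character-by-character stand-in, not the constant-time version obtained from lowest common ancestor preprocessing, and both function names are assumptions of this example.

```python
def lcp(a, i, b, j):
    """Length of the longest common prefix of a[i:] and b[j:].

    Naive stand-in: a prepared suffix tree would answer this in O(1).
    """
    k = 0
    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
        k += 1
    return k


def k_mismatch_search(text, pattern, k):
    """Start positions where pattern matches text with at most k mismatches."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        matched, mismatches = 0, 0
        while True:
            # Jump over the longest stretch that matches exactly.
            matched += lcp(text, start + matched, pattern, matched)
            if matched >= m:
                hits.append(start)
                break
            mismatches += 1
            if mismatches > k:
                break
            matched += 1  # step over the mismatching character
    return hits


print(k_mismatch_search("BANANA", "NAN", 0))  # [2]
print(k_mismatch_search("BANANA", "NAN", 1))  # [0, 2]
```

With one allowed mismatch, BAN at position 0 is accepted in addition to the exact match at position 2 (positions are 0-based).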

Uses

Suffix trees are often used in bioinformatics applications, where they are used for searching for patterns in DNA or protein sequences, which can be viewed as long strings of characters. The ability to search efficiently with mismatches might be the suffix tree's greatest strength. Suffix trees are also used in data compression, where on the one hand they are used to find repeated data and on the other hand they can be used for the sorting stage of the Burrows-Wheeler transform. Variants of the Lempel-Ziv compression schemes, such as LZSS, use them. A suffix tree is also used in suffix tree clustering, a data clustering algorithm used in some search engines (first introduced in [8]).

Implementation

If each node and edge can be represented in Θ(1) space, the entire tree can be represented in Θ(n) space. The total length of all the edge labels in the tree is O(n^2), but each edge can be stored as the position and length of a substring of S, giving a total space usage of Θ(n) computer words. The worst-case space usage of a suffix tree is seen with a Fibonacci string, giving the full 2n nodes.
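The gap between the total label length and the number of stored words can be seen on a string whose characters are all distinct: its suffix tree is just a root with one leaf per suffix, so the labels add up to n(n + 1)/2 characters while each edge still fits in two integers. The sketch below derives every edge as a (position, length) pair by brute force, without building a tree. It treats the substrings followed by at least two distinct characters as the internal nodes and the suffixes as the leaves. The function name suffix_tree_edges is hypothetical.

```python
def suffix_tree_edges(text, terminal="$"):
    """Return (padded string, edges) with each edge as (position, length).

    Brute-force sketch for small inputs: internal nodes are the substrings
    followed by at least two distinct characters (plus the root, ""), and
    leaves are the suffixes of the padded string.
    """
    s = text + terminal
    n = len(s)
    followers = {}
    for i in range(n):
        for j in range(i + 1, n):
            followers.setdefault(s[i:j], set()).add(s[j])
    internal = {w for w, chars in followers.items() if len(chars) >= 2}
    internal.add("")
    nodes = internal | {s[i:] for i in range(n)}
    edges = []
    for label in nodes:
        if not label:
            continue  # the root has no incoming edge
        # The parent is the longest proper prefix that is an internal node.
        p = len(label) - 1
        while label[:p] not in internal:
            p -= 1
        start = s.find(label) + p
        edges.append((start, len(label) - p))
    return s, edges


s, edges = suffix_tree_edges("ABCDEF")
label_chars = sum(length for _, length in edges)
print(len(edges), label_chars, 2 * len(edges))  # 7 28 14
```

For the padded string ABCDEF$ the seven edge labels hold 28 characters in total, whereas the (position, length) pairs take 14 integers. Slicing s[start:start + length] recovers any label when it is needed.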

An important choice when making a suffix tree implementation is how to represent the parent-child relationships between nodes. The most common is using linked lists called sibling lists. Each node has a pointer to its first child, and to the next node in the child list it is a part of. Hash maps, sorted/unsorted arrays (with array doubling), and balanced search trees may also be used, giving different running time properties. We are interested in:

  • The cost of finding the child on a given character.
  • The cost of inserting a child.
  • The cost of listing all children of a node (divided by the number of children in the table below).

Let σ be the size of the alphabet. Then you have the following costs:

                                  Lookup     Insertion  Traversal
Sibling lists / unsorted arrays   O(σ)       Θ(1)       Θ(1)
Hash maps                         Θ(1)       Θ(1)       O(σ)
Balanced search tree              O(log σ)   O(log σ)   O(1)
Sorted arrays                     O(log σ)   O(σ)       O(1)
Hash maps + sibling lists         O(1)       O(1)       O(1)

Note that the insertion cost is amortised, and that the costs for hashing assume perfect hashing.
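Two of the child representations from the table can be sketched side by side. The class and method names below are made up for this example, and edge labels are reduced to a single character to keep the focus on the child lookup. Note that Python's dict is a general-purpose hash table, so its constants and traversal behaviour need not match the idealised hash map row of the table.

```python
class SiblingListNode:
    """Children kept in a singly linked list (first child / next sibling)."""

    def __init__(self, char=None):
        self.char = char
        self.first_child = None
        self.next_sibling = None

    def insert_child(self, char):
        """Prepend a child in constant time; assumes char is not present."""
        child = SiblingListNode(char)
        child.next_sibling = self.first_child
        self.first_child = child
        return child

    def find_child(self, char):
        """Walk the sibling list: up to σ steps."""
        child = self.first_child
        while child is not None and child.char != char:
            child = child.next_sibling
        return child

    def children(self):
        """Yield the children, one pointer hop per child."""
        child = self.first_child
        while child is not None:
            yield child
            child = child.next_sibling


class HashMapNode:
    """Children kept in a hash map keyed by the first edge character."""

    def __init__(self, char=None):
        self.char = char
        self.child_by_char = {}

    def insert_child(self, char):
        child = HashMapNode(char)
        self.child_by_char[char] = child
        return child

    def find_child(self, char):
        """Expected constant-time lookup."""
        return self.child_by_char.get(char)

    def children(self):
        return iter(self.child_by_char.values())


for node_type in (SiblingListNode, HashMapNode):
    root = node_type()
    for ch in "BAN$":
        root.insert_child(ch)
    print(node_type.__name__,
          root.find_child("N").char,
          root.find_child("X"),
          sorted(c.char for c in root.children()))
```

Both variants expose the same three operations, so a tree implementation could switch between them depending on the alphabet size.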

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.

References

  1. P. Weiner (1973). "Linear pattern matching algorithm". 14th Annual IEEE Symposium on Switching and Automata Theory: 1-11.
  2. Edward M. McCreight (1976). "A Space-Economical Suffix Tree Construction Algorithm". Journal of the ACM 23 (2): 262-272.
  3. E. Ukkonen (1995). "On-line construction of suffix trees". Algorithmica 14 (3): 249-260.
  4. R. Giegerich and S. Kurtz (1997). "From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction". Algorithmica 19 (3): 331-353.
  5. Dan Gusfield (1999) [1997]. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press. ISBN 0-521-58519-8.
  6. Martin Farach (1997). "Optimal suffix tree construction with large alphabets". Foundations of Computer Science, 38th Annual Symposium on: 137-143.
  7. Ricardo A. Baeza-Yates and Gaston H. Gonnet (1996). "Fast text searching for regular expressions or automaton searching on tries". Journal of the ACM 43: 915-936. ACM Press. doi:10.1145/235809.235810. ISSN 0004-5411.
  8. Oren Zamir and Oren Etzioni (1998). "Web document clustering: a feasibility demonstration". SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval: 46-54. ACM.
