Suffix tree (from Wikipedia)

http://en.wikipedia.org/wiki/Suffix_tree 

 

Suffix tree

From Wikipedia, the free encyclopedia

Figure: Suffix tree for the string BANANA padded with $. The six paths from the root to a leaf (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the boxes give the start position of the corresponding suffix. Suffix links are drawn dashed.

In computer science, a suffix tree (also called suffix trie, PAT tree or, in an earlier form, position tree) is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many important string operations.

The suffix tree for a string S is a tree whose edges are labeled with strings, and such that each suffix of S corresponds to exactly one path from the tree's root to a leaf. It is thus a radix tree (more specifically, a Patricia trie) for the suffixes of S.

Constructing such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself.


History

The concept was first introduced as a position tree by Weiner in 1973 [1], in a paper which Donald Knuth subsequently characterized as "Algorithm of the Year 1973". The construction was greatly simplified by McCreight in 1976 [2] and by Ukkonen in 1995 [3][4]. Ukkonen provided the first linear-time online construction of suffix trees, now known as Ukkonen's algorithm.

Definition

The suffix tree for the string S of length n is defined as a tree such that ([5] page 90):

  • the paths from the root to the leaves have a one-to-one relationship with the suffixes of S,
  • edges spell non-empty strings,
  • and all internal nodes (except perhaps the root) have at least two children.

Since such a tree does not exist for all strings, S is padded with a terminal symbol not seen in the string (usually denoted $). This ensures that no suffix is a prefix of another, and that there will be n leaf nodes, one for each of the n suffixes of S. Since all internal non-root nodes are branching, there can be at most n − 1 such nodes, and therefore at most n + (n − 1) + 1 = 2n nodes in total (n leaves, n − 1 internal non-root nodes, and the root).
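The definition can be made concrete with a minimal Python sketch. It assumes a naive quadratic-time construction that inserts one suffix at a time, which is not one of the linear-time algorithms of Weiner, McCreight or Ukkonen. The names Node, build_suffix_tree and count_nodes are illustrative and do not come from the article; positions are 0-based, and edge labels are stored as index ranges into the padded text.

```python
class Node:
    """A node together with the label of the edge leading into it."""

    def __init__(self, start=0, end=0, suffix_index=None):
        self.start = start                # edge label is text[start:end]
        self.end = end
        self.children = {}                # first character of edge -> child
        self.suffix_index = suffix_index  # start of the suffix (leaves only)


def build_suffix_tree(s, terminal="$"):
    """Naive O(n^2) construction: insert each padded suffix of s in turn."""
    if terminal in s:
        raise ValueError("terminal symbol must not occur in the string")
    text = s + terminal
    root = Node()
    for i in range(len(s)):               # one leaf per suffix of s
        node, pos = root, i
        while True:
            child = node.children.get(text[pos])
            if child is None:             # no edge starts with this character
                node.children[text[pos]] = Node(pos, len(text), i)
                break
            # Walk along the edge while it agrees with the suffix.
            k = 0
            edge_len = child.end - child.start
            while k < edge_len and text[child.start + k] == text[pos + k]:
                k += 1
            if k == edge_len:             # whole edge matched: descend
                node, pos = child, pos + k
                continue
            # Mismatch inside the edge: split it at offset k.
            mid = Node(child.start, child.start + k)
            node.children[text[child.start]] = mid
            child.start += k
            mid.children[text[child.start]] = child
            mid.children[text[pos + k]] = Node(pos + k, len(text), i)
            break
    return text, root


def count_nodes(root):
    """Return (number of leaves, number of internal non-root nodes)."""
    leaves = internal = 0
    stack = list(root.children.values())
    while stack:
        node = stack.pop()
        if node.children:
            internal += 1
            stack.extend(node.children.values())
        else:
            leaves += 1
    return leaves, internal


text, root = build_suffix_tree("BANANA")
print(count_nodes(root))  # (6, 3): six leaves, internal nodes A, ANA and NA
```

For BANANA (n = 6) this gives 6 leaves and 3 internal non-root nodes, within the bound of n − 1 = 5 stated above.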

Suffix links are a key feature for linear-time construction of the tree. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string χα, where χ is a single character and α is a string (possibly empty), it has a suffix link to the internal node representing α. See for example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used in some algorithms running on the tree.
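The suffix-link property can be checked by brute force on a small example. The sketch below relies on the fact that the internal non-root nodes of the suffix tree of a padded string correspond to the substrings that are followed by at least two different characters; it enumerates those substrings directly rather than building the tree. The function names are hypothetical, and the approach is only practical for short strings.

```python
from collections import defaultdict


def internal_node_labels(text):
    """Path labels of internal non-root nodes, found by brute force.

    A substring w labels an internal node exactly when it is followed by
    at least two distinct characters somewhere in the padded text.
    """
    followers = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n):
            followers[text[i:j]].add(text[j])  # w = text[i:j], next = text[j]
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(text):
    """Map each internal node label to the label its suffix link points to."""
    labels = internal_node_labels(text)
    links = {}
    for w in labels:
        target = w[1:]                 # drop the first character
        # The target is either another internal node or the root ("").
        assert target == "" or target in labels
        links[w] = target
    return links


print(suffix_links("BANANA$"))
# The links are ANA -> NA, NA -> A, and A -> "" (the root).
```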

Functionality

A suffix tree for a string S of length n can be built in Θ(n) time if the alphabet has constant size or consists of integers [6]. The costs below assume a constant-size alphabet; otherwise, both the construction time and the cost of each operation depend on the implementation (see Implementation below).

Assume that a suffix tree has been built for the string S of length n, or that a generalised suffix tree has been built for the set of strings D = {S1, S2, ..., SK} of total length n = n1 + n2 + ... + nK. You can:

  • Search for strings:
    • Check if a string P of length m is a substring in O(m) time ([5] page 92).
    • Find the first occurrence of the patterns P1,...,Pq of total length m as substrings in O(m) time, when the suffix tree is built using Ukkonen's algorithm.
    • Find all z occurrences of the patterns P1,...,Pq of total length m as substrings in O(m + z) time ([5] page 123).
    • Search for a regular expression P in time expected sublinear in n ([7]).
    • Find for each suffix of a pattern P, the length of the longest match between a prefix of P[i...m] and a substring in D in Θ(m) time ([5] page 132). This is termed the matching statistics for P.
  • Find properties of the strings:
    • Find the longest common substrings of the string Si and Sj in Θ(ni + nj) time ([5] page 125).
    • Find all maximal pairs, maximal repeats or supermaximal repeats in Θ(n + z) time ([5] page 144).
    • Find the Lempel-Ziv decomposition in Θ(n) time ([5] page 166).
    • Find the longest repeated substrings in Θ(n) time.
    • Find the most frequently occurring substrings of a minimum length in Θ(n) time.
    • Find the shortest strings from Σ that do not occur in D, in O(n + z) time, if there are z such strings.
    • Find the shortest substrings occurring only once in Θ(n) time.
    • Find, for each i, the shortest substrings of Si not occurring elsewhere in D in Θ(n) time.
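The basic search operations in the list above can be illustrated with a small Python sketch. For brevity it assumes an uncompressed suffix trie (one node per character, quadratic space) instead of a true suffix tree, so only the O(m) descent along the pattern matches the stated bound; collecting the hits visits the whole subtree rather than O(z) nodes. The names build_suffix_trie and find_occurrences are illustrative, and the pattern is assumed not to contain the terminal symbol.

```python
def build_suffix_trie(s, terminal="$"):
    """Insert every padded suffix of s into a nested-dict trie.

    Each leaf stores the 0-based start position of its suffix under
    the key None, which cannot collide with a character key.
    """
    text = s + terminal
    root = {}
    for i in range(len(s)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node[None] = i
    return root


def find_occurrences(trie, pattern):
    """Return the sorted start positions of pattern, or [] if absent."""
    node = trie
    for ch in pattern:                 # O(m) descent along the pattern
        node = node.get(ch)
        if node is None:
            return []
    hits, stack = [], [node]
    while stack:                       # every leaf below is an occurrence
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                hits.append(child)
            else:
                stack.append(child)
    return sorted(hits)


trie = build_suffix_trie("BANANA")
print(find_occurrences(trie, "ANA"))  # [1, 3]
print(find_occurrences(trie, "NAB"))  # []
```

A pattern is a substring exactly when the descent succeeds, so the substring check is `bool(find_occurrences(trie, pattern))` for a non-empty pattern.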

The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in Θ(n) time ([5] chapter 8). You can then also:

  • Find the longest common prefix between the suffixes Si[p..ni] and Sj[q..nj] in Θ(1) ([5] page 196).
  • Search for a pattern P of length m with at most k mismatches in O(kn + z) time, where z is the number of hits ([5] page 200).
  • Find all z maximal palindromes in Θ(n) time ([5] page 198), or Θ(gn) time if gaps of length g are allowed, or Θ(kn) time if k mismatches are allowed ([5] page 201).
  • Find all z tandem repeats in O(n log n + z) time, and k-mismatch tandem repeats in O(kn log(n/k) + z) time ([5] page 204).
  • Find the longest substrings common to at least k strings in D for k = 2..K in Θ(n) time ([5] page 205).

Uses

Suffix trees are often used in bioinformatics applications, where they are used to search for patterns in DNA or protein sequences, which can be viewed as long strings of characters. The ability to search efficiently with mismatches might be the suffix tree's greatest strength. Suffix trees are also used in data compression: they can be used to find repeated data, and they can be used for the sorting stage of the Burrows-Wheeler transform. Variants of the LZ77 compression scheme, such as LZSS, use them. A suffix tree is also used in suffix tree clustering, a data clustering algorithm used in some search engines (first introduced in [8]).

Implementation

If each node and edge can be represented in Θ(1) space, the entire tree can be represented in Θ(n) space. The total length of all edge labels in the tree is O(n²), but each edge can be stored as the position and length of a substring of S, giving a total space usage of Θ(n) computer words. The worst-case space usage of a suffix tree is seen with a Fibonacci string, giving the full 2n nodes.
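The position-and-length representation can be sketched as follows. The Edge class and the edge list are illustrative assumptions; the nine edges were written out by hand for the BANANA$ tree from the figure, using 0-based positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """An edge label stored as two integers instead of a copied string."""
    start: int   # 0-based position of the label in the padded text
    length: int  # number of characters in the label

    def label(self, text):
        return text[self.start:self.start + self.length]


TEXT = "BANANA$"

# The nine edges of the suffix tree for BANANA$ (hand-derived).
EDGES = [
    Edge(0, 7),  # root -> leaf      "BANANA$"
    Edge(1, 1),  # root -> A         "A"
    Edge(2, 2),  # A    -> ANA       "NA"
    Edge(6, 1),  # A    -> leaf      "$"
    Edge(4, 3),  # ANA  -> leaf      "NA$"
    Edge(6, 1),  # ANA  -> leaf      "$"
    Edge(2, 2),  # root -> NA        "NA"
    Edge(4, 3),  # NA   -> leaf      "NA$"
    Edge(6, 1),  # NA   -> leaf      "$"
]

total_label_chars = sum(e.length for e in EDGES)  # characters if copied
stored_words = 2 * len(EDGES)                     # two integers per edge
print(total_label_chars, stored_words)            # 21 18
```

The gap is small for a seven-character text, but the number of stored integers grows linearly with n while the summed label length can grow quadratically.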

An important choice when implementing a suffix tree is how to represent the parent-child relationships between nodes. The most common approach uses linked lists called sibling lists: each node has a pointer to its first child and a pointer to the next node in the child list it is part of. Hash maps, sorted or unsorted arrays (with array doubling), and balanced search trees may also be used, giving different running time properties. We are interested in:

  • The cost of finding the child on a given character.
  • The cost of inserting a child.
  • The cost of enumerating all children of a node (divided by the number of children in the table below).

Let σ be the size of the alphabet. Then you have the following costs:

                                  Lookup     Insertion  Traversal
Sibling lists / unsorted arrays   O(σ)       Θ(1)       Θ(1)
Hash maps                         Θ(1)       Θ(1)       O(σ)
Balanced search tree              O(log σ)   O(log σ)   O(1)
Sorted arrays                     O(log σ)   O(σ)       O(1)
Hash maps + sibling lists         O(1)       O(1)       O(1)

Note that the insertion costs are amortised, and that the costs for hash maps assume perfect hashing.
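As a rough illustration of the first row of the table, a sibling-list node might look like the sketch below. The class and method names are hypothetical. Lookup scans up to σ siblings, insertion at the head of the list takes constant time, and traversal costs constant time per child.

```python
class SiblingListNode:
    """A node that stores its children as a singly linked sibling list."""

    def __init__(self, char=None):
        self.char = char            # first character of the incoming edge
        self.first_child = None     # head of this node's child list
        self.next_sibling = None    # next node in the parent's child list

    def add_child(self, char):
        """Insert a new child at the head of the list: Θ(1)."""
        child = SiblingListNode(char)
        child.next_sibling = self.first_child
        self.first_child = child
        return child

    def find_child(self, char):
        """Scan the sibling list for a matching character: O(σ)."""
        child = self.first_child
        while child is not None and child.char != char:
            child = child.next_sibling
        return child

    def children(self):
        """Yield all children: Θ(1) per child."""
        child = self.first_child
        while child is not None:
            yield child
            child = child.next_sibling


root = SiblingListNode()
for ch in "ABN":
    root.add_child(ch)
print([c.char for c in root.children()])  # ['N', 'B', 'A']
print(root.find_child("Z"))               # None
```

Replacing the linked list with a dict keyed by character would correspond to the hash map row: constant expected lookup, at the cost of more memory per node.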

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.

References

  1. P. Weiner (1973). "Linear pattern matching algorithm". 14th Annual IEEE Symposium on Switching and Automata Theory: 1-11.
  2. Edward M. McCreight (1976). "A Space-Economical Suffix Tree Construction Algorithm". Journal of the ACM 23 (2): 262-272.
  3. E. Ukkonen (1995). "On-line construction of suffix trees". Algorithmica 14 (3): 249-260.
  4. R. Giegerich and S. Kurtz (1997). "From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction". Algorithmica 19 (3): 331-353.
  5. Dan Gusfield (1999) [1997]. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press. ISBN 0-521-58519-8.
  6. Martin Farach (1997). "Optimal suffix tree construction with large alphabets". Foundations of Computer Science, 38th Annual Symposium on: 137-143.
  7. Ricardo A. Baeza-Yates and Gaston H. Gonnet (1996). "Fast text searching for regular expressions or automaton searching on tries". Journal of the ACM 43: 915-936. ACM Press. doi:10.1145/235809.235810. ISSN 0004-5411.
  8. Oren Zamir and Oren Etzioni (1998). "Web document clustering: a feasibility demonstration". SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval: 46-54, ACM.
