Translated from: Ukkonen's suffix tree algorithm in plain English
I feel a bit thick at this point. I've spent days trying to fully wrap my head around suffix tree construction, but because I don't have a mathematical background, many of the explanations elude me as they start to make excessive use of mathematical symbology. The closest to a good explanation that I've found is Fast String Searching With Suffix Trees, but the author glosses over various points and some aspects of the algorithm remain unclear.
A step-by-step explanation of this algorithm here on Stack Overflow would be invaluable for many others besides me, I'm sure.
For reference, here's Ukkonen's paper on the algorithm: http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
My basic understanding, so far:
- I need to iterate through each prefix P of a given string T
- I need to iterate through each suffix S in prefix P and add that to the tree
- To add suffix S to the tree, I need to iterate through each character in S. Each iteration consists of either walking down an existing branch that starts with the same characters C as S, potentially splitting an edge into descendant nodes when I reach a differing character in the suffix, or, when no matching edge is found to walk down for C, creating a new leaf edge for C.
The basic algorithm appears to be O(n²), as is pointed out in most explanations, since we need to step through all of the prefixes and then through each of the suffixes of each prefix. Ukkonen's algorithm is apparently unique because of the suffix pointer technique he uses, though I think that is what I'm having trouble understanding.
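As a point of comparison for the quadratic behaviour described above, here is a minimal sketch of the naive approach. It is not Ukkonen's algorithm: it assumes an uncompressed trie (one character per edge, stored as nested dictionaries) and simply inserts every suffix of the whole string. The names `build_naive_suffix_trie` and `contains` are made up for this illustration.

```python
def build_naive_suffix_trie(text):
    """Insert every suffix of `text` into a nested-dict trie, one character per edge."""
    root = {}
    for start in range(len(text)):      # one pass per suffix
        node = root
        for ch in text[start:]:         # walk down, creating missing nodes on the way
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Return True if `pattern` can be spelled by walking down from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_naive_suffix_trie("abcabx")
print(contains(trie, "cab"))   # True
print(contains(trie, "ax"))    # False
```

Both loops run over the text, so the work (and the number of trie nodes) grows quadratically with the length of the string, which is the cost that Ukkonen's bookkeeping is designed to avoid.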
I'm also having trouble understanding:
- exactly when and how the "active point" is assigned, used and changed
- what is going on with the canonization aspect of the algorithm
- why the implementations I've seen need to "fix" bounding variables that they are using
Here is the completed C# source code. It not only works correctly, but supports automatic canonization and renders a nicer looking text graph of the output. Source code and sample output are at:
https://gist.github.com/2373868
Update 2017-11-04
After many years I've found a new use for suffix trees, and have implemented the algorithm in JavaScript. Gist is below. It should be bug-free. Dump it into a js file, run npm install chalk from the same location, and then run with node.js to see some colourful output. There's a stripped down version in the same Gist, without any of the debugging code.
https://gist.github.com/axefrog/c347bf0f5e0723cbd09b1aaed6ec6fc6
Answer #1
Reference: https://stackoom.com/question/df4v/普通英语的Ukkonen后缀树算法
Answer #2
I tried to implement the Suffix Tree with the approach given in jogojapan's answer, but it didn't work for some cases due to the wording used for the rules. Moreover, I've noticed that nobody managed to implement an absolutely correct suffix tree using this approach. Below I will write an "overview" of jogojapan's answer with some modifications to the rules. I will also describe the case when we forget to create important suffix links.
Additional variables used
- active point - a triple (active_node; active_edge; active_length), showing from where we must start inserting a new suffix.
- remainder - shows the number of suffixes we must add explicitly. For instance, if our word is 'abcaabca', and remainder = 3, it means we must process the 3 last suffixes: bca, ca and a.
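The meaning of remainder can be checked with a few lines of code. This is only an illustration of the definition above; the helper name `pending_suffixes` is an assumption of this sketch and is not part of any implementation in this thread.

```python
def pending_suffixes(text_so_far, remainder):
    """Return the `remainder` shortest suffixes of the text read so far, longest first."""
    n = len(text_so_far)
    return [text_so_far[n - k:] for k in range(remainder, 0, -1)]


print(pending_suffixes("abcaabca", 3))  # ['bca', 'ca', 'a']
```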
Let's use the concept of an internal node: all nodes except the root and the leaves are internal nodes.
Observation 1
When the final suffix we need to insert is found to exist in the tree already, the tree itself is not changed at all (we only update the active point and remainder).
Observation 2
If at some point active_length is greater than or equal to the length of the current edge (edge_length), we move our active point down until edge_length is strictly greater than active_length.
Now, let's redefine the rules:
Rule 1
If after an insertion from the active node = root, the active length is greater than 0, then:
- active node is not changed
- active length is decremented
- active edge is shifted right (to the first character of the next suffix we must insert)
Rule 2
If we create a new internal node OR make an insertion from an internal node, and this is not the first SUCH internal node at the current step, then we link the previous SUCH node with THIS one through a suffix link.
This definition of Rule 2 is different from jogojapan's, as here we take into account not only the newly created internal nodes, but also the internal nodes from which we make an insertion.
Rule 3
After an insert from the active node which is not the root node, we must follow the suffix link and set the active node to the node it points to. If there is no suffix link, set the active node to the root node. Either way, active edge and active length stay unchanged.
In this definition of Rule 3 we also consider the inserts of leaf nodes (not only split-nodes).
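Rules 1 and 3 both describe how the active point moves after an insertion, so they can be sketched as one small function. This is an illustration under assumptions, not code from the answer: active_edge is an index into the text, `remainder` is the value after it has already been decremented for the insertion just made, `pos` is the index of the character currently being added, and the names `Node` and `after_insert` are hypothetical.

```python
class Node:
    def __init__(self):
        self.link = None  # suffix link, or None if the node has none


def after_insert(root, active_node, active_edge, active_length, remainder, pos):
    """Update the active point after one insertion for the character at index `pos`."""
    if active_node is root and active_length > 0:
        # Rule 1: stay at the root, shorten by one, and shift the active edge to the
        # first character of the next suffix that still has to be inserted.
        active_length -= 1
        active_edge = pos - remainder + 1
    elif active_node is not root:
        # Rule 3: follow the suffix link if there is one, otherwise fall back to the root.
        # active_edge and active_length stay unchanged.
        active_node = active_node.link if active_node.link is not None else root
    return active_node, active_edge, active_length


root, n, m = Node(), Node(), Node()
n.link = m
print(after_insert(root, root, 0, 2, 2, 5) == (root, 4, 1))  # True (Rule 1)
print(after_insert(root, n, 3, 2, 1, 5) == (m, 3, 2))        # True (Rule 3, link present)
print(after_insert(root, m, 3, 2, 1, 5) == (root, 3, 2))     # True (Rule 3, no link)
```

The index arithmetic in the Rule 1 branch relies on the fact that the suffixes still pending are the last `remainder` suffixes of the text read so far, so the next one starts at pos - remainder + 1.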
And finally, Observation 3:
When the symbol we want to add to the tree is already on the edge, we, according to Observation 1, update only the active point and remainder, leaving the tree unchanged. BUT if there is an internal node marked as needing a suffix link, we must connect that node with our current active node through a suffix link.
Let's look at the example of a suffix tree for cdddcdc if we add a suffix link in such a case and if we don't:
If we DON'T connect the nodes through a suffix link:
- before adding the last letter c: (figure not preserved)
- after adding the last letter c: (figure not preserved)
If we DO connect the nodes through a suffix link:
- before adding the last letter c: (figure not preserved)
- after adding the last letter c: (figure not preserved)
It seems like there is no significant difference: in the second case there are two more suffix links. But these suffix links are correct, and one of them, from the blue node to the red one, is very important for our approach with the active point. The problem is that if we don't put a suffix link here, then later, when we add some new letters to the tree, we might omit adding some nodes to the tree because of Rule 3: according to it, if there is no suffix link, we must set the active_node to the root.
When we were adding the last letter to the tree, the red node had already existed before we made an insert from the blue node (the edge labelled 'c'). As there was an insert from the blue node, we mark it as needing a suffix link. Then, relying on the active point approach, the active node was set to the red node. But we don't make an insert from the red node, as the letter 'c' is already on the edge. Does it mean that the blue node must be left without a suffix link? No, we must connect the blue node with the red one through a suffix link. Why is it correct? Because the active point approach guarantees that we get to the right place, i.e., to the next place where we must process an insert of a shorter suffix.
Finally, here are my implementations of the Suffix Tree:
I hope that this "overview" combined with jogojapan's detailed answer will help somebody to implement their own Suffix Tree.
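To see how the observations and rules fit together, here is a compact sketch in Python. It is not the answer author's own code. It follows the rules as worded above, with these assumptions labelled: active_edge is stored as an index into the text, a leaf has end = None to mean "open-ended", and the class and method names (`Node`, `SuffixTree`, `_extend`, `contains`) are chosen for this illustration only.

```python
class Node:
    def __init__(self, start, end):
        self.start = start    # edge into this node is text[start:end]
        self.end = end        # None marks an open-ended leaf edge
        self.children = {}    # first character of outgoing edge -> Node
        self.link = None      # suffix link


class SuffixTree:
    def __init__(self, text):
        self.text = text
        self.root = Node(-1, -1)
        self.active_node, self.active_edge, self.active_length = self.root, 0, 0
        self.remainder = 0
        for pos in range(len(text)):
            self._extend(pos)

    def _edge_length(self, node, pos):
        return (pos + 1 if node.end is None else node.end) - node.start

    def _extend(self, pos):
        c = self.text[pos]
        self.remainder += 1
        need_link = None  # last node of this step that still needs a suffix link

        def add_link(node):  # Rule 2
            nonlocal need_link
            if need_link is not None:
                need_link.link = node
            need_link = node

        while self.remainder > 0:
            if self.active_length == 0:
                self.active_edge = pos
            edge_char = self.text[self.active_edge]
            nxt = self.active_node.children.get(edge_char)
            if nxt is None:
                self.active_node.children[edge_char] = Node(pos, None)  # new leaf
                add_link(self.active_node)
            else:
                length = self._edge_length(nxt, pos)
                if self.active_length >= length:  # Observation 2: walk down
                    self.active_edge += length
                    self.active_length -= length
                    self.active_node = nxt
                    continue
                if self.text[nxt.start + self.active_length] == c:
                    self.active_length += 1      # Observation 1: tree unchanged
                    add_link(self.active_node)   # Observation 3
                    break
                split = Node(nxt.start, nxt.start + self.active_length)  # split the edge
                self.active_node.children[edge_char] = split
                split.children[c] = Node(pos, None)
                nxt.start += self.active_length
                split.children[self.text[nxt.start]] = nxt
                add_link(split)
            self.remainder -= 1
            if self.active_node is self.root and self.active_length > 0:  # Rule 1
                self.active_length -= 1
                self.active_edge = pos - self.remainder + 1
            else:                                                          # Rule 3
                link = self.active_node.link
                self.active_node = link if link is not None else self.root

    def contains(self, pattern):
        """Return True if `pattern` is a substring of the text."""
        node, i = self.root, 0
        while i < len(pattern):
            child = node.children.get(pattern[i])
            if child is None:
                return False
            end = len(self.text) if child.end is None else child.end
            label = self.text[child.start:end]
            chunk = pattern[i:i + len(label)]
            if not label.startswith(chunk):
                return False
            i += len(chunk)
            node = child
        return True


tree = SuffixTree("cdddcdc$")
print(tree.contains("ddcd"), tree.contains("cc"))  # True False
```

The sketch appends a terminator ('$') in the caller's input rather than inside the class, so that every suffix ends at its own leaf; without a unique final character some suffixes would remain implicit.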
Answer #3
Hi, I have tried to implement the approach explained above in Ruby; please check it out. It seems to work fine.
The only difference in the implementation is that I have tried to use an edge object instead of just using symbols.
It's also available at https://gist.github.com/suchitpuri/9304856
require 'pry'

class Edge
  attr_accessor :data, :edges, :suffix_link

  def initialize data
    @data = data
    @edges = []
    @suffix_link = nil
  end

  def find_edge element
    self.edges.each do |edge|
      return edge if edge.data.start_with? element
    end
    return nil
  end
end
class SuffixTrees
  attr_accessor :root, :active_point, :remainder, :pending_prefixes, :last_split_edge

  def initialize
    @root = Edge.new nil
    @active_point = { active_node: @root, active_edge: nil, active_length: 0 }
    @pending_prefixes = []
    @last_split_edge = nil
    # The original assigned @remainder twice (0, then 1); only the final value took effect.
    @remainder = 1
  end

  def build string
    string.split("").each_with_index do |element, index|
      add_to_edges @root, element
      update_pending_prefix element
      add_pending_elements_to_tree element
      active_length = @active_point[:active_length]
      # if(@active_point[:active_edge] && @active_point[:active_edge].data && @active_point[:active_edge].data[0..active_length-1] == @active_point[:active_edge].data[active_length..@active_point[:active_edge].data.length-1])
      #   @active_point[:active_edge].data = @active_point[:active_edge].data[0..active_length-1]
      #   @active_point[:active_edge].edges << Edge.new(@active_point[:active_edge].data)
      # end
      # When the active point reaches the end of the active edge, step onto that edge's node.
      if (@active_point[:active_edge] && @active_point[:active_edge].data && @active_point[:active_edge].data.length == @active_point[:active_length])
        @active_point[:active_node] = @active_point[:active_edge]
        @active_point[:active_edge] = @active_point[:active_node].find_edge(element[0])
        @active_point[:active_length] = 0
      end
    end
  end
  def add_pending_elements_to_tree element
    to_be_deleted = []
    update_active_length = false
    # binding.pry
    # The element is already reachable from the active node: only advance the active point.
    if (@active_point[:active_node].find_edge(element[0]) != nil)
      @active_point[:active_length] = @active_point[:active_length] + 1
      @active_point[:active_edge] = @active_point[:active_node].find_edge(element[0]) if @active_point[:active_edge] == nil
      @remainder = @remainder + 1
      return
    end
    @pending_prefixes.each_with_index do |pending_prefix, index|
      # binding.pry
      if @active_point[:active_edge] == nil and @active_point[:active_node].find_edge(element[0]) == nil
        @active_point[:active_node].edges << Edge.new(element)
      else
        # The original called find_edge on an undefined local `node`; the active node is the evident intent.
        @active_point[:active_edge] = @active_point[:active_node].find_edge(element[0]) if @active_point[:active_edge] == nil
        data = @active_point[:active_edge].data
        data = data.split("")
        location = @active_point[:active_length]
        # binding.pry
        if (data[0..location].join == pending_prefix or @active_point[:active_node].find_edge(element) != nil)
        else # tree split
          split_edge data, index, element
        end
      end
    end
  end
  def update_pending_prefix element
    if @active_point[:active_edge] == nil
      @pending_prefixes = [element]
      return
    end
    @pending_prefixes = []
    length = @active_point[:active_edge].data.length
    data = @active_point[:active_edge].data
    @remainder.times do |ctr|
      @pending_prefixes << data[-(ctr+1)..data.length-1]
    end
    @pending_prefixes.reverse!
  end
  def split_edge data, index, element
    location = @active_point[:active_length]
    old_edges = []
    internal_node = (@active_point[:active_edge].edges != nil)
    if (internal_node)
      old_edges = @active_point[:active_edge].edges
      @active_point[:active_edge].edges = []
    end
    @active_point[:active_edge].data = data[0..location-1].join
    @active_point[:active_edge].edges << Edge.new(data[location..data.size].join)
    if internal_node
      @active_point[:active_edge].edges << Edge.new(element)
    else
      @active_point[:active_edge].edges << Edge.new(data.last)
    end
    if internal_node
      @active_point[:active_edge].edges[0].edges = old_edges
    end
    # set up the suffix link
    if @last_split_edge != nil and @last_split_edge.data.end_with?(@active_point[:active_edge].data)
      @last_split_edge.suffix_link = @active_point[:active_edge]
    end
    @last_split_edge = @active_point[:active_edge]
    update_active_point index
  end
  def update_active_point index
    if (@active_point[:active_node] == @root)
      @active_point[:active_length] = @active_point[:active_length] - 1
      @remainder = @remainder - 1
      @active_point[:active_edge] = @active_point[:active_node].find_edge(@pending_prefixes.first[index+1])
    else
      if @active_point[:active_node].suffix_link != nil
        @active_point[:active_node] = @active_point[:active_node].suffix_link
      else
        @active_point[:active_node] = @root
      end
      @active_point[:active_edge] = @active_point[:active_node].find_edge(@active_point[:active_edge].data[0])
      @remainder = @remainder - 1
    end
  end

  def add_to_edges root, element
    return if root == nil
    root.data = root.data + element if (root.data and root.edges.size == 0)
    root.edges.each do |edge|
      add_to_edges edge, element
    end
  end
end

suffix_tree = SuffixTrees.new
suffix_tree.build("abcabxabcd")
binding.pry
Answer #4
@jogojapan you provided an awesome explanation and visualisation. But as @makagonov mentioned, it's missing some rules regarding setting suffix links. This is nicely visible when going step by step through the word 'aabaaabb' on http://brenden.github.io/ukkonen-animation/. When you go from step 10 to step 11, there is no suffix link from node 5 to node 2, but the active point suddenly moves there.
@makagonov since I live in the Java world I also tried to follow your implementation to grasp the ST building workflow, but it was hard for me because of:
- combining edges with nodes
- using index pointers instead of references
- break statements
- continue statements
So I ended up with the following implementation in Java, which I hope reflects all steps in a clearer way and will reduce learning time for other Java people:
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
public class ST {
    public class Node {
        private final int id;
        private final Map<Character, Edge> edges;
        private Node slink;

        public Node(final int id) {
            this.id = id;
            this.edges = new HashMap<>();
        }

        public void setSlink(final Node slink) {
            this.slink = slink;
        }

        public Map<Character, Edge> getEdges() {
            return this.edges;
        }

        public Node getSlink() {
            return this.slink;
        }

        public String toString(final String word) {
            return new StringBuilder()
                    .append("{")
                    .append("\"id\"")
                    .append(":")
                    .append(this.id)
                    .append(",")
                    .append("\"slink\"")
                    .append(":")
                    .append(this.slink != null ? this.slink.id : null)
                    .append(",")
                    .append("\"edges\"")
                    .append(":")
                    .append(edgesToString(word))
                    .append("}")
                    .toString();
        }

        private StringBuilder edgesToString(final String word) {
            final StringBuilder edgesStringBuilder = new StringBuilder();
            edgesStringBuilder.append("{");
            for(final Map.Entry<Character, Edge> entry : this.edges.entrySet()) {
                edgesStringBuilder.append("\"")
                        .append(entry.getKey())
                        .append("\"")
                        .append(":")
                        .append(entry.getValue().toString(word))
                        .append(",");
            }
            if(!this.edges.isEmpty()) {
                edgesStringBuilder.deleteCharAt(edgesStringBuilder.length() - 1);
            }
            edgesStringBuilder.append("}");
            return edgesStringBuilder;
        }

        public boolean contains(final String word, final String suffix) {
            return !suffix.isEmpty()
                    && this.edges.containsKey(suffix.charAt(0))
                    && this.edges.get(suffix.charAt(0)).contains(word, suffix);
        }
    }
    public class Edge {
        private final int from;
        private final int to;
        private final Node next;

        public Edge(final int from, final int to, final Node next) {
            this.from = from;
            this.to = to;
            this.next = next;
        }

        public int getFrom() {
            return this.from;
        }

        public int getTo() {
            return this.to;
        }

        public Node getNext() {
            return this.next;
        }

        public int getLength() {
            return this.to - this.from;
        }

        public String toString(final String word) {
            return new StringBuilder()
                    .append("{")
                    .append("\"content\"")
                    .append(":")
                    .append("\"")
                    .append(word.substring(this.from, this.to))
                    .append("\"")
                    .append(",")
                    .append("\"next\"")
                    .append(":")
                    .append(this.next != null ? this.next.toString(word) : null)
                    .append("}")
                    .toString();
        }

        public boolean contains(final String word, final String suffix) {
            if(this.next == null) {
                return word.substring(this.from, this.to).equals(suffix);
            }
            return suffix.startsWith(word.substring(this.from, this.to))
                    && this.next.contains(word, suffix.substring(this.to - this.from));
        }
    }
    public class ActivePoint {
        private final Node activeNode;
        private final Character activeEdgeFirstCharacter;
        private final int activeLength;

        public ActivePoint(final Node activeNode,
                           final Character activeEdgeFirstCharacter,
                           final int activeLength) {
            this.activeNode = activeNode;
            this.activeEdgeFirstCharacter = activeEdgeFirstCharacter;
            this.activeLength = activeLength;
        }

        private Edge getActiveEdge() {
            return this.activeNode.getEdges().get(this.activeEdgeFirstCharacter);
        }

        public boolean pointsToActiveNode() {
            return this.activeLength == 0;
        }

        public boolean activeNodeIs(final Node node) {
            return this.activeNode == node;
        }

        public boolean activeNodeHasEdgeStartingWith(final char character) {
            return this.activeNode.getEdges().containsKey(character);
        }

        public boolean activeNodeHasSlink() {
            return this.activeNode.getSlink() != null;
        }

        public boolean pointsToOnActiveEdge(final String word, final char character) {
            return word.charAt(this.getActiveEdge().getFrom() + this.activeLength) == character;
        }

        public boolean pointsToTheEndOfActiveEdge() {
            return this.getActiveEdge().getLength() == this.activeLength;
        }

        public boolean pointsAfterTheEndOfActiveEdge() {
            return this.getActiveEdge().getLength() < this.activeLength;
        }

        public ActivePoint moveToEdgeStartingWithAndByOne(final char character) {
            return new ActivePoint(this.activeNode, character, 1);
        }

        public ActivePoint moveToNextNodeOfActiveEdge() {
            return new ActivePoint(this.getActiveEdge().getNext(), null, 0);
        }

        public ActivePoint moveToSlink() {
            return new ActivePoint(this.activeNode.getSlink(),
                    this.activeEdgeFirstCharacter,
                    this.activeLength);
        }

        public ActivePoint moveTo(final Node node) {
            return new ActivePoint(node, this.activeEdgeFirstCharacter, this.activeLength);
        }

        public ActivePoint moveByOneCharacter() {
            return new ActivePoint(this.activeNode,
                    this.activeEdgeFirstCharacter,
                    this.activeLength + 1);
        }

        public ActivePoint moveToEdgeStartingWithAndByActiveLengthMinusOne(final Node node,
                                                                           final char character) {
            return new ActivePoint(node, character, this.activeLength - 1);
        }

        public ActivePoint moveToNextNodeOfActiveEdge(final String word, final int index) {
            return new ActivePoint(this.getActiveEdge().getNext(),
                    word.charAt(index - this.activeLength + this.getActiveEdge().getLength()),
                    this.activeLength - this.getActiveEdge().getLength());
        }

        public void addEdgeToActiveNode(final char character, final Edge edge) {
            this.activeNode.getEdges().put(character, edge);
        }

        public void splitActiveEdge(final String word,
                                    final Node nodeToAdd,
                                    final int index,
                                    final char character) {
            final Edge activeEdgeToSplit = this.getActiveEdge();
            final Edge splittedEdge = new Edge(activeEdgeToSplit.getFrom(),
                    activeEdgeToSplit.getFrom() + this.activeLength,
                    nodeToAdd);
            nodeToAdd.getEdges().put(word.charAt(activeEdgeToSplit.getFrom() + this.activeLength),
                    new Edge(activeEdgeToSplit.getFrom() + this.activeLength,
                            activeEdgeToSplit.getTo(),
                            activeEdgeToSplit.getNext()));
            nodeToAdd.getEdges().put(character, new Edge(index, word.length(), null));
            this.activeNode.getEdges().put(this.activeEdgeFirstCharacter, splittedEdge);
        }

        public Node setSlinkTo(final Node previouslyAddedNodeOrAddedEdgeNode,
                               final Node node) {
            if(previouslyAddedNodeOrAddedEdgeNode != null) {
                previouslyAddedNodeOrAddedEdgeNode.setSlink(node);
            }
            return node;
        }

        public Node setSlinkToActiveNode(final Node previouslyAddedNodeOrAddedEdgeNode) {
            return setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, this.activeNode);
        }
    }
    private static int idGenerator;
    private final String word;
    private final Node root;
    private ActivePoint activePoint;
    private int remainder;

    public ST(final String word) {
        this.word = word;
        this.root = new Node(idGenerator++);
        this.activePoint = new ActivePoint(this.root, null, 0);
        this.remainder = 0;
        build();
    }

    private void build() {
        for(int i = 0; i < this.word.length(); i++) {
            add(i, this.word.charAt(i));
        }
    }

    private void add(final int index, final char character) {
        this.remainder++;
        boolean characterFoundInTheTree = false;
        Node previouslyAddedNodeOrAddedEdgeNode = null;
        while(!characterFoundInTheTree && this.remainder > 0) {
            if(this.activePoint.pointsToActiveNode()) {
                if(this.activePoint.activeNodeHasEdgeStartingWith(character)) {
                    activeNodeHasEdgeStartingWithCharacter(character, previouslyAddedNodeOrAddedEdgeNode);
                    characterFoundInTheTree = true;
                }
                else {
                    if(this.activePoint.activeNodeIs(this.root)) {
                        rootNodeHasNotEdgeStartingWithCharacter(index, character);
                    }
                    else {
                        previouslyAddedNodeOrAddedEdgeNode = internalNodeHasNotEdgeStartingWithCharacter(index,
                                character, previouslyAddedNodeOrAddedEdgeNode);
                    }
                }
            }
            else {
                if(this.activePoint.pointsToOnActiveEdge(this.word, character)) {
                    activeEdgeHasCharacter();
                    characterFoundInTheTree = true;
                }
                else {
                    if(this.activePoint.activeNodeIs(this.root)) {
                        previouslyAddedNodeOrAddedEdgeNode = edgeFromRootNodeHasNotCharacter(index,
                                character,
                                previouslyAddedNodeOrAddedEdgeNode);
                    }
                    else {
                        previouslyAddedNodeOrAddedEdgeNode = edgeFromInternalNodeHasNotCharacter(index,
                                character,
                                previouslyAddedNodeOrAddedEdgeNode);
                    }
                }
            }
        }
    }
    private void activeNodeHasEdgeStartingWithCharacter(final char character,
                                                        final Node previouslyAddedNodeOrAddedEdgeNode) {
        this.activePoint.setSlinkToActiveNode(previouslyAddedNodeOrAddedEdgeNode);
        this.activePoint = this.activePoint.moveToEdgeStartingWithAndByOne(character);
        if(this.activePoint.pointsToTheEndOfActiveEdge()) {
            this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
        }
    }

    private void rootNodeHasNotEdgeStartingWithCharacter(final int index, final char character) {
        this.activePoint.addEdgeToActiveNode(character, new Edge(index, this.word.length(), null));
        this.activePoint = this.activePoint.moveTo(this.root);
        this.remainder--;
        assert this.remainder == 0;
    }

    private Node internalNodeHasNotEdgeStartingWithCharacter(final int index,
                                                             final char character,
                                                             Node previouslyAddedNodeOrAddedEdgeNode) {
        this.activePoint.addEdgeToActiveNode(character, new Edge(index, this.word.length(), null));
        previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkToActiveNode(previouslyAddedNodeOrAddedEdgeNode);
        if(this.activePoint.activeNodeHasSlink()) {
            this.activePoint = this.activePoint.moveToSlink();
        }
        else {
            this.activePoint = this.activePoint.moveTo(this.root);
        }
        this.remainder--;
        return previouslyAddedNodeOrAddedEdgeNode;
    }

    private void activeEdgeHasCharacter() {
        this.activePoint = this.activePoint.moveByOneCharacter();
        if(this.activePoint.pointsToTheEndOfActiveEdge()) {
            this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
        }
    }

    private Node edgeFromRootNodeHasNotCharacter(final int index,
                                                 final char character,
                                                 Node previouslyAddedNodeOrAddedEdgeNode) {
        final Node newNode = new Node(idGenerator++);
        this.activePoint.splitActiveEdge(this.word, newNode, index, character);
        previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, newNode);
        this.activePoint = this.activePoint.moveToEdgeStartingWithAndByActiveLengthMinusOne(this.root,
                this.word.charAt(index - this.remainder + 2));
        this.activePoint = walkDown(index);
        this.remainder--;
        return previouslyAddedNodeOrAddedEdgeNode;
    }

    private Node edgeFromInternalNodeHasNotCharacter(final int index,
                                                     final char character,
                                                     Node previouslyAddedNodeOrAddedEdgeNode) {
        final Node newNode = new Node(idGenerator++);
        this.activePoint.splitActiveEdge(this.word, newNode, index, character);
        previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, newNode);
        if(this.activePoint.activeNodeHasSlink()) {
            this.activePoint = this.activePoint.moveToSlink();
        }
        else {
            this.activePoint = this.activePoint.moveTo(this.root);
        }
        this.activePoint = walkDown(index);
        this.remainder--;
        return previouslyAddedNodeOrAddedEdgeNode;
    }

    private ActivePoint walkDown(final int index) {
        while(!this.activePoint.pointsToActiveNode()
                && (this.activePoint.pointsToTheEndOfActiveEdge() || this.activePoint.pointsAfterTheEndOfActiveEdge())) {
            if(this.activePoint.pointsAfterTheEndOfActiveEdge()) {
                this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge(this.word, index);
            }
            else {
                this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
            }
        }
        return this.activePoint;
    }
    public String toString(final String word) {
        return this.root.toString(word);
    }

    public boolean contains(final String suffix) {
        return this.root.contains(this.word, suffix);
    }

    public static void main(final String[] args) {
        final String[] words = {
                "abcabcabc$",
                "abc$",
                "abcabxabcd$",
                "abcabxabda$",
                "abcabxad$",
                "aabaaabb$",
                "aababcabcd$",
                "ababcabcd$",
                "abccba$",
                "mississipi$",
                "abacabadabacabae$",
                "abcabcd$",
                "00132220$"
        };
        Arrays.stream(words).forEach(word -> {
            System.out.println("Building suffix tree for word: " + word);
            final ST suffixTree = new ST(word);
            System.out.println("Suffix tree: " + suffixTree.toString(word));
            for(int i = 0; i < word.length() - 1; i++) {
                assert suffixTree.contains(word.substring(i)) : word.substring(i);
            }
        });
    }
}
Answer #5
Thanks to the well explained tutorial by @jogojapan, I implemented the algorithm in Python.
A couple of minor problems mentioned by @jogojapan turned out to be more sophisticated than I had expected, and need to be treated very carefully. It cost me several days to get my implementation robust enough (I suppose). Problems and solutions are listed below:
1. End with Remainder > 0: It turns out this situation can also happen during the unfolding step, not just at the end of the entire algorithm. When that happens, we can leave the remainder, actnode, actedge, and actlength unchanged, end the current unfolding step, and start another step by either continuing to fold or unfolding, depending on whether the next char in the original string is on the current path or not.
2. Leap Over Nodes: When we follow a suffix link and update the active point, we may find that its active_length component does not work well with the new active_node. We then have to move forward to the right place to split, or insert a leaf. This process might not be that straightforward, because during the move the actlength and actedge keep changing all the way; when you have to move back to the root node, the actedge and actlength could be wrong because of those moves. We need additional variable(s) to keep that information.
The other two problems have somehow been pointed out by @makagonov:
3. Split Could Degenerate: When trying to split an edge, sometimes you'll find the split operation lands right on a node. In that case we only need to add a new leaf to that node and treat it as a standard edge split operation, which means the suffix links, if there are any, should be maintained correspondingly.
4. Hidden Suffix Links: There is another special case which is incurred by problem 1 and problem 2. Sometimes we need to hop over several nodes to the right point for a split, and we might surpass the right point if we move by comparing the remainder string and the path labels. In that case the suffix link, if there should be any, will be neglected unintentionally. This can be avoided by remembering the right point when moving forward. The suffix link should be maintained if the split node already exists, or even if problem 1 happens during an unfolding step.
Finally, my implementation in Python is as follows:
Tips: The code above includes a naive tree printing function, which is very important while debugging. It saved me a lot of time and is convenient for locating special cases.
Answer #6
Apologies if my answer seems redundant, but I implemented Ukkonen's algorithm recently, and found myself struggling with it for days; I had to read through multiple papers on the subject to understand the why and how of some core aspects of the algorithm.
I found the 'rules' approach of previous answers unhelpful for understanding the underlying reasons, so I've written everything below focusing solely on the pragmatics. If you've struggled with following other explanations, just like I did, perhaps my supplemental explanation will make it 'click' for you.
I published my C# implementation here: https://github.com/baratgabor/SuffixTree
Please note that I'm not an expert on this subject, so the following sections may contain inaccuracies (or worse). If you encounter any, feel free to edit.
Prerequisites 先决条件
The starting point of the following explanation assumes you're familiar with the content and use of suffix trees, and the characteristics of Ukkonen's algorithm, eg how you're extending the suffix tree character by character, from start to end. 以下说明的起点假定您熟悉后缀树的内容和用法,以及Ukkonen算法的特征,例如,如何从头到尾逐个字符地扩展后缀树。 Basically, I assume you've read some of the other explanations already. 基本上,我假设您已经阅读了其他一些说明。
(However, I did have to add some basic narrative for the flow, so the beginning might indeed feel redundant.)
The most interesting part is the explanation of the difference between using suffix links and rescanning from the root. This is what gave me a lot of bugs and headaches in my implementation.
Open-ended leaf nodes and their limitations
I'm sure you already know that the most fundamental 'trick' is to realize we can just leave the end of the suffixes 'open', i.e. referencing the current length of the string instead of setting the end to a static value. This way, when we add additional characters, those characters are implicitly added to all suffix labels, without having to visit and update all of them.
But this open ending of suffixes works, for obvious reasons, only for nodes that represent the end of the string, i.e. the leaf nodes in the tree structure. The branching operations we execute on the tree (the addition of new branch nodes and leaf nodes) won't propagate automatically everywhere they need to.
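The open-ended leaf idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `End` and `Edge` classes and the `global_end` variable are hypothetical names, and the string `ABCX` is chosen because it has no repeats, so every character simply adds one new leaf.

```python
class End:
    """Mutable 'end of text' marker shared by every leaf edge (hypothetical helper)."""
    def __init__(self):
        self.value = 0


class Edge:
    """Edge label stored as indices [start, end) into the text, not as a copied string."""
    def __init__(self, start, end):
        self.start = start
        self.end = end  # an int for internal edges, a shared End for leaf edges

    def label(self, text):
        stop = self.end.value if isinstance(self.end, End) else self.end
        return text[self.start:stop]


text = "ABCX"
global_end = End()
leaves = []
for i, _ in enumerate(text):
    leaves.append(Edge(i, global_end))  # new leaf for the suffix starting at i
    global_end.value = i + 1            # a single update extends every existing leaf

labels = [leaf.label(text) for leaf in leaves]
print(labels)
```

Note that nothing in this sketch helps with branching: splitting an edge still has to be done explicitly for every suffix that needs it.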
It's probably elementary, and wouldn't require mention, that repeated substrings don't appear explicitly in the tree, since the tree already contains these by virtue of them being repetitions; however, when the repetitive substring ends by encountering a non-repeating character, we need to create a branching at that point to represent the divergence from that point onwards.
For example, in the case of the string 'ABCXABCY' (see below), a branching to X and Y needs to be added to three different suffixes, ABC, BC and C; otherwise it wouldn't be a valid suffix tree, and we couldn't find all substrings of the string by matching characters from the root downwards.
Once again, to emphasize: any operation we execute on a suffix in the tree needs to be reflected by its consecutive suffixes as well (e.g. ABC > BC > C), otherwise they simply cease to be valid suffixes.
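To see why all three suffixes need the same branching, here is a small brute-force check in Python. It does not build a tree at all; the helper `next_chars` is a hypothetical function that just lists which characters follow a given substring in the text, which is exactly the set of branches the tree must offer after that substring.

```python
def next_chars(text, prefix):
    """Return the set of characters that follow `prefix` anywhere in `text`."""
    found = set()
    start = text.find(prefix)
    while start != -1:
        nxt = start + len(prefix)
        if nxt < len(text):
            found.add(text[nxt])
        start = text.find(prefix, start + 1)
    return found


text = "ABCXABCY"
branches = {p: sorted(next_chars(text, p)) for p in ("ABC", "BC", "C")}
print(branches)
```

Each of ABC, BC and C is followed by both X and Y, so each of them needs its own branch node.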
But even if we accept that we have to do these manual updates, how do we know how many suffixes need to be updated? When we add the repeated character A (and the rest of the repeated characters in succession), we have no idea yet when or where we need to split the suffix into two branches. The need to split is ascertained only when we encounter the first non-repeating character, in this case Y (instead of the X that already exists in the tree).
What we can do is match the longest repeated string we can, and count how many of its suffixes we need to update later. This is what 'remainder' stands for.
The concept of 'remainder' and 'rescanning'
The variable remainder tells us how many repeated characters we added implicitly, without branching; i.e. how many suffixes we need to visit to repeat the branching operation once we have found the first character that we cannot match. This essentially equals how many characters 'deep' we are in the tree from its root.
So, staying with the previous example of the string ABCXABCY, we match the repeated ABC part 'implicitly', incrementing remainder each time, which results in a remainder of 3. Then we encounter the non-repeating character 'Y'. Here we split the previously added ABCX into ABC->X and ABC->Y. Then we decrement remainder from 3 to 2, because we already took care of the ABC branching. Now we repeat the operation by matching the last 2 characters, BC, from the root to reach the point where we need to split, and we split BCX too into BC->X and BC->Y. Again, we decrement remainder to 1, and repeat the operation, until the remainder is 0. Lastly, we need to add the current character (Y) itself to the root as well.
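The way the remainder climbs during the repeated ABC part can be reproduced with a brute-force sketch. This is not how Ukkonen's algorithm computes it (the algorithm maintains the counter incrementally); the function `repeated_suffix_length` is a hypothetical helper that, for each prefix of the text, measures the longest suffix that has already occurred earlier in that prefix.

```python
def repeated_suffix_length(text, i):
    """Length of the longest suffix of text[:i+1] that also starts earlier in that prefix."""
    prefix_len = i + 1
    for k in range(prefix_len, 0, -1):
        start = prefix_len - k
        # First occurrence of the candidate suffix inside the prefix.
        first = text.find(text[start:prefix_len], 0, prefix_len)
        if first < start:
            return k  # the suffix is a repetition of something seen before
    return 0


text = "ABCXABCY"
trace = [repeated_suffix_length(text, i) for i in range(len(text))]
print(trace)
```

The trace grows to 3 while the second ABC is being read and falls back to 0 at Y, which is the moment the three pending suffixes ABC, BC and C all have to be branched.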
This operation, following the consecutive suffixes from the root simply to reach the point where we need to do an operation, is what's called 'rescanning' in Ukkonen's algorithm, and typically this is the most expensive part of the algorithm. Imagine a longer string where you need to 'rescan' long substrings, across many dozens of nodes (we'll discuss this later), potentially thousands of times.
As a solution, we introduce what we call 'suffix links'.
The concept of 'suffix links'
Suffix links basically point to the positions we'd normally have to 'rescan' to, so instead of the expensive rescan operation we can simply jump to the linked position, do our work, jump to the next linked position, and repeat, until there are no more positions to update.
Of course, one big question is how to add these links. The existing answer is that we can add the links when we insert new branch nodes, utilizing the fact that, in each extension of the tree, the branch nodes are naturally created one after another in the exact order we'd need to link them together. The link has to go from the branch node created earlier (the longer suffix) to the one created right after it, so we need to cache the last node we create, link it to the next one we create, and then cache the newly created one.
One consequence is that we often don't actually have suffix links to follow, because the given branch node was just created. In these cases we still have to fall back to the aforementioned 'rescanning' from root. This is why, after an insertion, you're instructed to either use the suffix link or jump to root.
(Or alternatively, if you're storing parent pointers in the nodes, you can try to follow the parents, check if they have a link, and use that. I found that this is very rarely mentioned, but the suffix link usage is not set in stone. There are multiple possible approaches, and if you understand the underlying mechanism you can implement the one that fits your needs best.)
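The 'cache the last created node and link it to the next one' bookkeeping can be sketched as follows. This is a minimal illustration, not a full implementation: the `Node` class and the function `link_in_creation_order` are hypothetical, and the three nodes stand for the branch nodes created for ABC, BC and C within a single extension.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.suffix_link = None


def link_in_creation_order(new_nodes):
    """Link each freshly created branch node to the one created right after it."""
    last_created = None
    for node in new_nodes:
        if last_created is not None:
            last_created.suffix_link = node  # longer suffix points to the next shorter one
        last_created = node                  # cache the newly created node
    return last_created


abc, bc, c = Node("ABC"), Node("BC"), Node("C")
tail = link_in_creation_order([abc, bc, c])
print(abc.suffix_link.name, bc.suffix_link.name, tail.suffix_link)
```

The most recently created node still has no link at this point, which is exactly the situation where we have to fall back to rescanning from root.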
The concept of 'active point'
So far we have discussed multiple efficient tools for building the tree, and vaguely referred to traversing over multiple edges and nodes, but haven't yet explored the corresponding consequences and complexities.
The previously explained concept of 'remainder' is useful for keeping track of where we are in the tree, but we have to realize it doesn't store enough information.
Firstly, we always reside on a specific edge of a node, so we need to store the edge information. We shall call this the 'active edge'.
Secondly, even after adding the edge information, we still have no way to identify a position that is farther down in the tree, and not directly connected to the root node. So we need to store the node as well. Let's call this the 'active node'.
Lastly, we can notice that the 'remainder' is inadequate to identify a position on an edge that is not directly connected to root, because 'remainder' is the length of the entire route, and we probably don't want to bother with remembering and subtracting the length of the previous edges. So we need a representation that is essentially the remainder on the current edge. This is what we call the 'active length'.
This leads to what we call the 'active point': a package of three variables that contain all the information we need to maintain about our position in the tree:
Active Point = (Active Node, Active Edge, Active Length)
You can observe on the following image how the matched route of ABCABD consists of 2 characters on the edge AB (from root), plus 4 characters on the edge CABDABCABD (from node 4), resulting in a 'remainder' of 6 characters. So, our current position can be identified as Active Node 4, Active Edge C, Active Length 4.
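As a rough sketch, the active point can be modelled as a small record. The `ActivePoint` dataclass and its field names below are my own assumptions, not part of any particular implementation; nodes are referred to by plain integer ids only to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class ActivePoint:
    node: int    # id of the active node
    edge: str    # first character of the active edge
    length: int  # number of matched characters on the active edge


# Position from the ABCABD example: 2 characters on the edge AB above node 4,
# then 4 characters on the edge that starts with C below node 4.
edge_lengths_above = [2]
point = ActivePoint(node=4, edge="C", length=4)
remainder = sum(edge_lengths_above) + point.length
print(point, remainder)
```

The point of keeping `length` relative to the active edge is visible here: the total depth of 6 is only recovered by adding the lengths of the edges above the active node, which the active point itself does not need to remember.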
Another important role of the 'active point' is that it provides an abstraction layer for our algorithm, meaning that parts of our algorithm can do their work on the 'active point', irrespective of whether that active point is in the root or anywhere else. This makes it easy to implement the use of suffix links in our algorithm in a clean and straightforward way.
Differences of rescanning vs using suffix links
Now, the tricky part, something that in my experience can cause plenty of bugs and headaches and is poorly explained in most sources, is the difference in processing the suffix link cases vs the rescan cases.
Consider the following example of the string 'AAAABAAAABAAC':
You can observe above how the 'remainder' of 7 corresponds to the total sum of characters from root, while the 'active length' of 4 corresponds to the sum of matched characters from the active edge of the active node.
Now, after executing a branching operation at the active point, our active node might or might not contain a suffix link.
If a suffix link is present: We only need to process the 'active length' portion. The 'remainder' is irrelevant, because the node we jump to via the suffix link already encodes the correct 'remainder' implicitly, simply by virtue of being where it is in the tree.
If a suffix link is NOT present: We need to 'rescan' from zero/root, which means processing the whole suffix from the beginning. To this end we have to use the whole 'remainder' as the basis of rescanning.
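The two cases above boil down to a choice of starting node and of how many characters still have to be walked. A hedged sketch, with a hypothetical `Node` class and `next_start` function, and assuming `remainder` has already been decremented for the suffix that was just handled:

```python
class Node:
    def __init__(self, name, suffix_link=None):
        self.name = name
        self.suffix_link = suffix_link


def next_start(active_node, root, active_length, remainder):
    """Return (start_node, chars_to_consume) for the next suffix to process."""
    if active_node.suffix_link is not None:
        # The linked node already accounts for the characters above it.
        return active_node.suffix_link, active_length
    # No link: start over from root and walk the whole pending suffix.
    return root, remainder


root = Node("root")
linked_target = Node("target")
with_link = Node("with link", suffix_link=linked_target)
without_link = Node("without link")

print(next_start(with_link, root, 4, 6)[1], next_start(without_link, root, 4, 6)[1])
```

Either way, the returned count is then consumed by the same edge-hopping walk; only the starting node and the count differ.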
Example comparison of processing with and without a suffix link
Consider what happens at the next step of the example above. Let's compare how to achieve the same result, i.e. moving to the next suffix to process, with and without a suffix link.
Using 'suffix link'
Notice that if we use a suffix link, we are automatically 'at the right place', although this is often not strictly true, because the 'active length' can be 'incompatible' with the new position.
In the case above, since the 'active length' is 4, we're working with the suffix 'ABAA', starting at the linked Node 4. But after finding the edge that corresponds to the first character of the suffix ('A'), we notice that our 'active length' overflows this edge by 3 characters. So we jump over the full edge to the next node, and decrement 'active length' by the characters we consumed with the jump.
Then, after we have found the next edge 'B', corresponding to the decremented suffix 'BAA', we finally note that the edge length is larger than the remaining 'active length' of 3, which means we have found the right place.
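The hop described above can be sketched as a small loop. Everything here is an assumption made for illustration: the `Node` class, the `walk_down` function, and the hand-built two-edge fragment, which only mimics the shape of the step above and is not the full tree of the example string. Real implementations store edge labels as index ranges rather than strings.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character of edge label -> (label, child node)


def walk_down(node, pending):
    """Consume `pending` by hopping over whole edges; only first characters are looked up."""
    while pending:
        label, child = node.children[pending[0]]
        if len(pending) < len(label):
            return node, pending[0], len(pending)  # position lies inside this edge
        pending = pending[len(label):]             # hop over the full edge
        node = child
    return node, None, 0                           # landed exactly on a node


# Hypothetical fragment: an edge 'A' of length 1, then an edge starting with 'B' of length 4.
linked = Node()
below = Node()
leaf = Node()
linked.children["A"] = ("A", below)
below.children["B"] = ("BAAC", leaf)

position = walk_down(linked, "ABAA")
print(position[1], position[2])
```

Starting with 4 pending characters, the walk hops over the one-character edge, then stops inside the 'B' edge with an active length of 3. Called with root and the whole pending suffix instead, the same function performs a rescan.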
Please note that this operation is usually not referred to as 'rescanning', even though to me it seems to be the direct equivalent of rescanning, just with a shortened length and a non-root starting point.
Using 'rescan'
Notice that if we use a traditional 'rescan' operation (here pretending we didn't have a suffix link), we start at the top of the tree, at root, and we have to work our way down again to the right place, following along the entire length of the current suffix.
The length of this suffix is the 'remainder' we discussed before. We have to consume the entirety of this remainder, until it reaches zero. This might (and often does) include jumping through multiple nodes, at each jump decreasing the remainder by the length of the edge we jumped through. Then finally, we reach an edge that is longer than our remaining 'remainder'; here we set the active edge to the given edge, set 'active length' to the remaining 'remainder', and we're done.
Note, however, that the actual 'remainder' variable needs to be preserved, and only decremented after each node insertion. So what I described above assumes using a separate variable initialized to 'remainder'.
Notes on suffix links & rescans
1) Notice that both methods lead to the same result. Suffix link jumping is, however, significantly faster in most cases; that's the whole rationale behind suffix links.
2) The actual algorithmic implementations don't need to differ. As I mentioned above, even in the case of using the suffix link, the 'active length' is often not compatible with the linked position, since that branch of the tree might contain additional branching. So essentially you just have to use 'active length' instead of 'remainder', and execute the same rescanning logic until you find an edge that is longer than your remaining suffix length.
3) One important remark pertaining to performance is that there is no need to check each and every character during rescanning. Due to the way a valid suffix tree is built, we can safely assume that the characters match. So you're mostly counting the lengths, and the only need for character equivalence checking arises when we jump to a new edge, since edges are identified by their first character (which is always unique in the context of a given node). This means that 'rescanning' logic is different from full string matching logic (i.e. searching for a substring in the tree).
4) The original suffix linking described here is just one of the possible approaches. For example, NJ Larsson et al. call this approach Node-Oriented Top-Down, and compare it to Node-Oriented Bottom-Up and two Edge-Oriented varieties. The different approaches have different typical and worst-case performances, requirements, limitations, etc., but it generally seems that Edge-Oriented approaches are an overall improvement on the original.