Building a Suffix Tree


License: CC 4.0 BY-SA


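Suffix tree construction algorithms are fairly complicated and hard to follow, so here is an approach I found easier to understand. We build a suffix tree step by step, using the string abcabxabcd as the example.

In the figures, the gray circle is the root; a red node is a terminal, and the node directly above it is a leaf; uncolored nodes are ordinary nodes. Each path from the root to a leaf spells a suffix of the string inserted so far.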

1. Basic method

_______________________________________________________________________________

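Insert: a

Principles used:

Principle 1: for every inserted character, create a new (root -> node -> terminal) structure.

Inserting a creates the following structure:

Because the suffix tree was empty before, this structure simply becomes the initial suffix tree.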

_______________________________________________________________________________

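Insert: b

Principles used:

Principle 2: append the inserted character to the string value of the parent of every terminal node. If that parent has other branches, instead create a new node between the terminal and the parent and set its string to the inserted character. (In the figure, the node on the left branch had the value a and now has ab.)

Principle 1: for every inserted character, create a new (root -> node -> terminal) structure. (This is the right branch in the figure.)

One more definition: if the parent of a terminal node has no other branches, that parent is a leaf node. At this point, ab and b are both leaf nodes.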
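As a rough illustration of Principles 1 and 2 only, the sketch below tracks the leaf labels as a flat list of strings. The helper names `insertChar` and `leavesAfter` are hypothetical and not part of the implementation given later, and the sketch performs no merging, so it matches the tree only while no character repeats (the first three characters of abcabxabcd).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: applies Principle 2, then Principle 1, to a flat list
// of leaf labels. No merging is done, so this is only a model of the tree
// while every inserted character is new.
std::vector<std::string> insertChar(std::vector<std::string> leaves, char c) {
    for (std::string& label : leaves) {
        label.push_back(c);                   // Principle 2: extend every leaf
    }
    leaves.push_back(std::string(1, c));      // Principle 1: new root -> node -> terminal branch
    return leaves;
}

// Feeds the first `count` characters of `text` through insertChar.
std::vector<std::string> leavesAfter(const std::string& text, std::size_t count) {
    std::vector<std::string> leaves;
    for (std::size_t i = 0; i < count && i < text.size(); ++i) {
        leaves = insertChar(leaves, text[i]);
    }
    return leaves;
}
```

Calling `leavesAfter("abcabxabcd", 2)` yields the two leaves ab and b described above.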

_______________________________________________________________________________

Insert: c

Principles used:

Principle 2: as described above.

Principle 1: as described above.

_______________________________________________________________________________

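Insert: a

Principles used:

Principle 2: as described above.

Principle 1: as described above.

Principle 3: each time a character is inserted, take the newly created (root -> node -> terminal) structure, if any, and look among the root's direct children for the node whose string value starts with that character (there can be at most one). If it exists, merge the two. The left figure shows that the new a node can be merged with the root's child abca; the right figure shows the result, which is the suffix tree after this insertion.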

_______________________________________________________________________________

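Insert: b

Principles used:

Principle 2: as described above. A new node with string value b is created between the a node on the left and its terminal node.

Principle 1: as described above.

Principle 3: as described above.

Principle 4: when inserting a character, check whether each node created from scratch (by Principle 1 or Principle 2) can be merged with one of its siblings, using the same merge logic as Principle 3. Merge if possible (at most one sibling can qualify); otherwise skip it.

In the figure below, the blue b is a node created from scratch; it can be merged with its sibling bcab.

Principle 3 is in fact a special case of Principle 4: every newly created node must be checked for merging.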
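The merge in Principles 3 and 4 can be pictured on plain strings. A node created from scratch always holds just the inserted character, so merging it with a sibling that starts with the same character amounts to peeling that first character off the sibling. The function `mergeWithSibling` below is a hypothetical sketch of that split, not the code from the implementation at the end of the article.

```cpp
#include <string>
#include <utility>

// Hypothetical helper. Returns (shared part, remainder of the sibling label).
// The shared part is the single inserted character when the sibling starts
// with it, and empty when no merge is possible. An empty remainder means the
// sibling is absorbed completely by the merged node.
std::pair<std::string, std::string> mergeWithSibling(char inserted,
                                                     const std::string& sibling) {
    if (sibling.empty() || sibling[0] != inserted) {
        return std::make_pair(std::string(), sibling);   // nothing to merge
    }
    return std::make_pair(sibling.substr(0, 1), sibling.substr(1));
}
```

For the blue b and its sibling bcab, `mergeWithSibling('b', "bcab")` gives the shared node b with cab left underneath it.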

___________________________________________________________________________________

Insert: x

Principles used:

Principle 2: as described above.

Principle 1: as described above.

___________________________________________________________________________________

Insert: a

Principles used:

Principle 2: as described above.

Principle 1: as described above.

Principle 3: as described above.

___________________________________________________________________________________

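Insert: b

Principles used:

Principle 2: as described above.

Principle 1: as described above.

Principle 3: as described above.

Principle 4: as described above.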

___________________________________________________________________________________

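Insert: c

Principles used:

Principle 2: as described above.

Principle 1: as described above.

Principle 3: as described above.

Principle 4: as described above.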

___________________________________________________________________________________

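Insert: d

Principles used:

Principle 2: as described above.

Principle 1: as described above.

The suffix tree is now complete. The logic is quite simple: Principles 1, 2, and 4 alone are enough to build a correct suffix tree for any string.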
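One way to sanity-check a finished tree is to compare it against a brute-force reference. The sketch below is not the method described in this article: it inserts every suffix character by character into an uncompressed trie, which is slow and memory-hungry but easy to trust. The names `TrieNode` and `NaiveSuffixTrie` are made up for this example.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Uncompressed suffix trie used only as a reference: one node per character.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> next;
};

class NaiveSuffixTrie {
public:
    explicit NaiveSuffixTrie(const std::string& text) : root_(new TrieNode()) {
        // Insert every suffix text[start..] starting from the root.
        for (std::size_t start = 0; start < text.size(); ++start) {
            TrieNode* node = root_.get();
            for (std::size_t i = start; i < text.size(); ++i) {
                std::unique_ptr<TrieNode>& child = node->next[text[i]];
                if (!child) child.reset(new TrieNode());
                node = child.get();
            }
        }
    }

    // True when `pattern` occurs somewhere in the text, i.e. it is a prefix
    // of some suffix.
    bool contains(const std::string& pattern) const {
        const TrieNode* node = root_.get();
        for (char c : pattern) {
            auto it = node->next.find(c);
            if (it == node->next.end()) return false;
            node = it->second.get();
        }
        return true;
    }

private:
    std::unique_ptr<TrieNode> root_;
};
```

Every substring of abcabxabcd, such as abx or xabcd, should be found by `contains`, and any string that is not a substring, such as abd, should not.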

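Now let us analyze the time complexity.

First, traversing the string takes O(n).

Next, consider Principles 1, 2, and 4 in turn.

Principle 1: for every inserted character, create a new (root -> node -> terminal) structure.

This takes constant time per character.

Principle 2: append the inserted character to the string value of the parent of every terminal node.

At first glance, over the whole construction this costs 1 + 2 + 3 + ... + (n - 1) = O(n^2), because inserting the k-th character updates k - 1 leaf nodes.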

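An optimization may come to mind: represent each leaf's string by a single integer x, the index in the original string where the leaf's string starts. That removes the need to append a character to every leaf, because at any moment the integer alone determines the leaf's string. But what about non-leaf nodes? For those we store two integers x and y, the start and end indices of the node's string in the original string. The two cases can be unified: every node stores two integers x and y. If y = -1, the node's string is the substring from x to the end of the characters inserted so far; if y != -1, it is the substring from x to y. This also saves the space that copying strings would take. A leaf never needs its y updated, because it is always -1, so the cost of changing strings can be ignored.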
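A minimal sketch of this index-pair representation follows. The type `Label`, the constant `kToEnd`, and the function `labelText` are names assumed for this example; the implementation at the end of the article stores the same two integers as `__begin` and `__end`.

```cpp
#include <string>

// Assumed sentinel: y == kToEnd means "up to the last character inserted so far".
const int kToEnd = -1;

// Hypothetical label type: x and y are inclusive indices into the source text.
struct Label {
    int x;
    int y;
};

// Materializes the label's text once characters 0..curPos have been inserted.
// Leaves keep y == kToEnd, so they grow with curPos without being touched.
std::string labelText(const std::string& text, const Label& label, int curPos) {
    int last = (label.y == kToEnd) ? curPos : label.y;
    return text.substr(label.x, last - label.x + 1);
}
```

With the text abcabxabcd, a leaf `{0, kToEnd}` reads as abc when `curPos` is 2 and as abcabx when `curPos` is 5, while an inner node `{1, 2}` always reads as bc.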

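We need a structure that records, after each character is absorbed, all nodes created from scratch. The cost of Principle 4 hinges on how many such nodes a single insertion can create. Unfortunately, in the worst case there are n of them, where n is the number of characters inserted so far. Why?

First, a definition: a node at which a merge may occur is called a hasEmptySon node.

For these two suffix trees, absorbing one more character a triggers a merge: the two green nodes become aa, a new a node appears from scratch on the sibling branch to their left, and a merge follows. The purple nodes in the figure above are hasEmptySon nodes, literally nodes that have an empty child. In the figure, the left and middle branches of a hasEmptySon node are empty, with no node on them. Clearly, the parent of a node created from scratch is a hasEmptySon node. The conclusion: each time the suffix tree absorbs a character, it is enough to check whether a hasEmptySon node has a child whose string starts with that character to know whether a merge is needed.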

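The figure below shows the two trees from absorbing a through to the merge:

The figure below shows the two suffix trees after merging:

In the actual algorithm, we record all hasEmptySon nodes produced by each absorbed character and use them for merging when the next character is absorbed.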

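Clearly, the number of hasEmptySon nodes cannot exceed the number of existing nodes. To establish the worst case, it suffices to show an input for which almost every node is a hasEmptySon node. Take the string "bbbbbbb"; the suffix trees drawn step by step are:

As the figures show, after each absorbed character every node except the last leaf and the root is a hasEmptySon node.

That is, in the worst case there are n - 1 hasEmptySon nodes after each absorbed character.

The core of the algorithm is this: for each absorbed character, compare it against the children of every hasEmptySon node to see whether some child's first character matches. If one matches, merge; otherwise create a new node as a child of the hasEmptySon node. This step is at best linear in the number of hasEmptySon nodes, which requires a constant-time child lookup, for example a hash map from each child's first character to the child's address.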
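The hashed child lookup mentioned here could look like the sketch below. It assumes `std::unordered_map`, whose lookups take constant time on average; the names `Node`, `addChild`, and `findChildStartingWith` are hypothetical. Note that the implementation at the end of the article does not do this: its `findSonStartWith` scans a vector of children linearly.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical node: children are indexed by the first character of their label.
struct Node {
    int begin;                                        // index of the label's first character
    std::unordered_map<char, Node*> childByFirstChar; // first character -> child address
};

// Registers `child` under `parent`, keyed by the child's first label character.
void addChild(Node& parent, Node& child, const std::string& text) {
    parent.childByFirstChar[text[child.begin]] = &child;
}

// Returns the child whose label starts with `c`, or nullptr if there is none.
Node* findChildStartingWith(const Node& parent, char c) {
    auto it = parent.childByFirstChar.find(c);
    return it == parent.childByFirstChar.end() ? nullptr : it->second;
}
```

A non-null result means the absorbed character triggers a merge at that node; `nullptr` means a new child has to be created.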

Therefore, the total time is 0 + 1 + 2 + ... + (n - 1) = O(n^2).
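Written out as a sum, under the worst-case assumption above that absorbing the k-th character checks k - 1 hasEmptySon nodes with constant-time lookups:

```latex
\sum_{k=1}^{n} (k - 1) = 0 + 1 + 2 + \dots + (n - 1) = \frac{n(n-1)}{2} = O(n^2)
```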

So the algorithm runs in quadratic time in the worst case.

Finally, here is my implementation.

#ifndef __SUFFIX_TREE_IMPL
#define __SUFFIX_TREE_IMPL
#include <cstdio>
#include <cstring>
#include <set>
#include <vector>

using std::vector;
using std::set;



class SuffixTreeImpl;
class TreeNode
{
private:
	friend class SuffixTreeImpl;
	static const int ROOT_BEGIN = -1;
	static const int TO_END = -1;
	static const int FIND_NO_SON_START_WITH_CH = -1;
	const char *__string;
	int __begin,__end;		// inclusive begin and end indices into __string
	TreeNode *__parent;
	vector<TreeNode*> __sons;
	bool __hasCurCRoundRemainEmptySon;
	bool __dealing;
	bool __needToDelete;
public:
	TreeNode(const char *str,int begin);
	
	bool isRootNode();

	~TreeNode();

	char firstCh();

	TreeNode* parent();
	
	void addSon(TreeNode *node);

	int sonCount();

	void setParent(TreeNode *parent);

	void setBegin(int beginPos);

	int getBegin();

	int getEnd();

	void setEnd(int endPos);

	void print(int spaceCount);

	int findSonStartWith(char ch);

	void combineWithSonsStartWith(int curPos,set<TreeNode*>& nextHasEmptySonNodes);

	void upMoveCombine(TreeNode* combineNode,TreeNode* targetNode,int curPos,set<TreeNode*>& nextHasSonEmptyNodes);

	int stringLength(int curPos);

	void addSons(vector<TreeNode*>* node );

	vector<TreeNode*>* getSons();

	void removeSon(TreeNode* node);

	void removeAllSons();
	
	void hasCurRoundRemainEmptySonNode(bool hasCurCRoundRemainEmptySon);
	
	bool hasCurRoundRemainEmptySonNode();

	bool markDelete();
	void markDelete(bool delete_);

	bool atDealing();
	void atDealing(bool dealing);

};

class SuffixTreeImpl
{
private:
	const char *__string;
	TreeNode *__suffixTree;
	set<TreeNode*> __nodesHasEmptySon;
	
public:
	SuffixTreeImpl(const char *str);
	virtual ~SuffixTreeImpl();
	void create();
	void print();
private:
	void addChar(int curPos);
	
};



#endif
#include "SuffixTreeImpl.h"

SuffixTreeImpl::SuffixTreeImpl(const char *str)
{
	__string = str;
}

SuffixTreeImpl::~SuffixTreeImpl()
{
	
}


void SuffixTreeImpl::create()
{
	__suffixTree = new TreeNode(__string,TreeNode::ROOT_BEGIN);
	__suffixTree->hasCurRoundRemainEmptySonNode(true);	// the root is always a hasEmptySon node
	__suffixTree->atDealing(true);
	__nodesHasEmptySon.insert(__nodesHasEmptySon.begin(),__suffixTree);
	int nlength = strlen(__string);
	for (int i = 0;i < nlength;++ i)
	{
		addChar(i);
	}
}

void SuffixTreeImpl:: addChar(int curPos)
{
	set<TreeNode*> newEmptyNodes;
	TreeNode *curNode;
	set<TreeNode*>::iterator it = __nodesHasEmptySon.begin();
	for (;it != __nodesHasEmptySon.end();)
	{
		curNode = (TreeNode*)(*it);
		if(curNode->markDelete())	// the recursion in combineWithSonsStartWith marked this pending node for deletion, so delete it now
		{
			delete curNode;
			it = __nodesHasEmptySon.erase(it);
		}
		else
		{
			if(curNode->hasCurRoundRemainEmptySonNode())
			{
				curNode->combineWithSonsStartWith(curPos,newEmptyNodes);
				if(!curNode->isRootNode())
				{
					it = __nodesHasEmptySon.erase(it);
				}
				else
				{
					curNode->hasCurRoundRemainEmptySonNode(true);	// the root is a hasEmptySon node again in the next round
					++ it;
				}
			}
			else
			{
				it = __nodesHasEmptySon.erase(it);
			}
			if(!curNode->isRootNode())curNode->atDealing(false);
		}
		
	}
	it = newEmptyNodes.begin();
	for (;it != newEmptyNodes.end();++ it)
	{
		((TreeNode*)(*it))->hasCurRoundRemainEmptySonNode(true);
		((TreeNode*)(*it))->atDealing(true);	// to be processed in the next round
		__nodesHasEmptySon.insert(*it);
	}
}

void SuffixTreeImpl:: print()
{
	__suffixTree->print(0);
}

bool TreeNode::isRootNode()
{
	return ROOT_BEGIN == __begin;
}

TreeNode::TreeNode( const char *str,int begin )
{
	__string = str;
	__begin = begin;
	__end   = TO_END;
	__parent = NULL;	// set later by addSon / setParent
	__hasCurCRoundRemainEmptySon = false;
	__needToDelete = false;
	__dealing = false;
}

TreeNode::~TreeNode()
{
	vector<TreeNode*>::iterator it = __sons.begin();
	for (;it != __sons.end();++ it)
	{
		delete (*it);
	}
}

char TreeNode::firstCh()
{
	return *(__string + __begin);
}

void TreeNode::addSon( TreeNode *node )
{
	__sons.insert(__sons.end(),node);
	node->setParent(this);
}

void TreeNode::addSons(vector<TreeNode*>* node )
{
	__sons.insert(__sons.end(),node->begin(),node->end());
	vector<TreeNode*>::iterator it = node->begin();
	for (;it != node->end();++ it)
	{
		(*it)->setParent(this);
	}
}

int TreeNode::sonCount()
{
	return __sons.size();
}

void TreeNode::setParent( TreeNode *parent )
{
	__parent = parent;
}

void TreeNode::setBegin( int beginPos )
{
	__begin = beginPos;
}

int TreeNode::getBegin()
{
	return __begin;
}

void TreeNode::setEnd( int endPos )
{
	__end = endPos;
}

int TreeNode::getEnd()
{
	return __end;
}

void TreeNode::print( int spaceCount )
{
	for (int i = 0;i < spaceCount;++ i)
	{
		printf(" ");
	}
	if(__begin >= 0 && __begin < (int)strlen(__string))	// the root (__begin == -1) has no label to print
	{
		int lastPos = (TO_END == __end ? strlen(__string) - 1 : __end);
		for (int i = __begin;i <= lastPos;++ i)
		{
			printf("%c",*(__string + i));
		}
	}
	printf("\r\n");
	vector<TreeNode*>::iterator it = __sons.begin();
	for (;it != __sons.end();++ it)
	{
		(*it)->print(spaceCount + 4);
	}
}

int TreeNode::findSonStartWith( char ch )
{
	vector<TreeNode*>::iterator it = __sons.begin();
	for (;it != __sons.end();++ it)
	{
		if(ch == (*it)->firstCh())
		{
			return it - __sons.begin();
		}
	}
	return FIND_NO_SON_START_WITH_CH;
}


void TreeNode::combineWithSonsStartWith( int curPos ,set<TreeNode*>& nextHasSonEmptyNodes)
{
	this->hasCurRoundRemainEmptySonNode(false);
	int offset = findSonStartWith(*(__string + curPos));
	if(FIND_NO_SON_START_WITH_CH != offset)
	{
		vector<TreeNode*>::iterator it = __sons.begin() + offset;
		TreeNode *foundSonNode = *it;
		if(foundSonNode->hasCurRoundRemainEmptySonNode())
		{
			foundSonNode->combineWithSonsStartWith(curPos,nextHasSonEmptyNodes);
			this->combineWithSonsStartWith(curPos,nextHasSonEmptyNodes);
		}
		else
		{
			if(1 == sonCount())
			{
				TreeNode *combineNode;
				if(this->isRootNode())
				{
					combineNode = new TreeNode(__string,foundSonNode->getBegin());
					this->removeSon(foundSonNode);
					this->addSon(combineNode);
					combineNode->addSon(foundSonNode);
				}
				else
				{
					combineNode = this;
				}
				upMoveCombine(combineNode,foundSonNode,curPos,nextHasSonEmptyNodes);
			}
			else
			{
				TreeNode *combineNode = new TreeNode(__string,foundSonNode->getBegin());
				this->removeSon(foundSonNode);
				this->addSon(combineNode);
				combineNode->addSon(foundSonNode);
				upMoveCombine(combineNode,foundSonNode,curPos,nextHasSonEmptyNodes);
			}
		}
	}
	else
	{
		TreeNode* newLeafNode = new TreeNode(__string,curPos);			
		newLeafNode->setParent(this);
		addSon(newLeafNode);
	}
}

void TreeNode::upMoveCombine(TreeNode* combineNode,TreeNode* targetNode,int curPos,set<TreeNode*>& nextHasSonEmptyNodes)
{
	if(1 == targetNode->stringLength(curPos))	// the target's label is a single character, so it is absorbed completely
	{
		targetNode->parent()->removeSon(targetNode);
		combineNode->addSons(targetNode->getSons());
		combineNode->setEnd(targetNode->getBegin());
		targetNode->removeAllSons();
		if(targetNode->atDealing())
		{
			targetNode->markDelete(true);	// still queued for this round: defer deletion to addChar
		}
		else
		{
			delete targetNode;
		}
	}
	else
	{
		combineNode->setEnd(targetNode->getBegin());
		targetNode->setBegin(targetNode->getBegin() + 1);
	}
	nextHasSonEmptyNodes.insert(combineNode);
}


int TreeNode::stringLength( int curPos )
{
	if(TO_END == __end)
	{
		return curPos - __begin + 1;
	}
	else
	{
		return __end - __begin + 1;
	}
}

vector<TreeNode*>* TreeNode::getSons()
{
	return &__sons;
}

void TreeNode::removeAllSons()
{
	__sons.clear();
}

TreeNode* TreeNode::parent()
{
	return __parent;
}

void TreeNode::hasCurRoundRemainEmptySonNode( bool hasCurCRoundRemainEmptySon )
{
	__hasCurCRoundRemainEmptySon = hasCurCRoundRemainEmptySon;
}

bool TreeNode::hasCurRoundRemainEmptySonNode()
{
	return __hasCurCRoundRemainEmptySon;
}

void TreeNode::removeSon( TreeNode* node )
{
	vector<TreeNode*>::iterator it = __sons.begin();
	for (;it != __sons.end();)
	{
		if(*it == node)
		{
			it = __sons.erase(it);
		}
		else
		{
			++ it;
		}
	}
}

bool TreeNode::markDelete()
{
	return __needToDelete;
}

void TreeNode::markDelete( bool delete_ )
{
	__needToDelete = delete_;
}
bool TreeNode::atDealing()
{
	return __dealing;
}

void TreeNode::atDealing( bool dealing )
{
	__dealing = dealing;
}