An Implementation of Double-Array Trie

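This article introduces the principles and implementation of the double-array trie, an efficient digital search tree. A double-array trie is a special kind of deterministic finite automaton (DFA) used for fast string lookup. The article explains its structure, compression technique, and insertion and deletion operations, and presents a practical implementation.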


Contents

  1. What is Trie?
  2. What Does It Take to Implement a Trie?
  3. Triple-Array Trie
  4. Double-Array Trie
  5. Suffix Compression
  6. Key Insertion
  7. Key Deletion
  8. Double-Array Pool Allocation
  9. An Implementation
  10. Download
  11. Other Implementations
  12. References

What is Trie?

Trie is a kind of digital search tree. (See [Knuth1972] for details on digital search trees.) [Fredkin1960] introduced the term trie, which is derived from "Retrieval".

Trie Example

Trie is an efficient indexing method. It is also a kind of deterministic finite automaton (DFA) (see [Cohen1990], for example, for the definition of DFA). Within the tree structure, each node corresponds to a DFA state, and each (directed) labeled edge from a parent node to a child node corresponds to a DFA transition. The traversal starts at the root node. Then, from head to tail, the characters of the key string are taken one by one to determine the next state to go to. The edge labeled with the same character is chosen to walk. Notice that each step of such walking consumes one character from the key and descends one step down the tree. If the key is exhausted and a leaf node is reached, then we arrive at the exit for that key. If we get stuck at some node, either because there is no branch labeled with the current character or because the key is exhausted at an internal node, then the key is not recognized by the trie.

Notice that the time needed to traverse from the root to a leaf does not depend on the size of the database, but is proportional to the length of the key. Therefore, it is usually much faster than a B-tree or any comparison-based indexing method in general cases. Its time complexity is comparable with hashing techniques.
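The walk described above can be sketched with nested dictionaries. This is only an illustration, not the structure developed in this article: the names build_trie and lookup are made up here, and the '#' end-of-key marker is borrowed from the examples used later in the text.

```python
def build_trie(keys):
    """Build a nested-dict trie; every key is terminated with '#'."""
    root = {}
    for key in keys:
        node = root
        for ch in key + "#":
            node = node.setdefault(ch, {})
    return root


def lookup(root, key):
    """Follow one edge per character; fail as soon as no edge matches."""
    node = root
    for ch in key + "#":
        if ch not in node:
            return False        # stuck: the key is not recognized
        node = node[ch]
    return True


trie = build_trie(["pool", "prize", "preview"])
print(lookup(trie, "prize"), lookup(trie, "pri"))
```

The number of loop iterations in lookup depends only on the key length, not on how many keys the trie holds.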

In addition to efficiency, a trie also provides flexibility in searching for the closest path when the key is misspelled. For example, by skipping a certain character in the key while walking, we can fix an insertion typo. By walking toward all the immediate children of a node without consuming a character from the key, we can fix a deletion typo, or even a substitution typo if we drop the key character that has no branch to follow and descend to all the immediate children of the current node.

What Does It Take to Implement a Trie?

In general, a DFA is represented with a transition table, in which the rows correspond to the states, and the columns correspond to the transition labels. The data kept in each cell is then the next state to go to for a given state when the input is equal to the label.

This is an efficient method for the traversal, because every transition can be calculated by two-dimensional array indexing. However, in terms of space usage, this is rather extravagant because, in the case of a trie, most nodes have only a few branches, leaving the majority of the table cells blank.

Meanwhile, a more compact scheme is to use a linked list to store the transitions out of each state. But this results in slower access, due to the linear search.

Hence, table compression techniques that still allow fast access have been devised to solve the problem.

  1. [Johnson1975] (also explained in [Aho+1985] pp. 144-146) represented a DFA with four arrays, which can be simplified to three in the case of a trie. The transition table rows are allocated in an overlapping manner, allowing the free cells to be used by other rows.
  2. [Aoe1989] proposed an improvement on the three-array structure by reducing the arrays to two.

Triple-Array Trie

As explained in [Aho+1985] pp. 144-146, DFA compression can be done using four linear arrays, namely default, base, next, and check. However, in a case simpler than a lexical analyzer, such as a plain trie for information retrieval, the default array can be omitted. Thus, a trie can be implemented using three arrays according to this scheme.

Structure

The triple-array structure is composed of:

  1. base. Each element in base corresponds to a node of the trie. For a trie node s, base[s] is the starting index within the next and check pool (to be explained later) for the row of the node s in the transition table.
  2. next. This array, in coordination with check, provides a pool for the allocation of the sparse vectors for the rows in the trie transition table. The vector data, that is, the vector of transitions from every node, would be stored in this array.
  3. check. This array works in parallel to next. It marks the owner of every cell in next. This allows the cells next to one another to be allocated to different trie nodes. That means the sparse vectors of transitions from more than one node are allowed to be overlapped.

Definition 1. For a transition from state s to t which takes character c as the input, the condition maintained in the triple-array trie is:

    check[base[s] + c] = s
    next[base[s] + c] = t

Triple-Array Structure

Walking

According to Definition 1, the walking algorithm for a given state s and the input character c is:

    t := base[s] + c;
    if check[t] = s then
        next state := next[t]
    else
        fail
    endif
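As a minimal sketch of this walk, the following uses three hand-built Python lists. The state numbers, the character codes in CODE, and the NONE marker are all assumptions made for the illustration. The rows of states 0 and 1 are deliberately interleaved in the same pool to show how check tells them apart.

```python
NONE = -1                      # assumed marker for an unowned cell
CODE = {"a": 1, "b": 2, "c": 3}

# Transitions: 0 -a-> 1, 0 -c-> 2, 1 -b-> 2.
# Both rows start at index 0, so state 1's cell sits between state 0's cells.
base = [0, 0, 0]
check = [NONE, 0, 1, 0]        # owner of each pool cell
nxt = [NONE, 1, 2, 2]          # target state stored in each pool cell


def step(s, ch):
    """Return the next state, or None when the transition does not exist."""
    t = base[s] + CODE[ch]
    if t < len(check) and check[t] == s:
        return nxt[t]
    return None


print(step(0, "a"), step(0, "b"), step(1, "b"))
```

Cell 2 belongs to state 1, so asking state 0 for 'b' fails even though the cell is occupied.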

Construction

To insert a transition that takes character c to traverse from a state s to another state t, the cell next[base[s] + c] must be made available. If it is already vacant, we are lucky. Otherwise, either the entire transition vector for the current owner of the cell or that of the state s itself must be relocated. The estimated cost for each case can determine which one to move. After finding free slots to place the vector, the transition vector must be recalculated as follows. Assuming the new place begins at b, the procedure for the relocation is:

    Procedure Relocate(s : state; b : base_index)
    { Move base for state s to a new place beginning at b }
    begin
        foreach input character c for the state s
        { i.e. foreach c such that check[base[s] + c] = s }
        begin
            check[b + c] := s;                 { mark owner }
            next[b + c] := next[base[s] + c];  { copy data }
            check[base[s] + c] := none         { free the cell }
        end;
        base[s] := b
    end

Triple-Array Relocation
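The relocation procedure can be tried out on small lists. The sketch below assumes that the caller has already verified that every target cell b + c is free and inside the arrays; NONE and the CODES tuple are illustrative choices, not part of the original scheme.

```python
NONE = -1
CODES = (1, 2, 3)              # assumed character codes: a=1, b=2, c=3


def relocate(base, check, nxt, s, b):
    """Move the transition row of state s so that it starts at index b."""
    old = base[s]
    for c in CODES:
        src = old + c
        if src < len(check) and check[src] == s:
            check[b + c] = s           # mark owner
            nxt[b + c] = nxt[src]      # copy data
            check[src] = NONE          # free the old cell
            nxt[src] = NONE
    base[s] = b


# Transitions: 0 -a-> 1, 0 -c-> 2, 1 -b-> 2, with room to move state 0.
base = [0, 0, 0]
check = [NONE, 0, 1, 0, NONE, NONE, NONE, NONE]
nxt = [NONE, 1, 2, 2, NONE, NONE, NONE, NONE]
relocate(base, check, nxt, 0, 3)
print(base, check, nxt)
```

Only the row of state 0 moves; the cell owned by state 1 is untouched, and no other state has to be updated because targets are stored as state numbers.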

Double-Array Trie

The triple-array structure for implementing a trie appears to be well defined, but is still not practical to keep in a single file. The next/check pool can be kept in a single array of integer pairs, but the base array does not grow in parallel with the pool, and is therefore usually stored separately.

To solve this problem, [Aoe1989] reduced the structure to two parallel arrays. In the double-array structure, base and next are merged, resulting in only two parallel arrays, namely base and check.

Structure

Instead of being referenced indirectly through state numbers as in the triple-array trie, nodes in the double-array trie are linked directly within the base/check pool.

Definition 2. For a transition from state s to t which takes character c as the input, the condition maintained in the double-array trie is:

    check[base[s] + c] = s
    base[s] + c = t

Double-Array Structure

Walking

According to Definition 2, the walking algorithm for a given state s and the input character c is:

    t := base[s] + c;
    if check[t] = s then
        next state := t
    else
        fail
    endif
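A small hand-built example may help here. In the sketch below, states are cell indices, the root is placed at cell 1, and the character codes and NONE marker are assumptions chosen for the illustration.

```python
NONE = -1
ROOT = 1
CODE = {"a": 1, "b": 2, "c": 3}

# Transitions: 1 -a-> 2, 1 -c-> 4, 2 -b-> 3 (states are pool indices).
base = [0, 1, 1, 0, 0]
check = [NONE, NONE, 1, 2, 1]


def step(s, ch):
    """One transition: the target cell index is itself the next state."""
    t = base[s] + CODE[ch]
    if t < len(check) and check[t] == s:
        return t
    return None


def walk(key):
    """Return the state reached by key, or None if the walk gets stuck."""
    s = ROOT
    for ch in key:
        s = step(s, ch)
        if s is None:
            return None
    return s


print(walk("ab"), walk("c"), walk("b"))
```

Compared with the three-array version, there is no separate lookup of the target state: base[s] + c is the target.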

Construction

The construction of a double-array trie is in principle the same as that of a triple-array trie. The difference is the base relocation:

    Procedure Relocate(s : state; b : base_index)
    { Move base for state s to a new place beginning at b }
    begin
        foreach input character c for the state s
        { i.e. foreach c such that check[base[s] + c] = s }
        begin
            check[b + c] := s;                 { mark owner }
            base[b + c] := base[base[s] + c];  { copy data }
            { the node base[s] + c is to be moved to b + c;
              hence, for any i for which check[i] = base[s] + c,
              update check[i] to b + c }
            foreach input character d for the node base[s] + c
            begin
                check[base[base[s] + c] + d] := b + c
            end;
            check[base[s] + c] := none         { free the cell }
        end;
        base[s] := b
    end

Double-Array Relocation
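The extra step, re-parenting the grandchildren, is easier to see on concrete data. The sketch below assumes the caller has already checked that all target cells b + c are free and in range; the CODES tuple and the NONE marker are illustrative.

```python
NONE = -1
CODES = (1, 2, 3)              # assumed character codes: a=1, b=2, c=3


def relocate(base, check, s, b):
    """Move the children of state s so that they start at index b."""
    old = base[s]
    for c in CODES:
        src = old + c
        if src >= len(check) or check[src] != s:
            continue
        dst = b + c
        check[dst] = s                 # mark owner
        base[dst] = base[src]          # copy data
        for d in CODES:                # children of src now belong to dst
            child = base[src] + d
            if child < len(check) and check[child] == src:
                check[child] = dst
        check[src] = NONE              # free the old cell
    base[s] = b


def step(base, check, s, c):
    t = base[s] + c
    return t if t < len(check) and check[t] == s else None


# Transitions before the move: 1 -a-> 2, 1 -c-> 4, 2 -b-> 3.
base = [0, 1, 1, 0, 0, 0, 0, 0, 0]
check = [NONE, NONE, 1, 2, 1, NONE, NONE, NONE, NONE]
relocate(base, check, 1, 4)
print(base, check)
```

After the move, the node formerly at cell 2 lives at cell 5, and its child at cell 3 names 5 as its parent, so the same keys are still accepted.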

Suffix Compression

[Aoe1989] also suggested a storage compression strategy: non-branching suffixes are split off into a separate string storage, called tail, so that the remaining non-branching steps are reduced to a mere string comparison.

With the two separate data structures, double-array branches and suffix-spool tail, key insertion and deletion algorithms must be modified accordingly.
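A lookup over the combined structure might look like the sketch below. It assumes one particular encoding that the text above does not spell out: a negative base[s] marks s as a separate node, and -base[s] indexes a list of tail strings. A real tail pool would more likely be a single character buffer; the list of strings is a simplification for the example.

```python
NONE = -1
ROOT = 1
CODE = {ch: i + 1 for i, ch in enumerate("#abc")}   # '#'=1, 'a'=2, 'b'=3, 'c'=4

# Keys "abc#" and "bca#": only the first character branches in the double-array.
base = [0, 1, 0, -1, -2]       # base[3] and base[4] point into tail (assumed encoding)
check = [NONE, NONE, NONE, 1, 1]
tail = [None, "bc#", "ca#"]


def contains(key):
    key += "#"
    s = ROOT
    for i, ch in enumerate(key):
        if base[s] < 0:                        # separate node: compare the suffix
            return tail[-base[s]] == key[i:]
        if ch not in CODE:
            return False
        t = base[s] + CODE[ch]
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return base[s] < 0 and tail[-base[s]] == ""


print(contains("abc"), contains("ab"), contains("bca"))
```

Once a separate node is reached, the rest of the key is checked with one string comparison instead of one array probe per character.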

Key Insertion

To insert a new key, the branching position can be found by traversing the trie with the key, character by character, until it gets stuck. The state where there is no branch to follow is the very place to insert a new edge, labeled by the failing character. However, with the branch-tail structure, the insertion point can be either in the branch or in the tail.

1. When the branching point is in the double-array structure

Suppose that the new key is a string a_1a_2...a_{h-1}a_ha_{h+1}...a_n, where a_1a_2...a_{h-1} traverses the trie from the root to a node s_r in the double-array structure, and there is no edge labeled a_h that goes out of s_r. The algorithm called A_INSERT in [Aoe1989] does as follows:

    From s_r, insert edge labeled a_h to new node s_t;
    Let s_t be a separate node pointing to a string a_{h+1}...a_n in tail pool.

A_INSERT algorithm

2. When the branching point is in the tail pool

Since the path through a tail string has no branch, and therefore correspondsto exactly one key, suppose that the key corresponding to the tail is

    a_1a_2...a_{h-1}a_h...a_{h+k-1}b_1...b_m,

where a_1a_2...a_{h-1} is in the double-array structure, and a_h...a_{h+k-1}b_1...b_m is in tail. Suppose that the substring a_1a_2...a_{h-1} traverses the trie from the root to a node s_r.

And suppose that the new key is in the form

    a_1a_2...a_{h-1}a_h...a_{h+k-1}a_{h+k}...a_n,

where a_{h+k} <> b_1. The algorithm called B_INSERT in [Aoe1989] does as follows:

    From s_r, insert straight path with a_h...a_{h+k-1}, ending at a new node s_t;
    From s_t, insert edge labeled b_1 to new node s_u;
    Let s_u be a separate node pointing to a string b_2...b_m in tail pool;
    From s_t, insert edge labeled a_{h+k} to new node s_v;
    Let s_v be a separate node pointing to a string a_{h+k+1}...a_n in tail pool.

B_INSERT algorithm
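The two cases can be illustrated without any array management by modelling branch nodes as dictionaries and separate nodes as plain tail strings. This is a simplified stand-in for the double-array and tail pool, not Aoe's actual procedure; the function names are made up, and keys are assumed never to contain the '#' marker.

```python
def insert(root, key):
    key += "#"                          # end-of-key marker
    node = root
    for i, ch in enumerate(key):
        child = node.get(ch)
        rest = key[i + 1:]
        if child is None:               # A_INSERT: branch point is in the branches
            node[ch] = rest             # new separate node holding the remainder
            return
        if isinstance(child, str):      # separate node reached: child is its tail
            if child == rest:
                return                  # key already present
            k = 0                       # B_INSERT: length of the shared tail prefix
            while child[k] == rest[k]:
                k += 1
            top = cur = {}
            for c in child[:k]:         # straight path a_h ... a_{h+k-1}
                cur[c] = {}
                cur = cur[c]
            cur[child[k]] = child[k + 1:]   # edge b_1, tail b_2...b_m
            cur[rest[k]] = rest[k + 1:]     # edge a_{h+k}, tail a_{h+k+1}...a_n
            node[ch] = top
            return
        node = child


def contains(root, key):
    key += "#"
    node = root
    for i, ch in enumerate(key):
        child = node.get(ch)
        if child is None:
            return False
        if isinstance(child, str):
            return child == key[i + 1:]
        node = child
    return False


trie = {}
for word in ["pool", "prepare", "preview", "prize", "produce", "producer", "progress"]:
    insert(trie, word)
print(trie["p"]["o"])
```

Inserting "prize" after "pool" goes through the B_INSERT case with k = 0, which leaves "ol#" as the tail of "pool"; inserting "producer" after "produce" creates the straight path d, u, c, e before branching.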

Key Deletion

To delete a key from the trie, all we need to do is delete the tail block occupied by the key, and all double-array nodes belonging exclusively to the key, without touching any node belonging to other keys.

Consider a trie which accepts a language K = {pool#, prepare#, preview#, prize#, produce#, producer#, progress#}:

example trie

The key "pool#" can be deleted by removing the tail string "ol#" from the tail pool, and node 3 from the double-array structure. This is the simplest case.

To remove the key "produce#", it is sufficient to delete node 14 from the double-array structure. But the resulting trie will not obey the convention that every node in the double-array structure, except the separate nodes which point to tail blocks, must belong to more than one key. The path from node 10 on will belong solely to the key "producer#".

But there is no harm in violating this rule. The only drawback is that the trie becomes less compact. The traversal, insertion and deletion algorithms remain intact. Therefore, this rule should be relaxed, for the sake of simplicity and efficiency of the deletion algorithm. Otherwise, there must be extra steps to examine other keys in the same subtree ("producer#" for the deletion of "produce#") to see whether any node needs to be moved from the double-array structure to the tail pool.

Suppose further that, having removed "produce#" as such (by removing only node 14), we also need to remove "producer#" from the trie. What we have to do is remove the string "#" from tail, and remove nodes 15, 13, 12, 11, 10 (which now belong solely to the key "producer#") from the double-array structure.

We can thus summarize the algorithm to delete a key k = a_1a_2...a_{h-1}a_h...a_n, where a_1a_2...a_{h-1} is in the double-array structure and a_h...a_n is in the tail pool, as follows:

    Let s_r := the node reached by a_1a_2...a_{h-1};
    Delete a_h...a_n from tail;
    s := s_r;
    repeat
        p := parent of s;
        Delete node s from double-array structure;
        s := p
    until s = root or outdegree(s) > 0.

Where outdegree(s) is the number of child nodes of s.
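The deletion loop can be exercised on a tiny hand-built trie. The sketch below assumes that a negative base marks a separate node whose tail is tail[-base[s]], that check[s] holds the parent of s, and that keys use only characters in CODE; all of these are choices made for the illustration.

```python
NONE = -1
ROOT = 1
CODE = {ch: i + 1 for i, ch in enumerate("#abc")}   # '#'=1, 'a'=2, 'b'=3, 'c'=4

# Keys "ab#", "ac#", "b#": 1 -a-> 3, 1 -b-> 4, 3 -b-> 5, 3 -c-> 6.
base = [0, 1, 0, 2, -1, -2, -3]
check = [NONE, NONE, NONE, 1, 1, 3, 3]
tail = [None, "#", "#", "#"]


def outdegree(s):
    """Number of cells whose owner is s."""
    return sum(1 for c in CODE.values()
               if 0 <= base[s] + c < len(check) and check[base[s] + c] == s)


def delete(key):
    key += "#"
    s, i = ROOT, 0
    while base[s] >= 0:                    # walk the branch part
        if i == len(key) or key[i] not in CODE:
            return False
        t = base[s] + CODE[key[i]]
        if t >= len(check) or check[t] != s:
            return False
        s, i = t, i + 1
    if tail[-base[s]] != key[i:]:
        return False
    tail[-base[s]] = None                  # delete the tail block
    while True:                            # repeat ... until
        p = check[s]                       # parent of s
        check[s], base[s] = NONE, 0        # delete node s
        s = p
        if s == ROOT or outdegree(s) > 0:
            return True


print(delete("ac"), check)
print(delete("ab"), check)
```

Deleting "ac" frees only cell 6 because node 3 still has a child; deleting "ab" afterwards frees cell 5 and then node 3, stopping at the root.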

Double-Array Pool Allocation

When inserting a new branch for a node, it is possible that the array element for the new branch has already been allocated to another node. In that case, relocation is needed. The efficiency-critical part then turns out to be the search for a new place. A brute-force algorithm iterates along the check array to find an empty cell to place the first branch, and then makes sure that there are empty cells for all other branches as well. The time used is therefore proportional to the size of the double-array pool and the size of the alphabet.

Suppose that there are n nodes in the trie, and the alphabet is of size m. The size of the double-array structure would be n + cm, where c is a coefficient that depends on the characteristics of the trie. The time complexity of the brute-force algorithm would then be O(nm + cm^2).

[Aoe1989] proposed a free-space list in the double-array structure to make the time complexity independent of the size of the trie, and dependent on the number of free cells only. The check array entry of each free cell is redefined to keep a pointer to the next free cell (called G-link):

Definition 3. Let r_1, r_2, ..., r_cm be the free cells in the double-array structure, ordered by position. G-link is defined as follows:

    check[0] = -r_1
    check[r_i] = -r_{i+1} ; 1 <= i <= cm-1
    check[r_cm] = -1

By this definition, negative check means unoccupied in the same sense as that for "none" check in the ordinary algorithm. This encoding scheme forms a singly-linked list of free cells. When searching for an empty cell, only cm free cells are visited, instead of all n + cm cells as in the brute-force algorithm.

This, however, can still be improved. Notice that for those cells with negative check, the corresponding base entries are not given any definition. Therefore, in our implementation, Aoe's G-link is modified to be a doubly-linked list by letting base of every free cell point to the previous free cell. This can speed up the insertion and deletion processes. And, for convenience in referencing the list head and tail, we let the list be circular. The zeroth node is dedicated to be the entry point of the list. And the root node of the trie will begin with cell number one.

Definition 4. Let r_1, r_2, ..., r_cm be the free cells in the double-array structure, ordered by position. G-link is defined as follows:

    check[0] = -r_1
    check[r_i] = -r_{i+1} ; 1 <= i <= cm-1
    check[r_cm] = 0
    base[0] = -r_cm
    base[r_1] = 0
    base[r_{i+1}] = -r_i ; 1 <= i <= cm-1
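To make Definition 4 concrete, the sketch below threads a given set of free cells into the circular list and then walks it from the head. The helper names and the use of 1 as a stand-in owner for occupied cells are assumptions for the example.

```python
def link_free_cells(base, check, free):
    """Thread the sorted free-cell positions into a circular doubly-linked list.

    Cell 0 is the list head. check[] stores the negated successor and
    base[] the negated predecessor of each free cell.
    """
    ring = [0] + list(free)
    for i, cell in enumerate(ring):
        check[cell] = -ring[(i + 1) % len(ring)]
        base[cell] = -ring[i - 1]


def iter_free(check):
    """Yield free cells in list order, starting from the head cell 0."""
    s = -check[0]
    while s != 0:
        yield s
        s = -check[s]


base = [0] * 8
check = [1] * 8                # pretend every cell is owned by node 1
link_free_cells(base, check, [2, 5, 6])
print(check, base, list(iter_free(check)))
```

For free cells 2, 5 and 6 this gives check[0] = -2, check[2] = -5, check[5] = -6, check[6] = 0 and base[0] = -6, base[2] = 0, base[5] = -2, base[6] = -5.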

Then, the search for the slots for a node with input symbol set P = {c_1, c_2, ..., c_p} needs to iterate over only the cells with negative check:

    { find least free cell s such that s > c_1 }
    s := -check[0];
    while s <> 0 and s <= c_1 do
        s := -check[s]
    end;
    if s = 0 then return FAIL;  { or reserve some additional space }

    { continue searching for the row, given that s matches c_1 }
    while s <> 0 do
        i := 2;
        while i <= p and check[s + c_i - c_1] < 0 do
            i := i + 1
        end;
        if i = p + 1 then return s - c_1;  { all cells required are free, so return it }
        s := -check[s]
    end;
    return FAIL;  { or reserve some additional space }
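The search can be run on a small pool as follows. The sketch adds a bounds check that the pseudocode leaves implicit, returns None for FAIL, and builds the free list with a hypothetical helper; occupied cells are given the stand-in owner 1.

```python
def link_free_cells(check, free):
    """Store the negated successor of each free cell; cell 0 is the head."""
    ring = [0] + list(free)
    for i, cell in enumerate(ring):
        check[cell] = -ring[(i + 1) % len(ring)]


def find_base(check, codes):
    """Return a base b such that b + c is free for every c in codes, else None.

    codes must be sorted ascending: c_1 < c_2 < ... < c_p.
    """
    c1 = codes[0]
    s = -check[0]
    while s != 0 and s <= c1:          # least free cell s with s > c_1
        s = -check[s]
    while s != 0:
        if all(s + c - c1 < len(check) and check[s + c - c1] < 0
               for c in codes[1:]):
            return s - c1              # all required cells are free
        s = -check[s]
    return None                        # FAIL: caller would extend the pool


check = [1] * 10
link_free_cells(check, [2, 5, 6, 8])
print(find_base(check, [1, 4]), find_base(check, [2, 3]), find_base(check, [1, 3]))
```

With free cells 2, 5, 6 and 8, the symbol set {1, 4} fits at base 1 (cells 2 and 5) and {2, 3} fits at base 3 (cells 5 and 6). The set {1, 3} fails even though cells 6 and 8 are free, because the last free cell stores check = 0 under Definition 4 and therefore does not pass the `< 0` test; the sketch keeps that behaviour of the pseudocode.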

The time complexity for free slot searching is reduced to O(cm^2). The relocation stage takes O(m^2). The total time complexity is therefore O(cm^2 + m^2) = O(cm^2).

It is useful to keep the free list ordered by position, so that access through the array becomes more sequential. This is beneficial when the trie is stored in a disk file or virtual memory, because disk caching or page swapping is then used more efficiently. So, when a cell s is freed for reuse, it should be put back in a way that maintains this order:

    t := -check[0];
    while t <> 0 and t < s do
        t := -check[t]
    end;
    { t now points to the free cell after s's place,
      or to the list head (0) if s lies beyond every free cell }
    check[s] := -t;
    check[-base[t]] := -s;
    base[s] := base[t];
    base[t] := -s;
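The following sketch puts a freed cell back into the ordered circular list. It stops the scan when it returns to the head cell 0, so a cell lying beyond every free cell is appended at the tail; the helper that builds the initial list and the stand-in owner 1 for occupied cells are assumptions for the example.

```python
def link_free_cells(base, check, free):
    """Build the circular doubly-linked free list; cell 0 is the head."""
    ring = [0] + list(free)
    for i, cell in enumerate(ring):
        check[cell] = -ring[(i + 1) % len(ring)]
        base[cell] = -ring[i - 1]


def free_cell(base, check, s):
    """Insert cell s into the free list, keeping it ordered by position."""
    t = -check[0]
    while t != 0 and t < s:
        t = -check[t]
    # t is the first free cell after s, or 0 (the head) if there is none
    prev = -base[t]
    check[s] = -t
    base[s] = -prev
    check[prev] = -s
    base[t] = -s


def forward(check):
    out, s = [], -check[0]
    while s != 0:
        out.append(s)
        s = -check[s]
    return out


base, check = [0] * 10, [1] * 10
link_free_cells(base, check, [2, 5, 8])
free_cell(base, check, 6)      # lands between 5 and 8
free_cell(base, check, 9)      # lands after the last free cell
print(forward(check))
```

Because the list is circular, inserting before the head cell 0 is the same operation as appending at the tail, so no special case is needed.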

Time complexity of freeing a cell is thus O(cm).

An Implementation

In my implementation, I designed the API with persistent data in mind. Tries can be saved to disk and loaded for use afterward. In newer versions, non-persistent usage is also possible. You can create a trie in memory, populate it with data, use it, and free it, without any disk I/O. Alternatively, you can load a trie from disk and save it to disk whenever you want.

The trie data is portable across platforms. The byte order on disk is always little-endian, and is read correctly on either little-endian or big-endian systems.
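The idea of a fixed on-disk byte order can be illustrated with Python's struct module. The record layout below (one base/check pair per cell, each a signed 32-bit integer) is a hypothetical layout invented for the example and is not the file format of the implementation described here.

```python
import struct


def dump_cells(base, check):
    """Pack (base, check) pairs as little-endian signed 32-bit integers."""
    return b"".join(struct.pack("<ii", b, c) for b, c in zip(base, check))


def load_cells(data):
    """Inverse of dump_cells; gives the same result on any host byte order."""
    pairs = list(struct.iter_unpack("<ii", data))
    return [b for b, _ in pairs], [c for _, c in pairs]


blob = dump_cells([1, -6], [-2, 0])
print(blob.hex(), load_cells(blob))
```

The "<" prefix in the format string forces little-endian encoding regardless of the machine the code runs on, which is what makes the bytes portable.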

The trie index is a 32-bit signed integer. This allows 2,147,483,646 (2^31 - 2) total nodes in the trie data, which should be sufficient for most problem domains. Each data entry can store a 32-bit integer value associated with it. This value can be used for any purpose, according to your needs. If you don't need to use it, just store some dummy value.

For sparse data compactness, the trie alphabet set should be continuous, but that is usually not the case in general character sets. Therefore, a map between the input character and the low-level alphabet set for the trie is created in the middle. You will have to define your input character set by listing its continuous ranges of character codes in a .abm (alphabet map) file when creating a trie. Then, each character will be automatically assigned an internal code from a continuous range of values.
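The mapping idea can be sketched as follows. The function name, the in-memory representation of the ranges, and the choice to start internal codes at 1 are all assumptions for the example; neither the .abm file syntax nor the implementation's actual code assignment is shown here.

```python
def build_alphabet_map(ranges):
    """Map characters in the given inclusive code-point ranges to 1, 2, 3, ...

    ranges is a list of non-overlapping (first, last) pairs.
    """
    mapping = {}
    code = 1
    for first, last in ranges:
        for cp in range(first, last + 1):
            mapping[chr(cp)] = code
            code += 1
    return mapping


# Lowercase ASCII letters followed by a block of Thai characters.
amap = build_alphabet_map([(0x61, 0x7A), (0x0E01, 0x0E5B)])
print(amap["a"], amap["z"], amap["\u0e01"], len(amap))
```

Two ranges that are thousands of code points apart end up as adjacent internal codes, so a node's transition row spans at most len(amap) cells.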

Download

Update: The double-array trie implementation has been simplified and rewritten from scratch in C, and is now named libdatrie. It is available under the terms of the GNU Lesser General Public License (LGPL):

SVN: svn co http://linux.thai.net/svn/software/datrie

The old C++ source code below is under the terms of the GNU Lesser General Public License (LGPL):

Other Implementations

References

  1. [Knuth1972] Knuth, D. E. The Art of Computer Programming Vol. 3, Sorting and Searching. Addison-Wesley. 1972.
  2. [Fredkin1960] Fredkin, E. Trie Memory. Communications of the ACM. Vol. 3:9 (Sep 1960). pp. 490-499.
  3. [Cohen1990] Cohen, D. Introduction to Theory of Computing. John Wiley & Sons. 1990.
  4. [Johnson1975] Johnson, S. C. YACC-Yet another compiler-compiler. Bell Lab. NJ. Computing Science Technical Report 32. pp.1-34. 1975.
  5. [Aho+1985] Aho, A. V., Sethi, R., Ullman, J. D. Compilers : Principles, Techniques, and Tools. Addison-Wesley. 1985.
  6. [Aoe1989] Aoe, J. An Efficient Digital Search Algorithm by Using a Double-Array Structure. IEEE Transactions on Software Engineering. Vol. 15, 9 (Sep 1989). pp. 1066-1077.
  7. [Virach+1993] Virach Sornlertlamvanich, Apichit Pittayaratsophon, Kriangchai Chansaenwilai. Thai Dictionary Data Base Manipulation using Multi-indexed Double Array Trie. 5th Annual Conference. National Electronics and Computer Technology Center. Bangkok. 1993. pp 197-206. (in Thai)
