Classical approaches to regular expression searching


tricky1997


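This article walks through a key step in regular expression matching: converting an NFA (nondeterministic finite automaton) into a DFA (deterministic finite automaton), and explains the idea behind the conversion. Building the DFA yields a string matching algorithm with O(n) search time. The article also covers how adding a self-loop to the NFA adapts it to text searching, and gives an overview of the DFA construction algorithm.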


5.3.1 Thompson's NFA simulation

This is the basis of many current related algorithms. The book does not describe it in detail, so I will not study it here.

5.3.2 Using a deterministic automaton

One of the early achievements in string matching was the O(n) time algorithm to search for a regular expression in a text. As explained, the technique consists of converting the regular expression into a DFA and then searching the text using the DFA. The simplest solution is to first build an NFA with a technique like those shown in the previous sections (e.g., Thompson or Glushkov) and then convert the NFA into a DFA.

The usual approach today is to first convert the regular expression into an NFA (e.g., Thompson or Glushkov) and then convert that NFA into a DFA.

This algorithm, which we call DFA Classical, can be found in any classical book of compilers, such as [ASU86]. The main idea is as follows. When we traverse the text using a nondeterministic automaton, a number of transitions can be followed and a set of states become active. However, a DFA has exactly one active state at a time. So the corresponding deterministic automaton is defined over the set of states of the nondeterministic automaton. The key idea is that the unique current state of the DFA is the set of current states of the NFA.

DFA Classical is a very classic algorithm (although I know nothing about it at the moment). The central sentence of this paragraph: The key idea is that the unique current state of the DFA is the set of current states of the NFA.

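Definition: The ε-closure of a state s in an NFA, E(s), is the set of states of the NFA that can be reached from s by ε-transitions.

E(s) is the set of states reachable from s through ε-transitions. For example, in the Glushkov construction E(s) = {s}, because that NFA has no ε-transitions. In the Thompson construction, which does have ε-transitions, E(s) can contain more states.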

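We can now give a formal definition of the conversion of the NFA into a DFA. Let the NFA be (Q, Σ, I, F, Δ) according to Section 1.3.3. Then the DFA is defined as

$$(\wp(Q),\ \Sigma,\ E(I),\ F',\ \delta)$$

where

$$F' = \{\, S \in \wp(Q) : S \cap F \neq \emptyset \,\}$$

and

$$\delta(S, \sigma) = \bigcup_{s \in S,\ (s, \sigma, s') \in \Delta} E(s')$$

that is, for every active state s of S we follow all the possible transitions to states s′ by the character σ and then follow all the possible ε-transitions from s′.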

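This is the formal definition, and I could not follow the formulas and symbols. In my experience, not understanding it is no big problem, though understanding it is better.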

Since the DFA is built on the set of states of the NFA, its worst-case size is O(2^m) states, which is exponential. This makes the approach suitable for small regular expressions only. In practice, however, most of those states are not reachable from the initial state and therefore do not need to be built.

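Converting an NFA into a DFA has a worst-case size of O(2^m) states, so it seems suitable only for small regular expressions. In practice, however, many of those states are never reached.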

We now give an algorithm that obtains the DFA from the NFA by building only the reachable states. The algorithm uses sets of NFA states as identifiers for the DFA states. A simple way to represent these sets is to use a boolean array. Note that a bit-parallel representation is also possible, and it permits not only more compact storage but also faster handling of the set union and other required set operations. We give specific bit-parallel algorithms in Section 5.4. For now, we use just an abstract representation of the sets of states.

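The book next gives an algorithm that constructs the DFA from only the reachable states of the NFA. The algorithm uses sets of NFA states as identifiers for the DFA states. These sets can also be stored as bit vectors, which saves space and makes bit-parallel algorithms possible.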
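As a small illustration of the bit-parallel idea (not the Section 5.4 algorithms themselves), a set of NFA states can be stored in a single integer, with bit i set when state i belongs to the set. The helper names `to_mask` and `from_mask` are hypothetical, and the sketch assumes the states are numbered from 0.

```python
def to_mask(states):
    """Pack a set of state numbers into one integer (bit i <=> state i)."""
    mask = 0
    for s in states:
        mask |= 1 << s
    return mask


def from_mask(mask):
    """Unpack an integer bitmask back into a set of state numbers."""
    return {i for i in range(mask.bit_length()) if (mask >> i) & 1}


a = to_mask({0, 2})        # binary 101
b = to_mask({1, 2})        # binary 110

union = a | b              # set union is a single OR
intersection = a & b       # set intersection is a single AND
is_final = (a & to_mask({2})) != 0   # "does the set contain a final state?"
```

With this encoding, the union needed when following transitions costs one machine operation per word instead of one step per state.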

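The book then gives an algorithm for computing the ε-closure, which I omit here. Note that in an NFA produced by the Glushkov construction, E(s) = {s}.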
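Since the book's ε-closure algorithm is omitted here, the following is only a minimal sketch of one common way to compute E(s), using a depth-first traversal. The function name `epsilon_closure` and the dictionary `eps` (mapping a state to the states reachable by one ε-transition) are assumptions made for this example, not the book's notation.

```python
def epsilon_closure(state, eps):
    """Return E(state): `state` plus everything reachable by ε-transitions.

    `eps` is an assumed representation: eps[s] lists the states reachable
    from s by a single ε-transition.
    """
    closure = {state}
    stack = [state]
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


# Hypothetical Thompson-style fragment: 0 -ε-> 1, 0 -ε-> 2, 2 -ε-> 3
thompson_eps = {0: [1, 2], 2: [3]}
thompson_closure = epsilon_closure(0, thompson_eps)

# A Glushkov NFA has no ε-transitions, so the closure is just the state itself.
glushkov_closure = epsilon_closure(0, {})
```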

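After that, the book gives the DFA construction algorithm. It looks like the classical one, so I will check my textbook to see whether it is there.

Sure enough, it is in the textbook. This method is probably in every classic textbook.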
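The construction can be sketched as follows. This is a minimal version of the textbook subset construction that builds only the reachable DFA states, not a transcription of the book's pseudocode. The representation is an assumption for the example: `delta[(s, c)]` is the set of NFA states reached from s by character c, and `eps[s]` lists the ε-successors of s.

```python
from collections import deque


def closure(states, eps):
    """ε-closure of a set of NFA states."""
    seen, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)


def build_dfa(alphabet, initial, finals, delta, eps):
    """Subset construction restricted to states reachable from the start."""
    start = closure({initial}, eps)
    trans, accepting = {}, set()
    reachable = {start}
    queue = deque([start])
    while queue:
        S = queue.popleft()
        if S & finals:               # a DFA state is final if it holds a final NFA state
            accepting.add(S)
        for c in alphabet:
            targets = set()
            for s in S:              # follow every transition by c from every active state
                targets |= delta.get((s, c), set())
            T = closure(targets, eps)
            trans[(S, c)] = T
            if T not in reachable:   # each DFA state is built at most once
                reachable.add(T)
                queue.append(T)
    return start, accepting, trans, reachable


# Hypothetical NFA over {a, b} for "any string ending in ab":
# 0 -a-> {0, 1}, 0 -b-> {0}, 1 -b-> {2}; state 2 is final.
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
start, accepting, trans, reachable = build_dfa("ab", 0, {2}, delta, {})
```

In this example the NFA has 3 states, so up to 2^3 = 8 subsets exist, but only {0}, {0, 1} and {0, 2} are ever reached.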

This is not guaranteed in general. Different DFAs may exist to recognize the same language. Moreover, our construction does not guarantee that the result has the minimum size. To ensure this we have to minimize the DFA after we build it. Minimization of DFAs is a standard technique that can be found in a classical book such as [ASU86]. We content ourselves with the simple construction, which in most cases produces a DFA of reasonable size.

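The DFA we construct is not necessarily minimal, so a separate minimization step is needed to get the smallest one. The benefit of a smaller automaton goes without saying.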
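For reference, one standard minimization technique is partition refinement in the style of Moore's algorithm. The sketch below is an assumption-laden illustration rather than the method from [ASU86] verbatim: it expects a complete DFA given as `trans[(state, char)]`, and the function name `minimize` is made up for this example. It returns a mapping from each state to the number of its equivalence class.

```python
def minimize(states, alphabet, accepting, trans):
    """Group equivalent DFA states by repeated partition refinement.

    Assumes a complete DFA: trans[(s, c)] exists for every state and character.
    Returns {state: class_id}; states with the same id can be merged.
    """
    # Start with two classes: accepting and non-accepting states.
    block = {s: int(s in accepting) for s in states}
    while True:
        ids = {}
        new_block = {}
        for s in states:
            # Two states stay together only if they are in the same class
            # and their successors are in the same classes for every character.
            signature = (block[s],) + tuple(block[trans[(s, c)]] for c in alphabet)
            new_block[s] = ids.setdefault(signature, len(ids))
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block         # no class was split, so the partition is stable
        block = new_block


# Hypothetical DFA over {a, b} for "strings ending in a" with a redundant state:
# B and C are both accepting and behave identically.
states = ["A", "B", "C"]
trans = {
    ("A", "a"): "B", ("A", "b"): "A",
    ("B", "a"): "C", ("B", "b"): "A",
    ("C", "a"): "B", ("C", "b"): "A",
}
classes = minimize(states, "ab", {"B", "C"}, trans)
```

Here B and C end up in the same class, so the minimal DFA has two states instead of three.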

Searching with the DFA

The point of building the DFA is to guarantee a linear search time of O(n). This is achievable because we need to cross exactly one transition per text character read. However, we need to modify the automaton in order to use it for text searching. The modification consists of adding a self-loop to the initial state of the NFA, which can be crossed by any character.

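The point of building the DFA is linear search time. To make the automaton useful for text matching we have to modify it slightly, namely by adding a self-loop to the initial state.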

If the original automaton recognizes the language L(RE), then after this modification the automaton recognizes Σ*L(RE). Figure 5.13 shows the resulting DFA after adding the self-loop to the Glushkov NFA of Figure 5.7.

Figure 5.13: DFA obtained after adding an initial self-loop to the Glushkov automaton of Figure 5.7. It is equivalent to the regular expression (A|C|G|T)* (AT|GA) ((AG|AAA)*).

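This is similar to drawing a state-transition diagram in digital electronics, where you add the various states that were never stated explicitly.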

The complete search algorithm is depicted in Figure 5.14. The total complexity is O(m² 2^m + n) in the worst case. The extra space needed to represent the DFA is O(m 2^m) bits.
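Figure 5.14 is not reproduced in these notes, so the following is only a rough sketch of the overall idea under stated assumptions: an ε-free (Glushkov-style) NFA, a self-loop added to its initial state, a determinization of the result, and a scan that reports every text position where a match ends. The function names and the `delta[(s, c)]` representation are hypothetical.

```python
def add_self_loop(alphabet, initial, delta):
    """Copy delta and let the initial state loop to itself on every character."""
    looped = {key: set(targets) for key, targets in delta.items()}
    for c in alphabet:
        looped.setdefault((initial, c), set()).add(initial)
    return looped


def determinize(alphabet, initial, finals, delta):
    """Subset construction for an ε-free NFA, building reachable states only."""
    start = frozenset({initial})
    trans, seen, todo = {}, {start}, [start]
    while todo:
        S = todo.pop()
        for c in alphabet:
            T = frozenset(t for s in S for t in delta.get((s, c), ()))
            trans[(S, c)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    accepting = {S for S in seen if S & finals}
    return start, accepting, trans


def search(text, start, accepting, trans):
    """Cross one DFA transition per character; report match end positions."""
    state, hits = start, []
    for i, c in enumerate(text):
        # A character outside the alphabet can only take the self-loop,
        # which leaves just the initial NFA state active.
        state = trans.get((state, c), start)
        if state in accepting:
            hits.append(i)
    return hits


# Hypothetical NFA for the pattern "ab": 0 -a-> 1, 1 -b-> 2, state 2 final.
alphabet = "ab"
delta = add_self_loop(alphabet, 0, {(0, "a"): {1}, (1, "b"): {2}})
start, accepting, trans = determinize(alphabet, 0, {2}, delta)
hits = search("aabab", start, accepting, trans)   # occurrences end at 2 and 4
```

The scan does a constant amount of work per character, which is where the O(n) search time comes from; the exponential term in the total complexity belongs to the preprocessing that builds the DFA.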