Classical approaches to regular expression searching

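This post walks through a key step in regular expression matching: converting an NFA (nondeterministic finite automaton) into a DFA (deterministic finite automaton), and explains the idea behind the conversion. Building the DFA yields an O(n)-time text search algorithm. The post also covers how adding a self-loop to the NFA adapts it to text searching, and gives an overview of the DFA construction algorithm.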

5.3.1 Thompson's NFA simulation

This is the basis of many current related algorithms. The author does not describe it in detail, so I will not study it here.

5.3.2 Using a deterministic automaton

One of the early achievements in string matching was the O(n) time algorithm to search for a regular expression in a text. As explained, the technique consists of converting the regular expression into a DFA and then searching the text using the DFA. The simplest solution is to build first an NFA with a technique like those shown in the previous sections (e.g., Thompson or Glushkov) and then convert the NFA into a DFA.

The usual approach today is to first convert the regular expression into an NFA (e.g., Thompson or Glushkov) and then convert the NFA into a DFA.

 

This algorithm, which we call DFA Classical, can be found in any classical book of compilers, such as [ASU86]. The main idea is as follows. When we traverse the text using a nondeterministic automaton, a number of transitions can be followed and a set of states becomes active. However, a DFA has exactly one active state at a time. So the corresponding deterministic automaton is defined over the set of states of the nondeterministic automaton. The key idea is that the unique current state of the DFA is the set of current states of the NFA.

DFA Classical is a classic algorithm (although I knew nothing about it until now). The central sentence of this paragraph: The key idea is that the unique current state of the DFA is the set of current states of the NFA.
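To make that central sentence concrete for myself, here is a minimal sketch that simulates an ε-free NFA by tracking its set of active states. The NFA (`DELTA`, `FINAL`, states 0 to 2, recognizing (a|b)*ab) and the function names are my own assumptions for illustration, not taken from the book.

```python
def step(delta, active, ch):
    """Follow every transition labelled ch from every active NFA state."""
    nxt = set()
    for s in active:
        nxt |= delta.get((s, ch), set())
    return frozenset(nxt)


def active_sets(delta, start, text):
    """Return the sequence of active-state sets seen while reading text."""
    current = frozenset([start])
    trace = [current]
    for ch in text:
        current = step(delta, current, ch)
        trace.append(current)
    return trace


# Hypothetical NFA for (a|b)*ab: state 0 loops on a and b, 0 -a-> 1, 1 -b-> 2.
DELTA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
FINAL = {2}

trace = active_sets(DELTA, 0, "aab")
for active in trace:
    # Each printed set is exactly one state of the corresponding DFA.
    print(sorted(active), "accepting" if active & FINAL else "")
```

Each set in `trace` plays the role of a single DFA state. The DFA construction simply precomputes these sets and the moves between them instead of recomputing them for every text character.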

 

Definition. The ε-closure of a state s in an NFA, E(s), is the set of states of the NFA that can be reached from s by ε-transitions.

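E(s) is the set of states reachable from s via ε-transitions. For example, with the Glushkov construction E(s) = {s}, because the NFA has no ε-transitions. With the Thompson construction the NFA does contain ε-transitions, so E(s) may contain more states.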
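The definition can be sketched as a small graph traversal. The representation below (a dict `eps` mapping a state to the states reachable by one ε-transition) and the example automaton are assumptions of mine, not the book's code.

```python
def epsilon_closure(eps, s):
    """E(s): every state reachable from s using only ε-transitions, s included."""
    closure = {s}
    stack = [s]
    while stack:
        state = stack.pop()
        for nxt in eps.get(state, ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


# Hypothetical Thompson-style fragment: 0 -ε-> 1, 0 -ε-> 3, 1 -ε-> 2.
THOMPSON_EPS = {0: [1, 3], 1: [2]}
print(sorted(epsilon_closure(THOMPSON_EPS, 0)))

# A Glushkov-style NFA has no ε-transitions, so E(s) is just {s}.
GLUSHKOV_EPS = {}
print(sorted(epsilon_closure(GLUSHKOV_EPS, 0)))
```

The `closure` set doubles as the visited set, so cycles of ε-transitions do not cause an infinite loop.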

 

We can now give a formal definition of the conversion of the NFA into a DFA. Let the NFA be (Q, Σ, I, F, Δ) according to Section 1.3.3. Then the DFA is defined as

$(\wp(Q), \Sigma, I_d, F_d, \delta)$

where

$I_d = E(I), \qquad F_d = \{\, S \in \wp(Q) : S \cap F \neq \emptyset \,\}$

and

$\delta(S, \sigma) = \bigcup_{s \in S,\ (s, \sigma, s') \in \Delta} E(s')$

that is, for every possible active state s of S we follow all the possible transitions to states s′ by the character σ and then follow all the possible ε-transitions from s′.

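This is the formal definition. I cannot follow the formulas and notation. In my experience it is not a big problem if you cannot follow this part, although it is better if you can.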

 

Since the DFA is built on the set of states of the NFA, its worst-case size is O(2^m) states, which is exponential. This makes the approach suitable for small regular expressions only. In practice, however, most of those states are not reachable from the initial state and therefore do not need to be built.

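Converting the NFA into a DFA gives a worst-case size of O(2^m) states, so it seems suited only to small regular expressions. In practice, however, many of those states are never reached.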

 

We now give an algorithm that obtains the DFA from the NFA by building only the reachable states. The algorithm uses sets of NFA states as identifiers for the DFA states. A simple way to represent these sets is to use a boolean array. Note that a bit-parallel representation is also possible, and it permits not only more compact storage but also faster handling of the set union and other required set operations. We give specific bit-parallel algorithms in Section 5.4. For now, we use just an abstract representation of the sets of states.

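Next comes an algorithm that builds the DFA from only the reachable states of the NFA. The algorithm uses sets of NFA states as identifiers for the DFA states. These sets can also be stored as bits, which saves space and makes bit-parallel algorithms possible.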
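As a rough illustration of the bit representation (not the bit-parallel algorithms of Section 5.4, which I have not read yet), a set of NFA states can be packed into one integer, with state i stored in bit i. The helper names `to_mask` and `to_set` are my own.

```python
def to_mask(states):
    """Pack a set of NFA state numbers into one integer, one bit per state."""
    mask = 0
    for s in states:
        mask |= 1 << s
    return mask


def to_set(mask):
    """Unpack an integer bit mask back into a set of state numbers."""
    return {i for i in range(mask.bit_length()) if (mask >> i) & 1}


a = to_mask({0, 2})      # binary 101
b = to_mask({1, 2})      # binary 110
finals = to_mask({2})

union = a | b                        # set union is a single OR
is_accepting = (a & finals) != 0     # non-empty intersection with the final states
print(bin(union), sorted(to_set(union)), is_accepting)
```

The union of two state sets becomes one OR instruction instead of a loop over a boolean array, which is where the speedup mentioned in the excerpt comes from.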

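The book then gives an algorithm for computing the ε-closure, which I omit here. Note that in an NFA produced by the Glushkov construction, E(s) = {s}.

After that it gives the DFA construction algorithm. It looks like the classic algorithm, so I will check whether my textbook has it.

Sure enough, the textbook has it. Presumably every classic textbook covers this method.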
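Since I skipped the book's pseudocode, here is my own minimal sketch of the textbook subset construction that builds only the reachable DFA states. It is not the book's figure. The NFA encoding (`delta` keyed by `(state, char)`, `eps` keyed by state) and all names are assumptions for illustration.

```python
from collections import deque


def closure(eps, states):
    """ε-closure of a whole set of NFA states."""
    result = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in result:
                result.add(t)
                stack.append(t)
    return frozenset(result)


def build_dfa(alphabet, delta, eps, start, finals):
    """Subset construction that only creates DFA states reachable from the start."""
    initial = closure(eps, {start})
    trans, accepting = {}, set()
    seen = {initial}
    queue = deque([initial])
    while queue:
        S = queue.popleft()
        if S & finals:
            accepting.add(S)
        for ch in alphabet:
            targets = set()
            for s in S:
                targets |= delta.get((s, ch), set())
            T = closure(eps, targets)
            if not T:
                continue  # no transition: treated as an implicit dead state
            trans[(S, ch)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    return initial, trans, accepting, seen


# Hypothetical ε-free NFA for (a|b)*ab with states 0, 1, 2.
DELTA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
initial, trans, accepting, states = build_dfa("ab", DELTA, {}, 0, {2})
print(len(states), "reachable DFA states out of", 2 ** 3, "possible subsets")
```

For this small NFA only the subsets {0}, {0, 1} and {0, 2} are ever created, which matches the remark that most of the 2^m subsets are unreachable in practice.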

 

This is not guaranteed in general. Different DFAs may exist to recognize the same language. Moreover, our construction does not guarantee that the result has the minimum size. To ensure this we have to minimize the DFA after we build it. Minimization of DFAs is a standard technique that can be found in a classical book such as [ASU86]. We content ourselves with the simple construction, which in most cases produces a DFA of reasonable size.

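The DFA produced by the construction is not necessarily minimal. It has to be minimized in a separate step. The benefit of minimizing goes without saying.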

 

Searching with the DFA. The point of building the DFA is to guarantee a linear search time of O(n). This is achievable because we need to cross exactly one transition per text character read. However, we need to modify the automaton in order to use it for text searching. The modification consists of adding a self-loop to the initial state of the NFA, which can be crossed by any character.

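The DFA is built to obtain linear search time. We need to add something to the automaton so that it works for text matching, namely a self-loop on the initial state.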

 

If the original automaton recognizes the language L(RE), then after this modification the automaton recognizes Σ*L(RE). Figure 5.13 shows the resulting DFA after adding the self-loop to the Glushkov NFA of Figure 5.7.


Figure 5.13: DFA obtained after adding an initial self-loop to the Glushkov automaton of Figure 5.7. It is equivalent to the regular expression (A|C|G|T)* (AT|GA) ((AG|AAA)*).

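This is similar to drawing state-transition diagrams in digital electronics, where you add various states that were not stated explicitly.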

 

The complete search algorithm is depicted in Figure 5.14. The total complexity is O(m^2 2^m + n) in the worst case. The extra space needed to represent the DFA is O(m 2^m) bits.
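Figure 5.14 is not reproduced in these notes, so the following is only my own sketch of the whole pipeline for an ε-free (Glushkov-style) NFA: add the self-loop, determinize, then scan the text with one transition per character. The small NFA for AT|GA and every function name here are assumptions. It is not the automaton of Figure 5.7.

```python
def add_self_loop(delta, start, alphabet):
    """Return a copy of delta in which start loops to itself on every character."""
    looped = {key: set(targets) for key, targets in delta.items()}
    for ch in alphabet:
        looped.setdefault((start, ch), set()).add(start)
    return looped


def determinize(delta, start, finals, alphabet):
    """Subset construction over reachable states (no ε-transitions assumed)."""
    initial = frozenset([start])
    trans, accepting = {}, set()
    todo, seen = [initial], {initial}
    while todo:
        S = todo.pop()
        if S & finals:
            accepting.add(S)
        for ch in alphabet:
            T = frozenset(t for s in S for t in delta.get((s, ch), ()))
            trans[(S, ch)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return initial, trans, accepting


def search(text, initial, trans, accepting):
    """Return the 0-based positions where an occurrence ends."""
    state, ends = initial, []
    for pos, ch in enumerate(text):
        # A character outside the alphabet only survives via the self-loop,
        # so it sends the DFA back to its initial state.
        state = trans.get((state, ch), initial)
        if state in accepting:
            ends.append(pos)
    return ends


# Hypothetical Glushkov-style NFA for AT|GA: 0 -A-> 1 -T-> 2 and 0 -G-> 3 -A-> 4.
ALPHABET = "ACGT"
DELTA = {(0, "A"): {1}, (1, "T"): {2}, (0, "G"): {3}, (3, "A"): {4}}
dfa = determinize(add_self_loop(DELTA, 0, ALPHABET), 0, {2, 4}, ALPHABET)
ends = search("CGATA", *dfa)
print(ends)
```

On the text CGATA the scan reports end positions 2 (for GA) and 3 (for AT). The loop in `search` does a constant amount of work per character, which is the O(n) part of the bound. The exponential term belongs entirely to `determinize`.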

 

5.3.3 A hybrid approach

The author does not go into detail on this method, so I skip it.
