Writing a Parser — Algorithms and Implementation
In my previous article (Writing a Parser — Getting Started), I discussed the basic architecture of a parser, some terminology, and when to go for a handwritten parser and when not to. I also discussed some high-level details of the different components of a parser and their requirements.
In this article, I will discuss each of these components in detail, along with the algorithms to follow, including some implementation details. This is a continuation of my previous article, so I would strongly recommend reading that before going through the rest of this article.
Character Reader / Input Reader
The main purpose of having an abstraction of a character reader is to have a character provider that is independent of the input type. That way, the lexer can simply rely on the peek()/consume() methods to provide the characters required to produce tokens.
However, along with the peek() method, another requirement arises with respect to the implementation of these methods. That is, in order to facilitate peek()/peek(k), the character reader should be able to keep track of the characters that were looked ahead without discarding them, and should be able to provide them again upon an invocation of another peek() or consume() method. In other words, the character reader should preserve the peeked characters in memory. There are two ways to achieve this:
- In-memory reader
- Stream reader
In-memory reader
Read the entire input at once and store it as a large character array/buffer. Maintain an index to point to the last consumed character position (say i). Then, when:

- peek(k) is called, return the (i+k)-th character from the buffer.
- consume(k) is called, return the (i+k)-th character from the buffer, and make i = i+k.
This is pretty simple to implement. But it consumes a lot of memory as the entire source-code is loaded into memory at once.
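To make this concrete, here is a minimal sketch of such a reader in Java. The class name InMemoryCharReader and the EOF sentinel are illustrative assumptions, not part of the original article:

public class InMemoryCharReader {

    private static final char EOF = (char) -1; // sentinel for end of input
    private final char[] buffer;               // the entire source, loaded up front
    private int i = -1;                        // index of the last consumed character

    public InMemoryCharReader(String source) {
        this.buffer = source.toCharArray();
    }

    // Look ahead k characters (1-based) without consuming them.
    public char peek(int k) {
        int pos = i + k;
        return pos < buffer.length ? buffer[pos] : EOF;
    }

    public char peek() {
        return peek(1);
    }

    // Consume the next character and advance the index.
    public char consume() {
        return i + 1 < buffer.length ? buffer[++i] : EOF;
    }
}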
Stream reader
Uses a ring-buffer (of capacity N) to hold only the peeked characters. Does not read the entire source code into memory. Maintains the buffer size (say n) and the starting position of the buffer (say i).
When peek(k) is called:
- If k < n, the k-th character is already in the buffer; return the k-th character from the buffer.
- If k ≥ n, the k-th character is not yet in the buffer. Read up to the k-th character and add the newly read characters to the buffer. Since n characters are already in the buffer, a total of k-n characters will be read. Then return the k-th character from the buffer.
When consume(k) is called, simply return the k-th element from the i-th location of the buffer, and update i to be the index of the returned element.
This approach is very memory efficient, as it keeps at most N characters in memory at a given time. However, the capacity of the buffer N needs to be fixed in advance, depending on how many characters the lexer would need to peek. e.g: If the lexer peeks 5 characters at most, then N can be set to 5 (or more). If you are unsure of how many characters the lexer would peek, it is possible to set some large value like 20, 100, or even 500, to be on the safe side, depending on the lexer implementation.
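A sketch of such a stream reader, under the same assumptions as the previous example; note that peek(k) here only works for k ≤ N (the ring-buffer capacity):

import java.io.IOException;
import java.io.Reader;

public class StreamCharReader {

    private static final char EOF = (char) -1;
    private final Reader input; // underlying character stream
    private final char[] ring;  // ring buffer holding only the peeked characters
    private int start = 0;      // index of the oldest buffered character
    private int count = 0;      // current number of buffered characters (n)

    public StreamCharReader(Reader input, int capacity) {
        this.input = input;
        this.ring = new char[capacity]; // capacity = N
    }

    // Look ahead k characters (1-based, k <= capacity) without consuming.
    public char peek(int k) throws IOException {
        while (count < k) { // read the missing k-n characters
            int c = input.read();
            if (c == -1) {
                return EOF;
            }
            ring[(start + count) % ring.length] = (char) c;
            count++;
        }
        return ring[(start + k - 1) % ring.length];
    }

    // Consume the next character, draining the buffer first.
    public char consume() throws IOException {
        if (count > 0) {
            char c = ring[start];
            start = (start + 1) % ring.length;
            count--;
            return c;
        }
        int c = input.read();
        return c == -1 ? EOF : (char) c;
    }
}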
The Lexer
A lexer typically is a finite-state machine. It starts with an initial state. Then, depending on the first character it encounters, the lexer changes its state and continues to operate/lex in that state until it reaches the end of that state. The way tokens are produced differs from state to state.
For example, consider the character sequence “Hello, world!” (including the double quotes). The lexer starts in a “default” mode, where a comma (,) is a single token, whitespace marks the end of a token, and the exclamation mark is an operator token. Once the lexer reaches the first double quote, it goes to a “string-literal” state, where commas, whitespaces, and all other characters (except double quotes) are also considered part of the token. This string-literal mode comes to an end once the lexer reaches the second double quote and completes processing that string literal. The lexer then emits this string literal as a token and goes back to the default mode.
The corresponding code for the above lexer would look like:
public class Lexer {

    public Token nextToken() {
        char nextChar = charReader.peek();
        switch (nextChar) {
            // A double quote starts a string literal.
            case '"':
                return processStringLiteral();
            // A digit starts a numeric literal.
            case '0':
            case '1':
            case '2':
            ...
            case '9':
                return processNumericLiteral();
        }
    }
}
Emitting a Token
The lexer plays two sub-roles during lexing/tokenizing. One is Scanning and the other is Evaluating. Scanning is the part we discussed in the previous section, where the lexer scans the input characters and determines the matching grammar rule. Then, depending on the content of the scanned characters, it emits a relevant token. A token not only consists of the character content but also has a type/kind associated with it. Generally, the tokens emitted by a lexer fall into four major categories:
Keywords — Predefined set of names (identifiers) that are known by the lexer and the parser.

Identifiers — Names whose values are dynamic.

Literals — String literals, numeric literals, etc. Values are dynamic.

Syntax tokens — Predefined set of tokens such as separators, arithmetic operators, etc. e.g: { , } , [ , + , - , ( , )
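As a sketch, these categories and the token structure could be modeled as follows. The names TokenKind and Token are illustrative assumptions, not from the article:

enum TokenKind {
    KEYWORD,         // predefined names known to the lexer and parser
    IDENTIFIER,      // names whose values are dynamic
    STRING_LITERAL,  // literals: values are dynamic
    NUMERIC_LITERAL,
    SYNTAX_TOKEN     // separators, operators, etc.
}

class Token {

    private final TokenKind kind; // the type/kind of the token
    private final String text;    // the scanned character content

    Token(TokenKind kind, String text) {
        this.kind = kind;
        this.text = text;
    }

    TokenKind kind() { return kind; }
    String text() { return text; }
}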
Lookahead
There are situations where the lexer cannot determine which lexer rule to process just by looking at the next character alone. One example in Java is the greater-than operator (>), the signed-right-shift operator (>>), and the unsigned-right-shift operator (>>>). When the lexer reaches the first ‘>’, it cannot simply determine which of the three tokens it is processing. In order to determine that, the lexer has to check the 2nd character. If the second character is not ‘>’, then it can terminate, assuming it's the greater-than operator. If the second character is also ‘>’, then it has to check the 3rd character as well.
As you may have realized, the number of characters to look ahead can vary depending on the situation. Thus the lexer supports an arbitrary lookahead. There is no hard-and-fast rule to decide the amount of lookahead to use in each situation; rather, it strictly depends on the lexical structure of the language, i.e: it depends on the grammar.
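A sketch of how the lexer could disambiguate these three operators using peek(k). It follows the reader API sketched earlier; the helper makeToken is an assumption:

// Called when peek(1) is '>'; decides between '>', '>>' and '>>>'.
Token processGreaterThan() {
    if (charReader.peek(2) != '>') {
        charReader.consume();                        // '>'
        return makeToken(TokenKind.SYNTAX_TOKEN, ">");
    }
    if (charReader.peek(3) != '>') {
        charReader.consume();                        // '>'
        charReader.consume();                        // '>'
        return makeToken(TokenKind.SYNTAX_TOKEN, ">>");
    }
    charReader.consume();
    charReader.consume();
    charReader.consume();
    return makeToken(TokenKind.SYNTAX_TOKEN, ">>>");
}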
Whitespace handling
As discussed in the previous article, there are many ways to treat whitespaces. However, if you are planning to go for an approach where whitespaces are attached to a token, then the next question would be: what's the best way to do that? Again, there are three approaches that can be followed when attaching whitespace to a token:
- Attach whitespace to the immediate next token — Every token will have leading whitespaces. Any whitespace at the end of the file will be attached to the EOF token.
- Attach whitespaces to the last processed token — Every token will have only trailing whitespaces. Needs a special start-of-file (SOF) token to capture the whitespaces at the start of a file.
- Split the whitespace and attach it to both the last processed token and the next immediate token — A token may have both leading and trailing whitespaces. Typically, the first newline after a token terminates the trailing whitespace (trailing whitespaces cannot have newlines). However, leading whitespaces can have newlines.
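As an illustration of the third approach, a token could carry its surrounding whitespace as leading and trailing trivia. This is only a sketch; the class and field names are assumptions:

// A token carrying its surrounding whitespace ("trivia").
class TokenWithTrivia {

    final String leadingWhitespace;  // may contain newlines
    final String text;               // the token content itself
    final String trailingWhitespace; // terminated by the first newline

    TokenWithTrivia(String leading, String text, String trailing) {
        this.leadingWhitespace = leading;
        this.text = text;
        this.trailingWhitespace = trailing;
    }
}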

The Parser
There are two basic approaches for parsing:
Top-down parsing — Parses the input by searching the formal grammar rules in a top-down manner. In other words, the high-level structure of the input is decided first, before parsing its children. For example, when parsing a top-level element of a file, the parser first decides the kind of element it is going to parse (e.g: whether it is a function definition or a variable definition). Only then does it start parsing the children of that top-level element. In top-down parsing, the left-most derivation of a grammar production is matched first. Thus, tokens are consumed and matched from left to right. LL (Left-to-right, Leftmost derivation) parsers are examples of top-down parsers.
Bottom-up parsing — The parser attempts to parse the leaf nodes first, and then, depending on these parsed nodes, it decides on the parent and parses it. In bottom-up parsing, the right-most derivation of a grammar production is matched first. LR (Left-to-right, Rightmost derivation) parsers are examples of bottom-up parsers.
Construction of the parse tree can be done in either a top-down or a bottom-up manner in both of these approaches.
Both LL and LR parsers are generally referred to with a number associated with them, like LL(1), LR(1), LL(k), etc. This number (generically ‘k’) is the number of lookahead tokens used by the parser (similar to the lexer). That is, taking top-down parsing for example, the parser needs to look ahead some tokens in order to determine which top-level node it should be parsing. In this article, I will focus on LL parsing, as LL parsers tend to be much simpler compared to bottom-up parsers.
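For example, a top-down parser might peek at the next token to decide which top-level node to parse. A sketch, where tokenReader, the parse methods, and the 'function' keyword are assumed for illustration:

// Decide the top-level construct from one token of lookahead.
Node parseTopLevelNode() {
    Token next = tokenReader.peek(1);
    // Assuming a language where the 'function' keyword introduces a function definition.
    if (next.kind() == TokenKind.KEYWORD && next.text().equals("function")) {
        return parseFunctionDefinition();
    }
    return parseVariableDefinition();
}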
LL(k) Recursive descent parsers
LL parsers need to check some tokens into the future to decide which grammar rule to process when there are alternative rules. Sometimes this number of tokens to look ahead is a constant. In such cases, the number is written as a numeric value. However, for most parsers, this value depends on the grammar rule and the current input (i.e: depends on the context) and cannot be pre-determined. Such parsers use an arbitrary k, which is denoted as LL(k) or LR(k).
A parser is said to be recursive descent in the sense that the implementation is a set of small functions, each of which parses a particular component of the grammar. These functions may depend on each other, thus creating a mutual recursion.
e.g: A parseExpression() method may call a parseBinaryExpression() method, if the currently processing expression is a binary expression (e.g: a+b). But since both the RHS and the LHS of the binary expression are again some kind of expression, parseBinaryExpression() will call parseExpression() within, creating a recursion. See below for a sample code:
Expression parseExpression() {
    ...
    if (binaryExpr) {
        parseBinaryExpression();
    }
    ...
}

Expression parseBinaryExpression() {
    lhsExpr = parseExpression();
    binaryOp = parseBinaryOp();
    rhsExpr = parseExpression();
    ...
}
Predictive vs Backtracking
Predictive parser — Looks at the tokens ahead in advance to decide which grammar rule to process.
Backtracking parser — Tries to parse an alternative rule. If the parser realizes that the selected grammar rule does not match the input, it goes back (backtracks) to the point where parsing started for that rule and starts trying the next alternative rule. Likewise, it tries all the alternatives and proceeds with the best one.
Predictive parsers do not try out all the alternative rules, hence they are faster. On the other hand, backtracking parsers try out all the alternatives, and hence tend to yield better results upon syntax errors.
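A backtracking parser additionally needs a way to mark a position in the token stream and reset to it when an alternative fails. A rough sketch, with all names (tokenReader.mark()/reset(), ParseException, the parse methods) assumed:

// Try alternatives in order, backtracking on failure.
Node parseFunction() {
    int marker = tokenReader.mark();    // remember the current position
    try {
        return parseFuncDef();          // first alternative
    } catch (ParseException e) {
        tokenReader.reset(marker);      // backtrack to the marked position
        return parseFuncDecl();         // next alternative
    }
}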
Grammar requirements
When implementing the parsing logic for a grammar-rule in an LL(k) recursive-descent predictive-parser, the grammar must satisfy the following:
1. Must be non-left-recursive
If the grammar rule is left-recursive, it must be rewritten in a way such that the left-recursion is eliminated. e.g:
binary-expr := expression binary-operator expression
expression := binary-expr | basic-literal | . . .
Here the binary-expr rule is left-recursive, as the left side of the production (the lhs of the binary-operator) can again be another binary-expr. So if we try to implement this, our parse functions would end up getting stuck in an infinite recursion. To avoid that, the left-recursion of the grammar rule must be eliminated. This can be done by rewriting the above rule as:
binary-expr := terminal-expression binary-operator expression
terminal-expression := basic-literal | var-ref | func-call | . . .
expression := binary-expr | terminal-expression | . . .
Here a terminal-expr is any expression that has higher precedence compared to a binary expression. i.e: any expression that can be a leaf node in the binary-tree of a binary expression.
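With the rewritten rule, the parse function for a binary expression starts with a terminal expression and therefore always makes progress before recursing. A sketch, where the types and helper methods are assumptions:

Expression parseBinaryExpression() {
    Expression lhsExpr = parseTerminalExpression(); // consumes at least one token
    BinaryOp binaryOp = parseBinaryOp();
    Expression rhsExpr = parseExpression();
    return new BinaryExpr(lhsExpr, binaryOp, rhsExpr);
}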
2. Must be linear
Consider the below grammar rule.
function := func-def | func-decl
func-def := functionSig ‘{‘ statements ‘}’
func-decl := functionSig ‘;’
Here both func-def and func-decl have a common component: functionSig. The problem with these kinds of rules is that, during parsing, the parser will first try to match the source to func-def, and if that fails, it will try matching it to func-decl. However, during this process, functionSig would be traversed twice. This can be avoided by rewriting the grammar rule in such a way that the common productions are factored out of the alternative rules. e.g:
function := functionSig (func-body | ‘;’)
func-body := ‘{‘ statements ‘}’
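With the factored rule, functionSig is traversed exactly once, and a single token of lookahead picks the alternative. A sketch with assumed helper and type names:

Node parseFunction() {
    FunctionSig sig = parseFunctionSig();       // traversed exactly once
    if (tokenReader.peek(1).text().equals(";")) {
        tokenReader.consume();                  // func-decl alternative
        return new FuncDecl(sig);
    }
    return new FuncDef(sig, parseFuncBody());   // func-def alternative
}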
Parser Table
In theory, you may have heard that LL(k) predictive parsers are based on a parse table, and that we would need to construct one. The parse table simply defines what token to expect and the next rule to follow, depending on the grammar. The parser looks up this table to determine what to parse next.
However, in practice, we don’t need a parser table at all! That may sound a bit controversial, but there are a few reasons for that:
- A parsing table needs to be constructed before parsing starts. This means the startup time of the parser increases.
- For complex grammars, this parse table can be extremely complex and large. I mean, huge!
- After every token, the parser has to look up the parse table to figure out what to parse next. This results in many table lookups, which may eventually impact the performance of the parser.
So, the alternative is to embed this parse-table information in the implementation itself. If I take the same previous example of parsing a binary expression:
Expression parseBinaryExpression() {
    lhsExpr = parseExpression();
    binaryOp = parseBinaryOp();
    rhsExpr = parseExpression();
    ...
}
Here, the expected tokens and their order are embedded in the logic of the parseBinaryExpression() method itself. It no longer needs the help of a parsing table.
In this article, I discussed the algorithms and some implementation details of the lexer and the parser. In my next article (Writing a Parser — Syntax Error Handling), I will discuss the details of handling syntax errors and the error recovery mechanisms.
Translated from: https://medium.com/swlh/writing-a-parser-algorithms-and-implementation-a7c40f46493d