lex & yacc series (2) --- An Introduction to Lex with Examples

This article covers the basic concepts of lexical analysis, shows how to perform it with the Lex tool, and walks through two examples in detail (recognizing words, and building a symbol table) to help readers understand the role of lexical analysis in compiler construction.


                           A question that sometimes drives me hazy--am I or the others crazy?——Einstein


Lex:

For a C program, the units are variable names, constants, strings, operators, punctuation, and so forth. This division into units (which are usually called tokens) is known as lexical analysis, or lexing for short.

Lex helps you by taking a set of descriptions of possible tokens and producing a C routine, which we call a lexical analyzer, or a lexer, or a scanner for short, that can identify those tokens. The set of descriptions you give to lex is called a lex specification.

The token descriptions that lex uses are known as regular expressions, extended versions of the familiar patterns used by the grep and egrep commands.
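As a rough illustration (these two rules are my own, not taken from the book), a lex token description is such a pattern paired with a C action; for example, integers and identifiers might be described like this:

[0-9]+                   { printf("integer: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*   { printf("identifier: %s\n", yytext); }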


Lex Example 1: Recognizing Words with Lex

(The full source code is attached at the end of this article.)

 

Analysis:

The xx.l file (see the full listing at the end of this article) is read by the lex tool, which generates a C function; the user calls that function to perform lexical analysis. The structure and syntax of an xx.l file were laid down by the designers of lex:

This first section, the definition section, introduces any initial C program code we want copied into the final program. This is especially important if, for example, we have header files that must be included for code later in the file to work. We surround the C code with the special delimiters “%{” and “%}.” Lex copies the material between “%{” and “%}” directly to the generated C file, so you may write any valid C code here.
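For instance (a fragment of my own, not part of this example), a definitions section that pulls in headers might look like this, where tokens.h is a hypothetical project header:

%{
#include <stdio.h>       /* copied verbatim into the generated lex.yy.c */
#include "tokens.h"      /* hypothetical project header */
int line_count = 0;      /* ordinary C declarations can go here too */
%}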

Outside of “%{” and “%}”, comments must be indented with whitespace for lex to recognize them correctly. We’ve seen some amazing bugs when people forgot to indent their comments and lex interpreted them as something else.

The %% marks the end of this section.

 

The next section is the rules section. Each rule is made up of two parts: a pattern and an action, separated by whitespace. The lexer that lex generates will execute the action when it recognizes the pattern. These patterns are UNIX-style regular expressions, a slightly extended version of the same expressions used by tools such as grep, sed, and ed. Chapter 6 describes all the rules for regular expressions.

The first rule in our example is the following:

 [\t ]+       /* ignore whitespace */ ;

The square brackets, "[]", indicate that any one of the characters within the brackets matches the pattern. For our example, we accept either "\t" (a tab character) or " " (a space). The "+" means that the pattern matches one or more consecutive copies of the subpattern that precedes the plus. Thus, this pattern describes whitespace (any combination of tabs and spaces). The second part of the rule, the action, is simply a semicolon, a do-nothing C statement. Its effect is to ignore the input.

The next set of rules uses the “|” (vertical bar) action. This is a special action that means to use the same action as the next pattern, so all of the verbs use the action specified for the last one.
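For reference, the verb rules being discussed look like this in the example's .l file (the full listing is at the end of this article):

is |
am |
did |
go      { printf("%s: is a verb\n", yytext); }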

The array yytext contains the text that matched the pattern. (In flex, yytext is by default a character pointer that points to the string just recognized by the lexer.)

 

The next rule:

 [a-zA-Z]+     { printf("%s: is not a verb\n", yytext); }

It doesn’t take long to realize that any word that matches any of the verbs listed in the earlier rules will match this rule as well. You might then wonder why it won’t execute both actions when it sees a verb in the list. And would both actions be executed when it sees the word “island,” since “island” starts with “is”? The answer is that lex has a set of simple disambiguating rules.

The two that make our lexer work are:
1. Lex patterns only match a given input character or string once.

2. Lex executes the action for the longest possible match for the current input. Thus, lex would see “island” as matching our all-inclusive rule because that was a longer match than “is.”
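As a quick check of rule 2, here is the behaviour I would expect from the finished lexer (my own reconstruction, not a transcript from the book):

is                       (typed by the user)
is: is a verb            (printed by the lexer)
island                   (typed by the user)
island: is not a verb    (printed by the lexer)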

 

The last line is the default case:

.|\n { ECHO;  /* normal default anyway */ }

 

The special character “.” (period) matches any single character other than a newline, and “\n” matches a newline character.

The special action ECHO prints the matched pattern on the output, copying any punctuation or other characters. Even though there is a default action for unmatched input characters, well-written lexers invariably have explicit rules to match all possible input.
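For the curious, ECHO is a macro in the generated lex.yy.c. Its exact text varies between lex and flex versions, but it is essentially a write of the matched text to the lexer's output stream yyout (stdout unless you reassign it), with yyleng holding the length of the matched text:

/* roughly what the ECHO action expands to (exact macro differs by version) */
fwrite(yytext, yyleng, 1, yyout);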

 

The end of the rules section is delimited by another %%.

 

The final section is the user subroutines section, which can consist of any legal C code. Lex copies it to the C file after the end of the lex generated code. We have included a main() program.

%%

main()      /* add a main() so the compiled program can be run directly */
{
    yylex();    /* call the yylex() generated by lex to start lexical analysis */
}

The lexer produced by lex is a C routine called yylex(), so we call it.  Unless the actions contain explicit return statements, yylex() won’t return until it has processed the entire input.
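If, on the other hand, you want yylex() to hand back one token per call, which is how yacc will drive the lexer later in this series, the actions can contain explicit return statements. A minimal sketch of the idea (my own fragment, not part of this example; the token code WORD is made up for illustration):

%option noyywrap
%{
#include <stdio.h>
#define WORD 1                     /* hypothetical token code */
%}
%%
[a-zA-Z]+   { return WORD; }       /* hand one token back per call */
.|\n        ;                      /* skip everything else */
%%
int main(void)
{
    while (yylex() != 0)           /* yylex() returns 0 at end of input */
        printf("token: %s\n", yytext);
    return 0;
}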

 

Compiling and Running

Install flex on Ubuntu: sudo apt-get install flex.

Then run flex xxx.l, which generates the lex.yy.c corresponding to that .l file. The generated C file contains the automatically produced yylex function: #define YY_DECL int yylex (void)

Then compile with gcc: gcc -o word lex.yy.c, which produces the executable word.

Then run ./word. Type something like "is go uuuxx" and press Enter; the line you typed is analyzed, and you can keep entering new lines, each followed by Enter. This is because yylex() contains a loop that keeps consuming input, and none of the actions in our .l file call return after a match. How does the typed input reach yylex()? Looking inside yylex() we find:

if ( ! yyin )
    yyin = stdin;
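So yylex() reads from the FILE pointer yyin and falls back to stdin when nothing has been assigned to it. That also shows how to lex a file instead of the keyboard: open the file and assign it to yyin before calling yylex(). A minimal sketch (my own, not from the book; the file name input.txt is made up), which could replace the main() in the user subroutines section:

#include <stdio.h>

extern FILE *yyin;        /* input stream used by the generated lexer */
extern int yylex(void);

int main(void)
{
    yyin = fopen("input.txt", "r");   /* hypothetical input file */
    if (!yyin) {
        perror("input.txt");
        return 1;
    }
    yylex();              /* now reads from the file instead of stdin */
    fclose(yyin);
    return 0;
}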

 

Note: gcc may report an error when linking: undefined reference to `yywrap'. Whether this happens depends on the flex version.

Workaround 1: link against flex's support library by adding -lfl (gcc -o word lex.yy.c -lfl), which supplies a default yywrap; however, the executable built this way did not run correctly for me.

Workaround 2: define your own yywrap function that simply returns 1, and put it in the first (definitions) section of xxx.l. (Tested; this works.)

int
yywrap(void)
{
    return 1;
}

Workaround 3: add %option noyywrap to the .l file. (Not tested yet.)

  Ref: https://blog.youkuaiyun.com/anzhuangguai/article/details/49124675

int yywrap(void)
    Called by yylex at end-of-file; the default yywrap always will return 1. If the application requires yylex to continue processing with another source of input, then the application can include a function yywrap, which associates another file with the external variable FILE *yyin and will return a value of zero.
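Following that description, a yywrap that returns 0 can splice in a second input file once the first runs out. A small sketch of the idea (my own, with a hypothetical file name), used in place of the always-return-1 version above:

#include <stdio.h>

extern FILE *yyin;

int yywrap(void)
{
    static int switched = 0;

    if (!switched) {
        switched = 1;
        yyin = fopen("more-input.txt", "r");   /* hypothetical second source */
        if (yyin)
            return 0;   /* 0 = keep scanning, now from the new yyin */
    }
    return 1;           /* 1 = really the end of all input */
}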


 

Lex Example 2: Symbol Tables

(The full source code is attached at the end of this article.)

It would be more convenient, though, if we could build a table of words as the lexer is running.

Example 2, part 1:

%{
/*
 * Word recognizer with a symbol table.
 */
enum {
    LOOKUP = 0, /* default - looking rather than defining. */
    VERB,
    ADJ,
    ADV,
    NOUN,
    PREP,
    PRON,
    CONJ
};

int state;
int add_word(int type, char *word);
int lookup_word(char *word);

int
yywrap(void)
{
    return 1;
}
%}

Analysis: An enum is defined to represent word types. The variable state is declared, along with the functions add_word and lookup_word.

 

Example 2, part 2:

%%
\n      { state = LOOKUP; }   /* end of line, return to default state */

 /* whenever a line starts with a reserved part of speech name start defining words of that type */
^verb   { state = VERB; }
^adj    { state = ADJ; }
^adv    { state = ADV; }
^noun   { state = NOUN; }
^prep   { state = PREP; }
^pron   { state = PRON; }
^conj   { state = CONJ; }

[a-zA-Z]+ {
    /* a normal word, define it or look it up */
    if (state != LOOKUP) {
        /* define the current word */
        add_word(state, yytext);
    } else {
        switch (lookup_word(yytext)) {
        case VERB: printf("%s: verb\n", yytext); break;
        case ADJ:  printf("%s: adjective\n", yytext); break;
        case ADV:  printf("%s: adverb\n", yytext); break;
        case NOUN: printf("%s: noun\n", yytext); break;
        case PREP: printf("%s: preposition\n", yytext); break;
        case PRON: printf("%s: pronoun\n", yytext); break;
        case CONJ: printf("%s: conjunction\n", yytext); break;
        default:
            printf("%s: don't recognize\n", yytext);
            break;
        }
    }
}

.       /* ignore anything else */ ;
%%

Analysis: When a line begins with a part-of-speech keyword such as verb or adj, the variable state is set to the corresponding enum value; the ordinary words that follow the keyword on that line are then defined with that type. For example, the line "verb vfg vgang" defines two words, both of type VERB.

The caret, "^", at the beginning of the pattern makes the pattern match only at the beginning of an input line. What about a line such as "verb adj vfg vgang"? Would adj be treated as a word of type VERB? It should be, because ^adj only matches at the start of a line.

When the end of a line is reached, state is set back to LOOKUP, so if a word appears directly on a new line (without a leading part-of-speech keyword), lookup_word is called to check whether that word has already been added to the table; if it is found, its type is printed.

 

Example 2, part 3:

main()
{
    yylex();
}

/* define a linked list of words and types */
struct word {
    char *word_name;
    int word_type;
    struct word *next;
};

struct word *word_list; /* first element in word list */

extern void *malloc();

int
add_word(int type, char *word)
{
    struct word *wp;

    if (lookup_word(word) != LOOKUP) {
        printf("!!! warning: word %s already defined \n", word);
        return 0;
    }

    /* word not there, allocate a new entry and link it on the list */
    wp = (struct word *) malloc(sizeof(struct word));
    wp->next = word_list;

    /* have to copy the word itself as well */
    wp->word_name = (char *) malloc(strlen(word) + 1);
    strcpy(wp->word_name, word);
    wp->word_type = type;
    word_list = wp;
    return 1; /* it worked */
}

int
lookup_word(char *word)
{
    struct word *wp = word_list;

    /* search down the list looking for the word */
    for (; wp; wp = wp->next) {
        if (strcmp(wp->word_name, word) == 0)
            return wp->word_type;
    }
    return LOOKUP; /* not found */
}

 

Analysis: These two functions add a word to the linked list and look a word up in it. A linear search like this is slow; in a real application, a faster structure such as a hash table can be used instead.
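As an aside, a hashed symbol table can keep exactly the same add_word/lookup_word interface. Below is a minimal sketch of the idea (my own, not part of the original example), assuming a small fixed-size table with chaining and that the LOOKUP value from the enum in the definitions section is visible:

#include <stdlib.h>
#include <string.h>

#define NHASH 211                     /* number of buckets, a small prime */

struct word {
    char *word_name;
    int   word_type;
    struct word *next;
};

static struct word *bucket[NHASH];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char) *s++;
    return h % NHASH;
}

int lookup_word(char *word)
{
    struct word *wp;

    for (wp = bucket[hash(word)]; wp; wp = wp->next)
        if (strcmp(wp->word_name, word) == 0)
            return wp->word_type;
    return LOOKUP;                    /* not found */
}

int add_word(int type, char *word)
{
    struct word *wp;
    unsigned h = hash(word);

    if (lookup_word(word) != LOOKUP)
        return 0;                     /* already defined */

    wp = malloc(sizeof *wp);
    wp->word_name = malloc(strlen(word) + 1);
    strcpy(wp->word_name, word);
    wp->word_type = type;
    wp->next = bucket[h];
    bucket[h] = wp;
    return 1;
}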

 

The program is compiled and run the same way as Example 1. A sample run: after building the program, start it and type some input.

(In the original screenshot, which is not reproduced here, bold characters are what the user typed and the regular text is the program's output.)
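Since the screenshot is not included, the following is roughly what such a session looks like, reconstructed from the rules above (the words themselves are arbitrary):

verb is am are               (typed; defines is, am, are as verbs)
is                           (typed on a new line)
is: verb                     (program output)
noun dog                     (typed; defines dog as a noun)
dog                          (typed)
dog: noun                    (program output)
banana                       (typed, never defined)
banana: don't recognize      (program output)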


The .l file for Example 1:

%{

/*
  sample example.
*/
int
yywrap(void)
{
    return 1;
}
%}
%%

[\t ]+  /*ignore white space*/;
is |
am |
did |
go {printf("%s: is a verb\n",yytext);}
[a-zA-Z]+  {printf("%s: is not a verb\n",yytext);}

.|\n {ECHO;}
%%
main()
{
  yylex();
}

 

The .l file for Example 2:

%{
/*
 * Word recognizer with a symbol table.
 */
enum {
 LOOKUP =0, /* default - looking rather than defining. */
 VERB,
 ADJ,
 ADV,
 NOUN,
 PREP,
 PRON,
 CONJ
};
int state;
int add_word(int type, char *word);
int lookup_word(char *word);

int
yywrap(void)
{
    return 1;
}

%}

%%
\n  { state = LOOKUP; } /* end of line, return to default state */

 /* whenever a line starts with a reserved part of speech name start defining words of that type */
^verb  { state = VERB; }
^adj  { state = ADJ; }
^adv  { state = ADV; }
^noun  { state = NOUN; }
^prep  { state = PREP; }
^pron  { state = PRON; }
^conj  { state = CONJ; }

[a-zA-Z]+ {
 	/* a normal word, define it or look it up */
 	if(state != LOOKUP) {
 	/* define the current word */
 		add_word(state, yytext);
 	} 
else {
	switch(lookup_word(yytext)) {
 		case VERB: printf("%s: verb\n", yytext); break;
 		case ADJ: printf("%s: adjective\n", yytext); break;
 		case ADV: printf("%s: adverb\n", yytext); break;
 		case NOUN: printf("%s: noun\n", yytext); break;
 		case PREP: printf("%s: preposition\n", yytext); break;
 		case PRON: printf("%s: pronoun\n", yytext); break;
 		case CONJ: printf("%s: conjunction\n", yytext); break;
 	default:
 		printf("%s: don't recognize\n", yytext);
 	break;
 	}
}
 }
.  /* ignore anything else */ ;
%%
main()
{
 	yylex();
}
/* define a linked list of words and types */
struct word {
 	char *word_name;
 	int word_type;
 	struct word *next;
};
struct word *word_list; /* first element in word list */
extern void *malloc() ;

int
add_word(int type, char *word)
{
	struct word *wp;
 	if(lookup_word(word) != LOOKUP) {
 		printf("!!! warning: word %s already defined \n", word);
 		return 0;
 	}
/* word not there, allocate a new entry and link it on the list */
 	wp = (struct word *) malloc(sizeof(struct word));
	wp->next = word_list;
 	/* have to copy the word itself as well */
	wp->word_name = (char *) malloc(strlen(word)+1);
 	strcpy(wp->word_name, word);
 	wp->word_type = type;
 	word_list = wp;
 	return 1; /* it worked */
}
int
lookup_word(char *word)
{
 	struct word *wp = word_list;
 	/* search down the list looking for the word */
 	for(; wp; wp = wp->next) {
 		if(strcmp(wp->word_name, word) == 0)
 			return wp->word_type;
 	}
 	return  LOOKUP; /* not found */
}

 

Ref:

"lex & yacc" (Second Edition), John R. Levine, Chapter 1
