lex & yacc series (2) --- An Introduction to Lex with Examples

This article covers the basic concepts of lexical analysis, shows how to perform it with the Lex tool, and walks through two examples in detail (recognizing words, and building a symbol table) to help readers understand the role of lexical analysis in compiler construction.


                           A question that sometimes drives me hazy--am I or the others crazy?——Einstein


Lex:

For a C program, the units are variable names, constants, strings, operators, punctuation, and so forth. This division into units (which are usually called tokens) is known as lexical analysis, or lexing for short.

Lex helps you by taking a set of descriptions of possible tokens and producing a C routine, which we call a lexical analyzer, or a lexer, or a scanner for short, that can identify those tokens. The set of descriptions you give to lex is called a lex specification.

The token descriptions that lex uses are known as regular expressions, extended versions of the familiar patterns used by the grep and egrep commands.
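As a rough illustration (these two rules are my own, not taken from the book), a lex token description is such a pattern paired with a C action; for example, integers and identifiers might be described like this:

[0-9]+                   { printf("integer: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*   { printf("identifier: %s\n", yytext); }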


Lex Example 1: Recognizing Words with Lex

(The full source code is attached at the end of this article.)

 

Analysis:

The xx.l file (see the full listing at the end of this article) is read by the lex tool, which generates a C function; the user calls that function to perform lexical analysis. The structure and syntax of an xx.l file were laid down by the designers of lex:

This first section, the definition section, introduces any initial C program code we want copied into the final program. This is especially important if, for example, we have header files that must be included for code later in the file to work. We surround the C code with the special delimiters “%{” and “%}.” Lex copies the material between “%{” and “%}” directly to the generated C file, so you may write any valid C code here.
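For instance (a fragment of my own, not part of this example), a definitions section that pulls in headers might look like this, where tokens.h is a hypothetical project header:

%{
#include <stdio.h>       /* copied verbatim into the generated lex.yy.c */
#include "tokens.h"      /* hypothetical project header */
int line_count = 0;      /* ordinary C declarations can go here too */
%}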

Outside of “%{” and “%}”, comments must be indented with whitespace for lex to recognize them correctly. We’ve seen some amazing bugs when people forgot to indent their comments and lex interpreted them as something else.

The %% marks the end of this section.

 

The next section is the rules section. Each rule is made up of two parts: a pattern and an action, separated by whitespace. The lexer that lex generates will execute the action when it recognizes the pattern. These patterns are UNIX-style regular expressions, a slightly extended version of the same expressions used by tools such as grep, sed, and ed. Chapter 6 describes all the rules for regular expressions.

The first rule in our example is the following:

 [\t ]+       /* ignore whitespace */ ;

The square brackets, "[]", indicate that any one of the characters within the brackets matches the pattern. For our example, we accept either "\t" (a tab character) or " " (a space). The "+" means that the pattern matches one or more consecutive copies of the subpattern that precedes the plus. Thus, this pattern describes whitespace (any combination of tabs and spaces). The second part of the rule, the action, is simply a semicolon, a do-nothing C statement. Its effect is to ignore the input.

The next set of rules uses the “|” (vertical bar) action. This is a special action that means to use the same action as the next pattern, so all of the verbs use the action specified for the last one.
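For reference, the verb rules being discussed look like this in the example's .l file (the full listing is at the end of this article):

is |
am |
did |
go      { printf("%s: is a verb\n", yytext); }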

The array yytext contains the text that matched the pattern. (In flex, yytext is by default a character pointer that points to the string just recognized by the lexer.)

 

The next rule:

 [a-zA-Z]+     { printf("%s: is not a verb\n", yytext); }

It doesn’t take long to realize that any word that matches any of the verbs listed in the earlier rules will match this rule as well. You might then wonder why it won’t execute both actions when it sees a verb in the list. And would both actions be executed when it sees the word “island,” since “island” starts with “is”? The answer is that lex has a set of simple disambiguating rules.

The two that make our lexer work are:
1. Lex patterns only match a given input character or string once.

2. Lex executes the action for the longest possible match for the current input. Thus, lex would see “island” as matching our all-inclusive rule because that was a longer match than “is.”
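As a quick check of rule 2, here is the behaviour I would expect from the finished lexer (my own reconstruction, not a transcript from the book):

is                       (typed by the user)
is: is a verb            (printed by the lexer)
island                   (typed by the user)
island: is not a verb    (printed by the lexer)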

 

The last line is the default case:

.|\n { ECHO;  /* normal default anyway */ }

 

The special character “.” (period) matches any single character other than a newline, and “\n” matches a newline character.

The special action ECHO prints the matched pattern on the output, copying any punctuation or other characters. Even though there is a default action for unmatched input characters, well-written lexers invariably have explicit rules to match all possible input.
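For the curious, ECHO is a macro in the generated lex.yy.c. Its exact text varies between lex and flex versions, but it is essentially a write of the matched text to the lexer's output stream yyout (stdout unless you reassign it), with yyleng holding the length of the matched text:

/* roughly what the ECHO action expands to (exact macro differs by version) */
fwrite(yytext, yyleng, 1, yyout);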

 

The end of the rules section is delimited by another %%.

 

The final section is the user subroutines section, which can consist of any legal C code. Lex copies it to the C file after the end of the lex generated code. We have included a main() program.

%%

main()      /* add a main() so the compiled program can be run directly */
{
    yylex();    /* call the yylex() generated by lex to start lexical analysis */
}

The lexer produced by lex is a C routine called yylex(), so we call it.  Unless the actions contain explicit return statements, yylex() won’t return until it has processed the entire input.
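If, on the other hand, you want yylex() to hand back one token per call, which is how yacc will drive the lexer later in this series, the actions can contain explicit return statements. A minimal sketch of the idea (my own fragment, not part of this example; the token code WORD is made up for illustration):

%option noyywrap
%{
#include <stdio.h>
#define WORD 1                     /* hypothetical token code */
%}
%%
[a-zA-Z]+   { return WORD; }       /* hand one token back per call */
.|\n        ;                      /* skip everything else */
%%
int main(void)
{
    while (yylex() != 0)           /* yylex() returns 0 at end of input */
        printf("token: %s\n", yytext);
    return 0;
}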

 

Compiling and Running

Install flex on Ubuntu: sudo apt-get install flex.

Then run flex xxx.l, which generates the lex.yy.c corresponding to that .l file. The generated C file contains the automatically produced yylex function: #define YY_DECL int yylex (void)

Then compile with gcc: gcc -o word lex.yy.c, which produces the executable word.

Then run ./word. Type something like "is go uuuxx" and press Enter; the line you typed is analyzed, and you can keep entering new lines, each followed by Enter. This is because yylex() contains a loop that keeps consuming input, and none of the actions in our .l file call return after a match. How does the typed input reach yylex()? Looking inside yylex() we find:

if ( ! yyin )
    yyin = stdin;
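So yylex() reads from the FILE pointer yyin and falls back to stdin when nothing has been assigned to it. That also shows how to lex a file instead of the keyboard: open the file and assign it to yyin before calling yylex(). A minimal sketch (my own, not from the book; the file name input.txt is made up), which could replace the main() in the user subroutines section:

#include <stdio.h>

extern FILE *yyin;        /* input stream used by the generated lexer */
extern int yylex(void);

int main(void)
{
    yyin = fopen("input.txt", "r");   /* hypothetical input file */
    if (!yyin) {
        perror("input.txt");
        return 1;
    }
    yylex();              /* now reads from the file instead of stdin */
    fclose(yyin);
    return 0;
}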

 

Note: gcc may report an error when linking: undefined reference to `yywrap'. Whether this happens depends on the flex version.

Workaround 1: link against flex's support library by adding -lfl (gcc -o word lex.yy.c -lfl), which supplies a default yywrap; however, the executable built this way did not run correctly for me.

Workaround 2: define your own yywrap function that simply returns 1, and put it in the first (definitions) section of xxx.l. (Tested; this works.)

int
yywrap(void)
{
    return 1;
}

Workaround 3: add %option noyywrap to the .l file. (Not tested yet.)

  Ref: https://blog.youkuaiyun.com/anzhuangguai/article/details/49124675

int yywrap(void)
    Called by yylex at end-of-file; the default yywrap always will return 1. If the application requires yylex to continue processing with another source of input, then the application can include a function yywrap, which associates another file with the external variable FILE *yyin and will return a value of zero.
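Following that description, a yywrap that returns 0 can splice in a second input file once the first runs out. A small sketch of the idea (my own, with a hypothetical file name), used in place of the always-return-1 version above:

#include <stdio.h>

extern FILE *yyin;

int yywrap(void)
{
    static int switched = 0;

    if (!switched) {
        switched = 1;
        yyin = fopen("more-input.txt", "r");   /* hypothetical second source */
        if (yyin)
            return 0;   /* 0 = keep scanning, now from the new yyin */
    }
    return 1;           /* 1 = really the end of all input */
}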


 

Lex Example 2: Symbol Tables

(The full source code is attached at the end of this article.)

It would be more convenient, though, if we could build a table of words as the lexer is running.

Example 2, part 1:

%{
/*
 * Word recognizer with a symbol table.
 */
enum {
    LOOKUP = 0, /* default - looking rather than defining. */
    VERB,
    ADJ,
    ADV,
    NOUN,
    PREP,
    PRON,
    CONJ
};

int state;
int add_word(int type, char *word);
int lookup_word(char *word);

int
yywrap(void)
{
    return 1;
}
%}

Analysis: An enum is defined to represent word types. The variable state is declared, along with the functions add_word and lookup_word.

 

Example 2, part 2:

%%
\n      { state = LOOKUP; }   /* end of line, return to default state */

 /* whenever a line starts with a reserved part of speech name start defining words of that type */
^verb   { state = VERB; }
^adj    { state = ADJ; }
^adv    { state = ADV; }
^noun   { state = NOUN; }
^prep   { state = PREP; }
^pron   { state = PRON; }
^conj   { state = CONJ; }

[a-zA-Z]+ {
    /* a normal word, define it or look it up */
    if (state != LOOKUP) {
        /* define the current word */
        add_word(state, yytext);
    } else {
        switch (lookup_word(yytext)) {
        case VERB: printf("%s: verb\n", yytext); break;
        case ADJ:  printf("%s: adjective\n", yytext); break;
        case ADV:  printf("%s: adverb\n", yytext); break;
        case NOUN: printf("%s: noun\n", yytext); break;
        case PREP: printf("%s: preposition\n", yytext); break;
        case PRON: printf("%s: pronoun\n", yytext); break;
        case CONJ: printf("%s: conjunction\n", yytext); break;
        default:
            printf("%s: don't recognize\n", yytext);
            break;
        }
    }
}

.       /* ignore anything else */ ;
%%

Analysis: When a line begins with a part-of-speech keyword such as verb or adj, the variable state is set to the corresponding enum value; the ordinary words that follow the keyword on that line are then defined with that type. For example, the line "verb vfg vgang" defines two words, both of type VERB.

The caret, "^", at the beginning of the pattern makes the pattern match only at the beginning of an input line. What about a line such as "verb adj vfg vgang"? Would adj be treated as a word of type VERB? It should be, because ^adj only matches at the start of a line.

When the end of a line is reached, state is set back to LOOKUP, so if a word appears directly on a new line (without a leading part-of-speech keyword), lookup_word is called to check whether that word has already been added to the table; if it is found, its type is printed.

 

Example 2, part 3:

main()
{
    yylex();
}

/* define a linked list of words and types */
struct word {
    char *word_name;
    int word_type;
    struct word *next;
};

struct word *word_list; /* first element in word list */

extern void *malloc();

int
add_word(int type, char *word)
{
    struct word *wp;

    if (lookup_word(word) != LOOKUP) {
        printf("!!! warning: word %s already defined \n", word);
        return 0;
    }

    /* word not there, allocate a new entry and link it on the list */
    wp = (struct word *) malloc(sizeof(struct word));
    wp->next = word_list;

    /* have to copy the word itself as well */
    wp->word_name = (char *) malloc(strlen(word) + 1);
    strcpy(wp->word_name, word);
    wp->word_type = type;
    word_list = wp;
    return 1; /* it worked */
}

int
lookup_word(char *word)
{
    struct word *wp = word_list;

    /* search down the list looking for the word */
    for (; wp; wp = wp->next) {
        if (strcmp(wp->word_name, word) == 0)
            return wp->word_type;
    }
    return LOOKUP; /* not found */
}

 

Analysis: These two functions add a word to the linked list and look a word up in it. A linear search like this is slow; in a real application, a faster structure such as a hash table can be used instead.
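As an aside, a hashed symbol table can keep exactly the same add_word/lookup_word interface. Below is a minimal sketch of the idea (my own, not part of the original example), assuming a small fixed-size table with chaining and that the LOOKUP value from the enum in the definitions section is visible:

#include <stdlib.h>
#include <string.h>

#define NHASH 211                     /* number of buckets, a small prime */

struct word {
    char *word_name;
    int   word_type;
    struct word *next;
};

static struct word *bucket[NHASH];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char) *s++;
    return h % NHASH;
}

int lookup_word(char *word)
{
    struct word *wp;

    for (wp = bucket[hash(word)]; wp; wp = wp->next)
        if (strcmp(wp->word_name, word) == 0)
            return wp->word_type;
    return LOOKUP;                    /* not found */
}

int add_word(int type, char *word)
{
    struct word *wp;
    unsigned h = hash(word);

    if (lookup_word(word) != LOOKUP)
        return 0;                     /* already defined */

    wp = malloc(sizeof *wp);
    wp->word_name = malloc(strlen(word) + 1);
    strcpy(wp->word_name, word);
    wp->word_type = type;
    wp->next = bucket[h];
    bucket[h] = wp;
    return 1;
}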

 

The program is compiled and run the same way as Example 1. A sample run: after building the program, start it and type some input.

(In the original screenshot, which is not reproduced here, bold characters are what the user typed and the regular text is the program's output.)
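Since the screenshot is not included, the following is roughly what such a session looks like, reconstructed from the rules above (the words themselves are arbitrary):

verb is am are               (typed; defines is, am, are as verbs)
is                           (typed on a new line)
is: verb                     (program output)
noun dog                     (typed; defines dog as a noun)
dog                          (typed)
dog: noun                    (program output)
banana                       (typed, never defined)
banana: don't recognize      (program output)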


The .l file for Example 1:

%{

/*
  sample example.
*/
int
yywrap(void)
{
    return 1;
}
%}
%%

[\t ]+  /*ignore white space*/;
is |
am |
did |
go {printf("%s: is a verb\n",yytext);}
[a-zA-Z]+  {printf("%s: is not a verb\n",yytext);}

.|\n {ECHO;}
%%
main()
{
  yylex();
}

 

The .l file for Example 2:

%{
/*
 * Word recognizer with a symbol table.
 */
enum {
 LOOKUP =0, /* default - looking rather than defining. */
 VERB,
 ADJ,
 ADV,
 NOUN,
 PREP,
 PRON,
 CONJ
};
int state;
int add_word(int type, char *word);
int lookup_word(char *word);

int
yywrap(void)
{
    return 1;
}

%}

%%
\n  { state = LOOKUP; } /* end of line, return to default state */

 /* whenever a line starts with a reserved part of speech name start defining words of that type */
^verb  { state = VERB; }
^adj  { state = ADJ; }
^adv  { state = ADV; }
^noun  { state = NOUN; }
^prep  { state = PREP; }
^pron  { state = PRON; }
^conj  { state = CONJ; }

[a-zA-Z]+ {
 	/* a normal word, define it or look it up */
 	if(state != LOOKUP) {
 	/* define the current word */
 		add_word(state, yytext);
 	} 
else {
	switch(lookup_word(yytext)) {
 		case VERB: printf("%s: verb\n", yytext); break;
 		case ADJ: printf("%s: adjective\n", yytext); break;
 		case ADV: printf("%s: adverb\n", yytext); break;
 		case NOUN: printf("%s: noun\n", yytext); break;
 		case PREP: printf("%s: preposition\n", yytext); break;
 		case PRON: printf("%s: pronoun\n", yytext); break;
 		case CONJ: printf("%s: conjunction\n", yytext); break;
 	default:
 		printf("%s: don't recognize\n", yytext);
 	break;
 	}
}
 }
.  /* ignore anything else */ ;
%%
main()
{
 	yylex();
}
/* define a linked list of words and types */
struct word {
 	char *word_name;
 	int word_type;
 	struct word *next;
};
struct word *word_list; /* first element in word list */
extern void *malloc() ;

int
add_word(int type, char *word)
{
	struct word *wp;
 	if(lookup_word(word) != LOOKUP) {
 		printf("!!! warning: word %s already defined \n", word);
 		return 0;
 	}
/* word not there, allocate a new entry and link it on the list */
 	wp = (struct word *) malloc(sizeof(struct word));
	wp->next = word_list;
 	/* have to copy the word itself as well */
	wp->word_name = (char *) malloc(strlen(word)+1);
 	strcpy(wp->word_name, word);
 	wp->word_type = type;
 	word_list = wp;
 	return 1; /* it worked */
}
int
lookup_word(char *word)
{
 	struct word *wp = word_list;
 	/* search down the list looking for the word */
 	for(; wp; wp = wp->next) {
 		if(strcmp(wp->word_name, word) == 0)
 			return wp->word_type;
 	}
 	return  LOOKUP; /* not found */
}

 

Ref:

"lex & yacc" (Second Edition), John R. Levine, Chapter 1
