Lexical Analyzer

Latest recommended article published 2022-03-02 17:40:37


License: CC 4.0 BY-SA


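This post introduces the function of a lexical analyzer: it breaks a source program into word symbols (tokens), including keywords, identifiers, constants, operators, and delimiters. The experiment takes source code as input and outputs a token table and an error table. The implementation follows a state transition diagram, and the program source is contained in main.py and source.txt. The summary notes that the symbols of a programming language have fixed meanings and are context-independent.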

Experiment Overview

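The job of a lexical analyzer is to take a source program as input and break it into a sequence of word symbols (tokens) according to the language's word-formation rules.
A word is the smallest unit of the language that carries independent meaning. Words include keywords, identifiers, operators, delimiters, and constants.
(1) Keywords are identifiers with a fixed meaning defined by the programming language. For example, begin, end, if, and while in Pascal are all reserved words. These words are generally not used as ordinary identifiers.
(2) Identifiers denote names of all kinds, such as variable names, array names, and procedure names.
(3) Constants generally come in integer, real, Boolean, and text types.
(4) Operators, such as +, -, *, and /.
(5) Delimiters, such as commas, semicolons, and parentheses.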
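The five categories above can be sketched as a small classification function. This is only an illustration: the name `classify`, the three sets, and the category labels are assumptions made for this example, and the function expects a lexeme that has already been cut out of the source text.

```python
# Assumed sets for the illustration; a real language defines many more entries.
KEYWORDS = {"int", "while"}
OPERATORS = {"+", "-", "*", "/", "=", "<", ">", "<=", ">=", "++"}
DELIMITERS = {",", ";", "(", ")", "{", "}"}


def classify(lexeme):
    """Return the category name of a single, already isolated lexeme."""
    if not lexeme:
        return "unknown"
    if lexeme in KEYWORDS:
        return "keyword"
    if lexeme in OPERATORS:
        return "operator"
    if lexeme in DELIMITERS:
        return "delimiter"
    if lexeme.isascii() and lexeme.isdigit():
        return "constant"
    if lexeme.isascii() and lexeme[0].isalpha() and lexeme.isalnum():
        return "identifier"
    # Anything else, such as a word that starts with a digit, is not a valid token.
    return "unknown"


for lexeme in ["while", "a", "10", "<=", ";", "10a"]:
    print(lexeme, "->", classify(lexeme))
```

Keywords are checked before identifiers because every keyword also matches the identifier pattern.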

Experiment Task

Input: a given piece of source code

int a = 1;
while(a <= 10a){
    a++;
}

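Output: a token table and an error table. Both are lists of triples (line number, word, type). The token table records the recognized words; the error table records the errors found in the source code.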
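To make the triple format concrete, here is a sketch of how the two tables could look for the first two lines of the sample input. The `Entry` type and the English category labels are assumptions chosen for the illustration, and the entries assume a lexer that reports `10a` as one malformed word.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    line: int  # 1-based line number in the source file
    word: str  # the word exactly as it appears in the source
    kind: str  # token category, or an error description in the error table


# Entries for "int a = 1;" and "while(a <= 10a){"
tokens = [
    Entry(1, "int", "keyword"),
    Entry(1, "a", "identifier"),
    Entry(1, "=", "operator"),
    Entry(1, "1", "constant"),
    Entry(1, ";", "separator"),
    Entry(2, "while", "keyword"),
    Entry(2, "(", "separator"),
    Entry(2, "a", "identifier"),
    Entry(2, "<=", "operator"),
    Entry(2, ")", "separator"),
    Entry(2, "{", "separator"),
]
errors = [
    Entry(2, "10a", "identifier cannot start with a digit"),
]

for entry in tokens + errors:
    print(tuple(entry))
```

A named tuple still behaves like a plain triple, so it can be printed or compared in the same way as the tuples used in main.py.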

Implementation Principle

State transition diagram for the C language subset
[Figure: state transition diagram]
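A common way to turn a state transition diagram into code is a transition table. The sketch below is an assumption about a small part of such a diagram: it covers only words that start with a letter or a digit. The state names, `char_class`, and `scan_word` are hypothetical and are not taken from main.py.

```python
def char_class(c):
    """Map a character to the input class used by the transition table."""
    if c.isascii() and c.isalpha():
        return "letter"
    if c.isascii() and c.isdigit():
        return "digit"
    return "other"


# TRANSITIONS[state][character class] -> next state.
# A missing entry means the automaton stops in the current state.
TRANSITIONS = {
    "start":      {"letter": "identifier", "digit": "number"},
    "identifier": {"letter": "identifier", "digit": "identifier"},
    "number":     {"digit": "number", "letter": "bad_number"},
    "bad_number": {"letter": "bad_number", "digit": "bad_number"},
}


def scan_word(text, start=0):
    """Run the automaton from text[start] and return (final_state, lexeme)."""
    state = "start"
    pos = start
    while pos < len(text):
        next_state = TRANSITIONS[state].get(char_class(text[pos]))
        if next_state is None:
            break  # no edge for this character: the word ends here
        state = next_state
        pos += 1
    return state, text[start:pos]


print(scan_word("while(a"))  # ('identifier', 'while')
print(scan_word("10a){"))    # ('bad_number', '10a')
print(scan_word("10;"))      # ('number', '10')
```

The state in which the automaton stops decides whether the lexeme goes into the token table or the error table.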

Program Source Code

source.txt holds the source program to be analyzed

int a = 1;
while(a <= 10a){
    a++;
}

main.py holds the analyzer code

def is_alphabet(c):
    # Return True if c is an ASCII letter
    return 'a' <= c <= 'z' or 'A' <= c <= 'Z'


def is_digit(c):
    # Return True if c is a decimal digit
    return '0' <= c <= '9'


def is_operator(c):
    # Return True if c is one of the operator characters
    operator = [
        '+',
        '-',
        '*',
        '/',
        '=',
        '<',
        '>',
    ]
    return c in operator


def is_separator(c):
    # Return True if c is one of the separator (delimiter) characters
    separator = [
        '(',
        ')',
        '{',
        '}',
        ';',
    ]
    return c in separator


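def process_word(n, s):
    # A word is a keyword if it is in the keyword list, otherwise an identifier
    key_words = ['int', 'while']
    if s in key_words:
        return n, s, 'keyword'
    else:
        return n, s, 'identifier'


def process_digit(n, s):
    # A run of digits is a constant
    return n, s, 'constant'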


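if __name__ == '__main__':
    result = list()
    errors = list()
    line_number = 1
    # Read the whole file first so that it is closed before the analysis starts
    with open('source.txt', 'r', encoding='utf-8') as source:
        lines = source.readlines()
    for line in lines:
        index = 0
        while index < len(line):
            # Fetch the next character
            char = line[index]
            token = [char, ]
            index += 1  # index now points at the first unread character
            if is_alphabet(char):
                # Starts with a letter: collect the letters and digits that follow
                while index < len(line) and (is_alphabet(line[index]) or is_digit(line[index])):
                    token.append(line[index])
                    index += 1
                # Classify the word as a keyword or an identifier
                result.append(process_word(line_number, ''.join(token)))
            elif is_digit(char):
                flag = True
                # Starts with a digit: collect digits; a letter makes the word invalid
                while index < len(line) and (is_digit(line[index]) or is_alphabet(line[index])):
                    if is_alphabet(line[index]):
                        flag = False
                    token.append(line[index])
                    index += 1
                if flag:
                    result.append(process_digit(line_number, ''.join(token)))
                else:
                    errors.append((line_number, ''.join(token), 'identifier cannot start with a digit'))
            elif is_separator(char):
                # Separator
                result.append((line_number, char, 'separator'))
            elif is_operator(char):
                # Two adjacent operator characters form one operator, e.g. <= or ++
                if index < len(line) and is_operator(line[index]):
                    token.append(line[index])
                    index += 1
                result.append((line_number, ''.join(token), 'operator'))
            elif char.strip() != '':
                # Any other non-blank character is outside the language subset
                errors.append((line_number, char, 'illegal character'))
        line_number += 1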

    print('==================token table=====================')
    for item in result:
        print(item)
    print('==================error table=====================')
    for item in errors:
        print(item)
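As a cross-check for a hand-written scanner like the one above, the same token classes can also be described with regular expressions. The following is a sketch of that alternative technique, not the method used in main.py. `TOKEN_SPEC`, `tokenize`, and the labels `malformed` and `illegal` are names assumed for this example. It takes the source text as a string, so it can be tried without source.txt.

```python
import re

# Order matters: the malformed-word pattern must be tried before the constant pattern.
TOKEN_SPEC = [
    ("malformed", r"[0-9]+[A-Za-z][A-Za-z0-9]*"),  # digits followed by letters, e.g. 10a
    ("constant",  r"[0-9]+"),
    ("word",      r"[A-Za-z][A-Za-z0-9]*"),
    ("operator",  r"[+\-*/=<>]{1,2}"),              # one or two operator characters
    ("separator", r"[(){};]"),
    ("skip",      r"\s+"),
    ("illegal",   r"."),                            # any other single character
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))
KEYWORDS = {"int", "while"}


def tokenize(text):
    """Return (tokens, errors), each a list of (line number, word, type) triples."""
    tokens, errors = [], []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in TOKEN_RE.finditer(line):
            kind, word = match.lastgroup, match.group()
            if kind == "skip":
                continue
            if kind == "word":
                kind = "keyword" if word in KEYWORDS else "identifier"
            if kind in ("malformed", "illegal"):
                errors.append((line_number, word, kind))
            else:
                tokens.append((line_number, word, kind))
    return tokens, errors


sample = "int a = 1;\nwhile(a <= 10a){\n    a++;\n}\n"
tokens, errors = tokenize(sample)
for item in tokens:
    print(item)
print(errors)
```

Running both versions on the same input and comparing the two tables is a quick way to spot skipped characters or index errors in the hand-written loop.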