A Simple Lexical Analyzer in Standard, Pure C++ (Part 1)

lonelyforest

License: CC 4.0 BY-SA

Categories: 技术讨论 (Technical Discussion), C/C++. Tags: c++ string file insert list interface

Article link: https://blog.youkuaiyun.com/lonelyforest/article/details/642521

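This article presents the design and part of the implementation of a lexical analyzer written in C++. It caches the file contents in a vector to speed up reading and uses a state machine to extract identifiers.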

I. Approach:

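Use a vector<string> container to buffer the file contents for better efficiency. My earlier versions always used getline(FILE*, ...) or something similar; either way they had to go back to the disk and read again and again, which is unlikely to be efficient. The idea comes mainly from C++ Primer, 3rd edition, which has a text-processing example that works this way.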

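Once the file is in memory, splitting it into individual characters is of course much simpler.
A state machine is then used to extract identifiers. The advantage of this approach is that it is easy to extend or modify as your needs change, and the logic stays clear.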
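Since the identifier-extraction code itself is left for a later part, here is a minimal sketch of what such a state machine could look like. It is not the author's implementation: the names State, START, IN_ID and extractIdentifiers are hypothetical, and the sketch assumes identifiers of the form [A-Za-z_][A-Za-z0-9_]* and simply skips everything else (numbers, operators, whitespace).

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical two-state scanner (names are illustrative only).
// START: skipping characters that cannot begin an identifier.
// IN_ID: accumulating the characters of one identifier.
enum State { START, IN_ID };

std::vector<std::string> extractIdentifiers(const std::string& text)
{
    std::vector<std::string> ids;
    std::string current;
    State state = START;

    // Iterate one step past the end: the end of input is treated as one extra
    // non-identifier character so a trailing identifier is flushed as well.
    for (std::string::size_type i = 0; i <= text.size(); ++i)
    {
        const int c = (i < text.size()) ? static_cast<unsigned char>(text[i]) : 0;

        switch (state)
        {
        case START:
            if (std::isalpha(c) || c == '_')
            {
                current.assign(1, static_cast<char>(c));
                state = IN_ID;
            }
            break;

        case IN_ID:
            if (std::isalnum(c) || c == '_')
            {
                current += static_cast<char>(c);
            }
            else
            {
                ids.push_back(current);   // identifier is complete
                state = START;
            }
            break;
        }
    }
    return ids;
}
```

Calling extractIdentifiers("x1 = y_2 + 42;") yields the two identifiers x1 and y_2. Supporting another token kind, such as numbers, would mean adding one more state and its transitions, which is the extensibility the paragraph above refers to.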
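II. Implementation (partial):
1. The code is written in standard, pure C++, so it should also build on Unix, although I have not tested that. The following headers are included: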
#include <cstdio>      // sprintf and EOF, both used by getNextChar()
#include <fstream>
#include <vector>
#include <string>
#include <iostream>
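Nothing specific to Linux, Windows, or any other operating system is used.
2. The Tokenizer class definition. Its main job is to extract individual characters from the file.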

/**:    class Tokenizer
*    Reads the source file straight into std::vector<std::string> lines_of_source
*    to save time on later operations, at the cost of memory!
* Key interfaces: char getNextChar() & void unGetNextChar()
*    getNextChar() extracts one character from the source:
*        return lines_of_source[lineno_-1][linepos++];
*    unGetNextChar() pushes one character back.
*/
class Tokenizer {
public:
    Tokenizer(const std::string& filename);    // use filename.c_str()
    virtual    ~Tokenizer();

    char getNextChar();                // ...primary interface to scanner...
    void unGetNextChar();

    bool    is_good() const { return is_good_; } // very important
    std::vector<std::string>::size_type    lineno() const // line numbers start at 1
    {    return lineno_; }

protected:
    // interface for inserting trace messages about the source
    void    insert_list(const std::string& msg);
    void    insert_list(const char*    msg); // mainly used to build the file produced by lexical analysis

    // read the file and store it in a vector<string>
    void    read_file(const char* filename);
    void    store_file(std::ifstream& is);

    // I have not found a good way to handle "..." style (variadic) arguments,
    // so msg_temp[] is borrowed as the buffer for sprintf calls
    char msg_temp[512];
    std::string    source_name;    // record the source file name
    std::vector<std::string>    list_msg_;    // trace source

    bool    is_good_;    // helps test whether everything is working properly!
private:
    // source file
    std::vector<std::string>    lines_of_source;
    std::vector<std::string>::size_type    lineno_;    // record source file line no

    int        bufsize;    // length of the current line
    int        linepos;    // current position within the line
};
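The design borrows many ideas and techniques from the book Compiler Construction: Principles and Practice (《编译原理及实践》). Much of it could be left out;
I added it so that a cleanly formatted list file can be generated.
3. Tokenizer class implementation (partial):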

The following two functions read the file and store it in lines_of_source.
/**:    read_file & store_file
*    author:    lonelyforest;
*    date:    2006.03.16
*/
//-----------------------------------------------------------------------------
void Tokenizer::read_file(const char *filename)
{
    std::ifstream infile(filename);

    if ( !infile )
    {
        is_good_ = false;

        // outputMsg() is not shown in this part of the article
        std::string    errMsg = "Failed to open source file!   ";
        errMsg += filename;
        outputMsg(-1, errMsg.c_str());
        outputMsg(-1, "Compilation must stop; check whether the file exists!");

        insert_list(errMsg);
        insert_list("\nCompilation must stop; check whether the file exists!\n");
    }
    else
    {
        store_file(infile);
    }
}

void Tokenizer::store_file(std::ifstream &is)
{
    std::string linetext;
    while(std::getline(is, linetext ))
    {
        linetext += "\n";    // getline strips the newline; put it back
        lines_of_source.push_back(linetext);
    }
}
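The buffering step can also be tried on its own, without the rest of the class. The following is a minimal sketch under that assumption: loadLines and loadLinesFromString are hypothetical helper names, not part of the Tokenizer above, and the code takes a generic std::istream so it can be fed from a string as well as from a file.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read every line of `is` into memory, restoring the
// '\n' that std::getline strips, so each stored line ends with a newline.
std::vector<std::string> loadLines(std::istream& is)
{
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(is, line))
    {
        line += '\n';
        lines.push_back(line);
    }
    return lines;
}

// Hypothetical convenience wrapper for in-memory text, handy for experiments.
std::vector<std::string> loadLinesFromString(const std::string& text)
{
    std::istringstream is(text);
    return loadLines(is);
}
```

For the input "int a;\nint b;" this produces two stored lines, each ending in '\n', even though the last input line had no trailing newline. An std::ifstream can be passed to loadLines directly because it derives from std::istream.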

Below are the implementations of the two main interface functions.
Line numbers start at 1 so that users are not confused.
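char Tokenizer::getNextChar()
{
    if ( !(linepos < bufsize ))
    {
        ++lineno_;    // now, lineno_ starts at 1 !!

        if ( lineno_ > lines_of_source.size())
        {
            // Note: EOF is an int and is narrowed to char here, so callers
            // should compare the result against static_cast<char>(EOF).
            return EOF;
        }

        linepos = 0;
        bufsize = static_cast<int>(lines_of_source[lineno_ -1].length());

        if ( EchoSource )
        {      // record the source line in the list file
            // %.490s bounds the copy so a long line cannot overflow msg_temp[512];
            // the cast makes the argument match the %4d conversion.
            sprintf(msg_temp, "%4d: %.490s", static_cast<int>(lineno_), lines_of_source[lineno_-1].c_str());
            insert_list(msg_temp);
        }
    }

    return lines_of_source[lineno_-1][linepos++];
}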

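// Push back one character.
void Tokenizer::unGetNextChar()
{
    // Do nothing after EOF (lineno_ is past the last line) or at the start
    // of a line; otherwise the next getNextChar() would index out of range.
    if ( lineno_ <= lines_of_source.size() && linepos > 0 )
    {
        linepos--;
    }
}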
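To show why a scanner wants this get/unget pair, here is a small self-contained sketch of one-character lookahead over buffered lines. It is an illustration rather than the author's class: LineReader, get, unget and readDigits are hypothetical names, and get returns int instead of char so that EOF stays distinct from every real character.

```cpp
#include <cctype>
#include <cstdio>    // EOF
#include <string>
#include <vector>

// Hypothetical reader over lines that are already buffered in memory.
class LineReader {
public:
    explicit LineReader(const std::vector<std::string>& lines)
        : lines_(lines), line_(0), pos_(0) {}

    // Returns the next character, or EOF once every line is consumed.
    int get()
    {
        // Move on to the next line once the current one is exhausted.
        while (line_ < lines_.size() && pos_ >= lines_[line_].size())
        {
            ++line_;
            pos_ = 0;
        }
        if (line_ >= lines_.size())
            return EOF;
        return static_cast<unsigned char>(lines_[line_][pos_++]);
    }

    // Pushes back the most recently read character of the current line.
    // It is a no-op after EOF or when nothing was read from this line yet.
    void unget()
    {
        if (line_ < lines_.size() && pos_ > 0)
            --pos_;
    }

private:
    std::vector<std::string> lines_;
    std::vector<std::string>::size_type line_;   // index of the current line
    std::string::size_type pos_;                 // next position in that line
};

// Reads a run of decimal digits. The first non-digit has to be read to know
// that the run ended, so it is pushed back for the next consumer.
std::string readDigits(LineReader& in)
{
    std::string digits;
    int c;
    while ((c = in.get()) != EOF && std::isdigit(c))
        digits += static_cast<char>(c);
    if (c != EOF)
        in.unget();
    return digits;
}
```

With the single buffered line "42+7\n", readDigits returns "42", the next get() returns '+', and a second readDigits returns "7". Without unget() the '+' would have been swallowed while detecting the end of the first number.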

The main part, the implementation of identifier analysis, is to be continued ......