Goal: store the file name and line/column information in each token, so that the lexer and parser can emit more detailed messages. This is a big help when debugging your analyzer.
Approach: I remembered that Boost.Spirit provides a file_iterator class and a position_iterator class. A closer look confirmed that they do satisfy the iterator requirements of lexertl's match_results class. Good, let's write a few lines of code to verify it.
#include "lexertl/generator.hpp"
#include "lexertl/lookup.hpp"
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"
#include <boost/spirit/home/classic/iterator/file_iterator.hpp>
#include <boost/spirit/home/classic/iterator/position_iterator.hpp>
#include <iostream>
#include <string>

// Use the file_iterator and position_iterator defined in Boost.Spirit directly.
namespace SPIRIT_CLASSIC = boost::spirit::classic;
typedef SPIRIT_CLASSIC::file_iterator<char> file_iterator_type;
typedef SPIRIT_CLASSIC::position_iterator2<file_iterator_type> position_iterator_type;

int main()
{
    try
    {
        lexertl::rules rules_;
        lexertl::state_machine state_machine_;

        rules_.add("[0-9]+", 1);
        rules_.add("[a-zA-Z]+", 2);
        lexertl::generator::build(rules_, state_machine_);

        // Pass the file name to the file_iterator.
        file_iterator_type iterFile("test.txt");

        if (!iterFile)
        {
            std::cout << "Failed to open file test.txt!" << std::endl;
            return -1;
        }

        // lexertl::lookup takes two iterators marking the start and end of the
        // input. The two iterators we construct carry not only the file
        // contents but also line/column information.
        position_iterator_type iterBegin(iterFile, iterFile.make_end()); // start of input
        position_iterator_type iterEnd;                                  // end of input

        // lexertl takes care of the rest.
        lexertl::match_results<position_iterator_type> results_(iterBegin, iterEnd);

        std::cout << "Start parsing file test.txt" << std::endl;

        do
        {
            // Print the token information.
            lexertl::lookup(state_machine_, results_);

            SPIRIT_CLASSIC::file_position posStart = results_.start.get_position();
            SPIRIT_CLASSIC::file_position posEnd = results_.end.get_position();

            std::cout << "Token Id      : " << results_.id << std::endl
                      << "Token String  : " << std::string(results_.start, results_.end) << std::endl
                      << "Token Position: (" << posStart.line << "." << posStart.column
                      << " -> " << posEnd.line << "." << posEnd.column << ")\n"
                      << std::endl;
        } while (results_.id != rules_.eoi());
    }
    catch (const std::exception& e)
    {
        std::cout << "<Error> Exception: " << e.what() << std::endl;
    }

    return 0;
}
The content of the file test.txt is: abcd1234TTTT
The output of the run is as follows:
As you can see, the three tokens are parsed correctly, and the start and end line/column of each token are printed.
It seems that lexertl's author, Ben Hanson, is preparing to define a file_iterator for lexertl itself, to replace the one in Boost.Spirit. I have copied Ben Hanson's blog post below. If a separate file_iterator really is developed, we hope it will outperform the Boost.Spirit file_iterator in both compile time and runtime performance...
The lexertl Blog
29.09.2009
As I have recently started a revamp of lexertl I have decided to start a blog to keep everybody up to date. As this version is not feature complete yet, I have added a separate zip file which you can find here.
So far I have implemented the following improvements:
- Auto compression of wchar_t based state machines (overridable).
- A generic lookup mechanism based around iterators.
- Added the lexertl::skip token constant.
- Removed regex macro length limitation.
- Made the BOL (^) link a singleton (as it can only occur at the beginning of a token).
- debug::dump() now compresses ranges.
This dramatically reduces the list of (easier) features I wanted to add and just leaves the following for the immediate future:
- file_iterator (this will also replace the one in Boost.Spirit)
- Turn size_t into a templated type for state machine creation.
- Re-write the code generator.
- Redo serialisation.