Matching Unicode characters in Lex

This post shows how to build a Unicode-aware JSON parser with Lex and Yacc: by defining regular expressions that match UTF-8 byte sequences, JSON's string type can be lexed correctly.

I wanted to write a JSON parser with lex & yacc. JSON's string type can contain Unicode, but the lexer generator Lex has no direct support for matching Unicode characters, so how do you match them? There is a very good answer on Stack Overflow: http://stackoverflow.com/questions/9611682/flexlexer-support-for-unicode.

The basic idea is to define match patterns for UTF-8 encoded characters:

ASC     [\x00-\x7f]
ASCN    [\x00-\t\v-\x7f]
U       [\x80-\xbf]
U2      [\xc2-\xdf]
U3      [\xe0-\xef]
U4      [\xf0-\xf4]

UANY    {ASC}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
UANYN   {ASCN}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U} 
UONLY   {U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}

The patterns above mean the following:

UANY: matches any character, ASCII or multi-byte Unicode
UANYN: like UANY, but does not match newline
UONLY: matches only multi-byte Unicode characters, not ASCII

These names can then be used directly in the rules section; see the sketch below.
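
As a minimal illustration (a hypothetical sketch, not the scanner from the Stack Overflow answer), the flex input below reuses these byte classes to recognize JSON string literals containing UTF-8; the STRASC, JCHAR and ESC definitions are additions of mine:

%{
#include <stdio.h>
%}

%option noyywrap

U       [\x80-\xbf]
U2      [\xc2-\xdf]
U3      [\xe0-\xef]
U4      [\xf0-\xf4]

/* ASCII bytes legal inside a JSON string: no control chars, no '"', no '\' */
STRASC  [\x20-\x21\x23-\x5b\x5d-\x7f]
/* one unescaped string character: restricted ASCII or a multi-byte UTF-8 sequence */
JCHAR   {STRASC}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
/* JSON escape sequences, including \uXXXX */
ESC     \\(["\\/bfnrt]|u[0-9a-fA-F]{4})

%%

\"({JCHAR}|{ESC})*\"   { printf("STRING: %s\n", yytext); }
[ \t\r\n]+             { /* skip whitespace */ }
.                      { printf("other byte: 0x%02x\n", (unsigned char)yytext[0]); }

%%

int main(void)
{
    yylex();
    return 0;
}

Built with something like flex json_str.l && cc lex.yy.c -o json_str, this prints every string literal it sees on stdin; in a real lex & yacc setup the action would return a STRING token to the parser instead of printing.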

The answer also carries this disclaimer about where validation actually happens:

DISCLAIMER: Note that the scanner’s rules use a function called
utf8_dup_from to convert the yytext to wide character strings
containing Unicode codepoints. That function is robust; it detects
problems like overlong sequences and invalid bytes and properly
handles them. I.e. this program is not relying on these lex rules to
do the validation and conversion, just to do the basic lexical
recognition. These rules will recognize an overlong form (like an
ASCII code encoded using several bytes) as valid syntax, but the
conversion function will treat them properly. In any case, I don’t
expect UTF-8 related security issues in the program source code, since
you have to trust source code to be running it anyway (but data
handled by the program may not be trusted!) If you’re writing a
scanner for untrusted UTF-8 data, take care!
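
Note, for example, that the {U3}{U}{U} branch will match the overlong sequence E0 80 80 (an illegal 3-byte encoding of U+0000), so the decoding step really does need its own checks. A hypothetical check along those lines (not the utf8_dup_from function the answer refers to) could look like this:

#include <stdint.h>
#include <stdio.h>

/* Decode one 3-byte UTF-8 sequence, rejecting overlong forms and surrogates.
   Returns the codepoint, or -1 if the sequence is invalid. */
static int32_t decode_utf8_3(const unsigned char *s)
{
    if ((s[0] & 0xf0) != 0xe0 ||        /* lead byte must be 1110xxxx */
        (s[1] & 0xc0) != 0x80 ||        /* continuation bytes must be 10xxxxxx */
        (s[2] & 0xc0) != 0x80)
        return -1;

    int32_t cp = ((int32_t)(s[0] & 0x0f) << 12) |
                 ((int32_t)(s[1] & 0x3f) << 6)  |
                 (int32_t)(s[2] & 0x3f);

    if (cp < 0x800)                     /* overlong: would fit in 1 or 2 bytes */
        return -1;
    if (cp >= 0xd800 && cp <= 0xdfff)   /* UTF-16 surrogate range is illegal */
        return -1;
    return cp;
}

int main(void)
{
    const unsigned char overlong[] = { 0xe0, 0x80, 0x80 };  /* overlong U+0000 */
    const unsigned char ok[]       = { 0xe4, 0xbd, 0xa0 };  /* U+4F60 */
    printf("%d %d\n", decode_utf8_3(overlong), decode_utf8_3(ok));
    return 0;
}

Running it prints -1 for the overlong sequence and 20320 (U+4F60) for the valid one.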
