Matching Unicode characters in Lex

This post shows how to build a Unicode-aware JSON parser with Lex and Yacc: by defining regular expressions that match UTF-8 byte sequences, JSON's string type can be lexed correctly.

I wanted to write a JSON parser with lex & yacc. JSON's string type can contain Unicode, but the lexer generator Lex has no direct support for matching Unicode characters, so how do you match them? There is a very good answer on Stack Overflow: http://stackoverflow.com/questions/9611682/flexlexer-support-for-unicode.

The basic idea is to define match patterns for UTF-8 encoded characters:

ASC     [\x00-\x7f]
ASCN    [\x00-\t\v-\x7f]
U       [\x80-\xbf]
U2      [\xc2-\xdf]
U3      [\xe0-\xef]
U4      [\xf0-\xf4]

UANY    {ASC}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
UANYN   {ASCN}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U} 
UONLY   {U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}

The patterns above mean the following:

UANY: matches any character, ASCII or multi-byte Unicode
UANYN: like UANY, but does not match newline
UONLY: matches only multi-byte Unicode characters, not ASCII

These names can then be used directly in the rules section; see the sketch below.
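
As a minimal illustration (a hypothetical sketch, not the scanner from the Stack Overflow answer), the flex input below reuses these byte classes to recognize JSON string literals containing UTF-8; the STRASC, JCHAR and ESC definitions are additions of mine:

%{
#include <stdio.h>
%}

%option noyywrap

U       [\x80-\xbf]
U2      [\xc2-\xdf]
U3      [\xe0-\xef]
U4      [\xf0-\xf4]

/* ASCII bytes legal inside a JSON string: no control chars, no '"', no '\' */
STRASC  [\x20-\x21\x23-\x5b\x5d-\x7f]
/* one unescaped string character: restricted ASCII or a multi-byte UTF-8 sequence */
JCHAR   {STRASC}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
/* JSON escape sequences, including \uXXXX */
ESC     \\(["\\/bfnrt]|u[0-9a-fA-F]{4})

%%

\"({JCHAR}|{ESC})*\"   { printf("STRING: %s\n", yytext); }
[ \t\r\n]+             { /* skip whitespace */ }
.                      { printf("other byte: 0x%02x\n", (unsigned char)yytext[0]); }

%%

int main(void)
{
    yylex();
    return 0;
}

Built with something like flex json_str.l && cc lex.yy.c -o json_str, this prints every string literal it sees on stdin; in a real lex & yacc setup the action would return a STRING token to the parser instead of printing.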

The answer also carries this disclaimer about where validation actually happens:

DISCLAIMER: Note that the scanner’s rules use a function called
utf8_dup_from to convert the yytext to wide character strings
containing Unicode codepoints. That function is robust; it detects
problems like overlong sequences and invalid bytes and properly
handles them. I.e. this program is not relying on these lex rules to
do the validation and conversion, just to do the basic lexical
recognition. These rules will recognize an overlong form (like an
ASCII code encoded using several bytes) as valid syntax, but the
conversion function will treat them properly. In any case, I don’t
expect UTF-8 related security issues in the program source code, since
you have to trust source code to be running it anyway (but data
handled by the program may not be trusted!) If you’re writing a
scanner for untrusted UTF-8 data, take care!
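
Note, for example, that the {U3}{U}{U} branch will match the overlong sequence E0 80 80 (an illegal 3-byte encoding of U+0000), so the decoding step really does need its own checks. A hypothetical check along those lines (not the utf8_dup_from function the answer refers to) could look like this:

#include <stdint.h>
#include <stdio.h>

/* Decode one 3-byte UTF-8 sequence, rejecting overlong forms and surrogates.
   Returns the codepoint, or -1 if the sequence is invalid. */
static int32_t decode_utf8_3(const unsigned char *s)
{
    if ((s[0] & 0xf0) != 0xe0 ||        /* lead byte must be 1110xxxx */
        (s[1] & 0xc0) != 0x80 ||        /* continuation bytes must be 10xxxxxx */
        (s[2] & 0xc0) != 0x80)
        return -1;

    int32_t cp = ((int32_t)(s[0] & 0x0f) << 12) |
                 ((int32_t)(s[1] & 0x3f) << 6)  |
                 (int32_t)(s[2] & 0x3f);

    if (cp < 0x800)                     /* overlong: would fit in 1 or 2 bytes */
        return -1;
    if (cp >= 0xd800 && cp <= 0xdfff)   /* UTF-16 surrogate range is illegal */
        return -1;
    return cp;
}

int main(void)
{
    const unsigned char overlong[] = { 0xe0, 0x80, 0x80 };  /* overlong U+0000 */
    const unsigned char ok[]       = { 0xe4, 0xbd, 0xa0 };  /* U+4F60 */
    printf("%d %d\n", decode_utf8_3(overlong), decode_utf8_3(ok));
    return 0;
}

Running it prints -1 for the overlong sequence and 20320 (U+4F60) for the valid one.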
