Tag matching using Regex

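This article shows how to use regular expressions to match different kinds of tags, both single-character and multi-character, and discusses how to handle nested tags. It also offers an optimization tip and the result of a Python performance test.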

 

Single-character tag matching

We often need to match content that is enclosed by tags inside a string, for example

abc "Hello, world"

To match the string between the double quotes, we can use the following regex:

    "[^"]*"

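Here the quantifier is greedy. To prevent over-matching, we use a negated character class, [^"], for the content between the quotes. Alternatively, we can use a lazy (non-greedy) quantifier: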

    ".*?"

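Because a lazy quantifier stops at the first closing quote it can reach, it does not over-match. In the input below, it matches each quoted string separately instead of spanning from the first quote to the last: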

 abc "hello, world" "hello, boy"
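The difference between the two safe patterns and a plain greedy dot is easy to check in Python. The following is a minimal sketch using the standard `re` module; the greedy `".*"` pattern is not one of the article's patterns and is included only for contrast.

```python
import re

text = 'abc "hello, world" "hello, boy"'

# Negated character class: the match can never run past a closing quote.
negated = re.findall(r'"[^"]*"', text)

# Lazy quantifier: stops at the first closing quote it can reach.
lazy = re.findall(r'".*?"', text)

# Plain greedy dot, for contrast: it runs on to the last quote in the line.
greedy = re.findall(r'".*"', text)

print(negated)  # ['"hello, world"', '"hello, boy"']
print(lazy)     # ['"hello, world"', '"hello, boy"']
print(greedy)   # ['"hello, world" "hello, boy"']
```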

Multi-character tag matching

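When the delimiters are not simple quotes, that is, when a tag consists of several characters, a plain negated character class no longer works. In HTML/XML, for example, every element tag is multi-character.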

<b>hello</b>

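Here, simply writing

    <b>[^</b>]*</b>

does not work, because [^</b>] does not mean "anything except the tag </b>". It means "any character except <, /, b, and >". As a result, the following valid content is not matched: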

    <b>/abc</b>

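What we actually need is to consume a character only when the text starting at that position is not </b>. This is the same idea as in the single-character case, except that what we exclude is now </b>. Lookaround achieves this: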

    <b>((?!</b>).)*</b>     (1)

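The negative lookahead forces the regex engine to check whether </b> starts at the current position. If it does, the dot is not allowed to match, the loop ends, and the engine goes on to match the closing </b>. Note that we must not write

    <b>(.(?!</b>))*</b>

This expression does not mean the same thing as the previous one. It consumes a character first and only then checks whether </b> follows. The character immediately before </b> can therefore never be consumed, so input such as <b>hello</b> is not matched at all.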
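The three variants discussed so far can be compared side by side. This is a small sketch with Python's `re` module; the variable names and the sample strings are chosen only for illustration.

```python
import re

# Naive attempt: a character class excludes single characters, not a tag.
naive = re.compile(r'<b>[^</b>]*</b>')
# Pattern (1): test the lookahead first, then consume one character.
correct = re.compile(r'<b>((?!</b>).)*</b>')
# Reordered variant: consume a character first, then test what follows.
reordered = re.compile(r'<b>(.(?!</b>))*</b>')

text = '<b>/abc</b> and <b>def</b>'

# finditer + group(0) returns whole matches; findall would return the
# content of the capturing group instead.
matches = [m.group(0) for m in correct.finditer(text)]
naive_match = naive.search('<b>/abc</b>')
reordered_match = reordered.search('<b>hello</b>')

print(matches)          # ['<b>/abc</b>', '<b>def</b>']
print(naive_match)      # None
print(reordered_match)  # None
```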

 

As with single-character tags, we can also use a lazy quantifier:

    <b>.*?</b>              (2)

There is no need to exclude </b> here, for the same reason as in the single-character case.
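One practical detail when trying pattern (2) in Python: by default the dot does not match a newline, so element content that spans several lines is skipped unless `re.DOTALL` is passed. The sketch below assumes the standard `re` module and an invented sample string.

```python
import re

text = '<b>one</b> and <b>two\nlines</b>'

# Pattern (2) as written: "." does not cross a newline by default.
single_line = re.findall(r'<b>.*?</b>', text)

# Same pattern with re.DOTALL, so that "." also matches "\n".
multi_line = re.findall(r'<b>.*?</b>', text, flags=re.DOTALL)

print(single_line)  # ['<b>one</b>']
print(multi_line)   # ['<b>one</b>', '<b>two\nlines</b>']
```

The same flag applies to pattern (1), since it also relies on the dot to consume each character.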

 

Handling Nested Tags

The patterns above cannot handle nested tags, for example

<b>hello<b>world</b>

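With (1), the match is clearly the entire string, but a match that begins at <b>hello is usually not what we want. (2) does not help either: it over-matches in the same way.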

What we really need to match between the tags is any character that does not begin a starting tag or an ending tag. In other words, we need

    start(except start, end)*end

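With greedy matching, we must exclude both the starting and the ending tag:

  • Excluding the starting tag prevents the match from swallowing a nested tag.
  • Excluding the ending tag prevents over-matching, e.g. in <b>blahblah</b>foo</b>.
    <b>((?!<b>|</b>).)*</b> => <b>((?!</?b>).)*</b>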

With lazy matching, we only need to exclude the starting tag, because a lazy quantifier never runs past the first </b>.

    <b>((?!<b>).)*?</b>
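To see the effect on the nested example, the four patterns can be run against the same input. This is a minimal sketch using Python's `re` module; the dictionary keys are labels invented for the example.

```python
import re

text = '<b>hello<b>world</b>'

patterns = {
    'pattern_1':  r'<b>((?!</b>).)*</b>',    # excludes only the ending tag
    'pattern_2':  r'<b>.*?</b>',             # lazy, excludes nothing
    'greedy_fix': r'<b>((?!</?b>).)*</b>',   # greedy, excludes both tags
    'lazy_fix':   r'<b>((?!<b>).)*?</b>',    # lazy, excludes the starting tag
}

# Whole-match text (group 0) of the first match for each pattern.
results = {name: re.search(p, text).group(0) for name, p in patterns.items()}

for name, matched in results.items():
    print(f'{name}: {matched}')
# pattern_1: <b>hello<b>world</b>
# pattern_2: <b>hello<b>world</b>
# greedy_fix: <b>world</b>
# lazy_fix: <b>world</b>
```

Both fixed patterns fail at the first <b>, because the loop cannot step over the nested starting tag, and the engine then retries from the inner <b>.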


Optimization

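Lazy quantifiers already let us match tags, but for some tags we can do better. Take C comments: a C comment has the form /*...*/, and /*...*/ cannot be nested. We can of course build the pattern with a lazy quantifier as before, which gives:

    /\*((?!/\*).)*?\*/

But we can be more direct and use:

    /\*.*?\*/

The plain lazy form is enough here because /*...*/ comments do not nest: the first */ after the opening /* always closes the comment, so there is no nested /* to guard against and no need to exclude it.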

In a performance test with Python, the second expression was about 69% more efficient than the first.
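The article does not show its test setup, so the following is only a sketch of how such a comparison could be made with the standard `timeit` module. The sample input, the `find_all` and `benchmark` helpers, and the repeat count are all assumptions made for illustration; the measured ratio will depend on the input and the Python version.

```python
import re
import timeit

with_lookahead = re.compile(r'/\*((?!/\*).)*?\*/')
plain_lazy = re.compile(r'/\*.*?\*/')

# Hypothetical sample input: many short one-line C comments.
sample = 'int x = 0; /* counter */ x++; /* increment it */\n' * 1000

def find_all(pattern, text):
    """Return the full text (group 0) of every match."""
    return [m.group(0) for m in pattern.finditer(text)]

def benchmark(repeat=5):
    """Time both patterns on the sample; returns (lookahead, plain) seconds."""
    t_lookahead = timeit.timeit(lambda: find_all(with_lookahead, sample),
                                number=repeat)
    t_plain = timeit.timeit(lambda: find_all(plain_lazy, sample),
                            number=repeat)
    return t_lookahead, t_plain

if __name__ == '__main__':
    t1, t2 = benchmark()
    print(f'with lookahead: {t1:.4f}s, plain lazy: {t2:.4f}s')
```

On this sample both patterns return the same comments, so the timing difference reflects only the cost of running the lookahead at every character.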

Conclusion

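For the tag-matching problem, we arrive at three patterns:

  • Greedy matching
    start((?!start|end).)*end
  • Lazy matching, where nested tags may occur between start and end
    start((?!start).)*?end
  • Lazy matching, where no nesting can occur between start and end
    start.*?end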

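For single-character tags, we simply replace the lookaround with a negated character class (s is the starting character, e the ending one), which gives

  • Greedy matching
    s[^se]*e
  • Lazy matching
    s[^s]*?e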
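The general patterns can be wrapped in a small helper that builds the regex from literal start and end tags. This is a sketch, not part of the original article: the function name `tag_pattern` and its `nested` and `greedy` parameters are hypothetical, and it uses non-capturing groups `(?:...)` instead of the capturing groups shown above so that `findall` returns whole matches.

```python
import re

def tag_pattern(start, end, nested=True, greedy=False):
    """Build a compiled regex for text between literal start/end tags.

    greedy=True  -> start((?!start|end).)*end
    nested=True  -> start((?!start).)*?end   (lazy, nesting possible)
    nested=False -> start.*?end              (lazy, no nesting possible)
    """
    s, e = re.escape(start), re.escape(end)
    if greedy:
        body = f'(?:(?!{s}|{e}).)*'
    elif nested:
        body = f'(?:(?!{s}).)*?'
    else:
        body = '.*?'
    return re.compile(s + body + e)

innermost = tag_pattern('<b>', '</b>').findall('<b>hello<b>world</b>')
comments = tag_pattern('/*', '*/', nested=False).findall('a /* x */ b /* y */')
bounded = tag_pattern('<b>', '</b>', greedy=True).findall('<b>blahblah</b>foo</b>')

print(innermost)  # ['<b>world</b>']
print(comments)   # ['/* x */', '/* y */']
print(bounded)    # ['<b>blahblah</b>']
```

`re.escape` is used so that tags containing regex metacharacters, such as the `*` in `/*`, are treated literally.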

 Note

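Python's built-in re module does not support variable-length expressions inside lookbehind; lookahead, which is what the patterns above use, has no such restriction. Keep this in mind when constructing tag patterns that rely on lookbehind.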
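In the standard `re` module the fixed-width requirement applies to lookbehind, while lookahead accepts variable-length expressions. The following sketch checks both cases; the `<b\s*>` tag pattern is an invented example of a variable-length tag.

```python
import re

# Variable-length lookahead: compiles and works.
lookahead = re.compile(r'(?=<b\s*>)')
lookahead_ok = lookahead.search('x<b  >') is not None

# Variable-length lookbehind: rejected when the pattern is compiled.
try:
    re.compile(r'(?<=<b\s*>)hello')
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print('lookbehind rejected:', exc)

print(lookahead_ok, lookbehind_ok)  # True False
```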

 Reference

[1] Jeffrey E. F. Friedl, Mastering Regular Expressions

 
