Extracting a string with an embedded regex in awk: matching a specific substring against a regular expression

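This post covers using regular expressions in a Bash script to extract information from filenames with a specific format. The author notes that the filename contains an optional substring that starts with -W and may follow a specific numeric format. The current gawk attempt fails to ignore this optional part. The suggested solution is grep -P, which supports look-arounds and can extract the desired string.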

I'm dealing with filenames of a specific format, and need to extract information from them.

The structure of the filename is similar to: "20100613_M4_28007834.005_F_RANDOMSTR.raw.gz"

where RANDOMSTR is a string of at most 22 characters that may or may not contain a substring of the format "-W[0-9].[0-9]{2}.[0-9]{3}". This substring is the only part of the name that starts with "-W".

The information I need to extract is the substring of RANDOMSTR without this optional substring.

I want to implement this in a bash script, and so far the best option I found is to use gawk with a regular expression. My best attempt so far fails:

gawk --re-interval '{match ($0,"([0-9]{8})_(M[0-9])_([0-9]{8}\\.[0-9]{3})_(.)_(.*)(-W.*)?.raw.gz",arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"

OTHER-STRING-W0.40+045

The expected results are:

gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"

SOME-STRING

gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"

OTHER-STRING

How can I get the desired effect?

Thanks.

Solution

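The greedy (.*) in your attempt swallows the optional "-W" part, so (-W.*)? is left matching nothing. A lazy match combined with look-arounds avoids this; I don't think awk/gawk supports those, but grep -P does.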

$ pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.*?)(?=(-W.*)?\.raw\.gz)'

$ echo "20100613_M4_28007834.005_F_SOME-STRING.raw.gz" | grep -Po "$pat"

SOME-STRING

$ echo "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz" | grep -Po "$pat"

OTHER-STRING
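If grep -P is not available (it depends on grep being built with PCRE support), the same result can be sketched with bash built-ins alone. This is not part of the original answer: the function name extract_randomstr is hypothetical, and the sketch assumes, as the question states, that "-W" only ever appears at the start of the optional substring.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original answer): print RANDOMSTR
# without the optional "-W..." part, using only bash built-ins.
extract_randomstr() {
    local name=$1
    # Same fixed prefix as the grep pattern; capture everything up to ".raw.gz".
    local re='^[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._(.*)\.raw\.gz$'
    [[ $name =~ $re ]] || return 1   # filename does not have the expected shape
    local rand=${BASH_REMATCH[1]}
    # Assumption: "-W" occurs only in the optional substring, so cut
    # everything from the first "-W" onward (no-op if it is absent).
    printf '%s\n' "${rand%%-W*}"
}

extract_randomstr "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
# prints: SOME-STRING
extract_randomstr "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
# prints: OTHER-STRING
```

The match is split into two steps on purpose: the greedy capture grabs all of RANDOMSTR, and the parameter expansion then strips the optional part, so no lazy quantifier or look-around is needed.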
