Extracting a string with an embedded regex in awk: matching a specific substring against a regular expression with awk

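This post is about using regular expressions in a Bash script to extract information from filenames with a specific format. The author notes that the filename contains an optional substring that starts with -W and may follow a particular numeric format. The current gawk attempt does not correctly ignore this optional part. The suggested solution is grep -P, which supports look-arounds and can extract the required string.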

I'm dealing with specific filenames and need to extract information from them.

The structure of the filename is similar to: "20100613_M4_28007834.005_F_RANDOMSTR.raw.gz"

where RANDOMSTR is a string of at most 22 characters that may or may not contain a substring of the format "-W[0-9].[0-9]{2}.[0-9]{3}". This substring also has the distinguishing feature of starting with "-W".

The information I need to extract is the substring of RANDOMSTR without this optional substring.

I want to implement this in a bash script, and so far the best option I found is to use gawk with a regular expression. My best attempt so far fails:

gawk --re-interval '{match ($0,"([0-9]{8})_(M[0-9])_([0-9]{8}\\.[0-9]{3})_(.)_(.*)(-W.*)?.raw.gz",arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"

OTHER-STRING-W0.40+045
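The attempt prints the whole of RANDOMSTR because `(.*)` is greedy: it consumes the "-W..." part as well, and the optional group `(-W.*)?` is then free to match nothing. The following is a minimal sketch that reproduces the same behaviour with bash's `[[ =~ ]]` operator, which also uses POSIX extended regular expressions. The variable names are illustrative and not part of the original question.

```shell
#!/usr/bin/env bash
# Illustrative only: show that the greedy (.*) swallows the optional "-W..." part.
name='20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz'
re='^([0-9]{8})_(M[0-9])_([0-9]{8}\.[0-9]{3})_(.)_(.*)(-W.*)?\.raw\.gz$'

if [[ $name =~ $re ]]; then
  greedy_capture=${BASH_REMATCH[5]}    # group 5: the greedy (.*)
  optional_capture=${BASH_REMATCH[6]}  # group 6: the optional (-W.*)?
  printf 'group 5: %s\n' "$greedy_capture"
  printf 'group 6: %s\n' "${optional_capture:-<empty>}"
fi
```

Group 5 ends up holding the full "OTHER-STRING-W0.40+045" and group 6 stays empty, which matches the gawk output above.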

The expected results are as follows, where $regexp stands for the corrected regular expression:

gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"

SOME-STRING

gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"

OTHER-STRING

How can I get the desired effect?

Thanks.

Solution

You need to be able to use look-arounds and I don't think awk/gawk supports that, but grep -P does.

$ pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.*?)(?=(-W.*)?\.raw\.gz)'

$ echo "20100613_M4_28007834.005_F_SOME-STRING.raw.gz" | grep -Po "$pat"
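With a grep built with PCRE support, the command above should print SOME-STRING, and the same pattern should print OTHER-STRING for the second filename. Where `grep -P` is not available (BSD grep on macOS, for example), a two-step match in plain bash can serve as an alternative. This is only a sketch: `extract_randomstr` is a hypothetical helper name, and it assumes the optional part follows the "-W[0-9].[0-9]{2}.[0-9]{3}" format given in the question, whereas the grep pattern drops everything from "-W" onwards.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print RANDOMSTR with the optional "-W..." substring removed.
extract_randomstr() {
  local name=$1
  # Step 1: capture RANDOMSTR (greedy, still including any "-W..." part).
  local name_re='^[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._(.*)\.raw\.gz$'
  # Step 2: the optional substring, using the format stated in the question
  # (assumption: the unescaped dots stand for any single character).
  local opt_re='^(.*)-W[0-9].[0-9]{2}.[0-9]{3}(.*)$'

  [[ $name =~ $name_re ]] || return 1   # filename does not have the expected shape
  local random=${BASH_REMATCH[1]}

  if [[ $random =~ $opt_re ]]; then
    # Keep the text before and after the optional substring.
    random=${BASH_REMATCH[1]}${BASH_REMATCH[2]}
  fi
  printf '%s\n' "$random"
}

extract_randomstr '20100613_M4_28007834.005_F_SOME-STRING.raw.gz'
extract_randomstr '20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz'
```

Splitting the work into two matches avoids the need for a non-greedy quantifier or a look-ahead, neither of which POSIX extended regular expressions provide.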