Difference between [0-9], [[:digit:]] and \d

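The difference between \d and [0-9] in regular expressions
This article explains the difference between \d and [0-9] in regular expressions: in many programming languages \d stands for all Unicode digits, covering the ASCII digits as well as the digit characters of other scripts, whereas [0-9] matches only the ASCII digits 0 through 9. It uses Perl examples to show how different digit characters are matched, and explains the behaviour under POSIX and in different language environments.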

Yes, [[:digit:]] ~ [0-9] ~ \d (where ~ means approximately equal).
In most programming languages that support both, \d ≡ [[:digit:]] (identical).
\d is less widely available than [[:digit:]]: it is not part of POSIX, but GNU grep -P supports it.

There are many sets of digits in Unicode, for example:

0123456789 # ASCII (Hindu-Arabic numerals)
٠١٢٣٤٥٦٧٨٩ # ARABIC-INDIC
۰۱۲۳۴۵۶۷۸۹ # EXTENDED ARABIC-INDIC/PERSIAN
߀߁߂߃߄߅߆߇߈߉ # NKO DIGIT
०१२३४५६७८९ # DEVANAGARI

All of these may be matched by [[:digit:]] or \d.

By contrast, [0-9] generally matches only the ASCII digits 0123456789.
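The contrast can also be checked outside the shell. The following is a minimal Python 3 sketch, not part of the original answer; the names ZEROS and sample are made up for illustration, and the sample string is built from code points so that it does not depend on the editor's encoding. It relies on the documented behaviour of Python's re module, where \d in a str pattern matches Unicode decimal digits unless the re.ASCII flag is given.

```python
import re

# First code point ("zero") of each digit set listed above.
ZEROS = {
    "ASCII": 0x0030,
    "ARABIC-INDIC": 0x0660,
    "EXTENDED ARABIC-INDIC": 0x06F0,
    "NKO": 0x07C0,
    "DEVANAGARI": 0x0966,
}

# Five space-separated runs of ten digits each.
sample = " ".join(
    "".join(chr(zero + i) for i in range(10)) for zero in ZEROS.values()
)

unicode_runs = re.findall(r"\d+", sample)              # \d is Unicode-aware for str patterns
range_runs = re.findall(r"[0-9]+", sample)             # explicit ASCII range
ascii_flag_runs = re.findall(r"\d+", sample, re.ASCII)  # \d restricted to [0-9]

print(len(unicode_runs))   # 5
print(range_runs)          # ['0123456789']
print(ascii_flag_runs)     # ['0123456789']
```

Passing re.ASCII is one way to keep the \d shorthand while getting the narrower [0-9] meaning.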


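In several languages, \d (and [[:digit:]] where it is supported) takes on this extended Unicode meaning: Perl and Python 3 do so for Unicode strings, and Java does so when the UNICODE_CHARACTER_CLASS flag is set. For example, this Perl code will match all the digits from above: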

$ a='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'

$ echo "$a" | perl -C -pe 's/[^\d]//g;' ; echo
0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९

This is equivalent to selecting all characters in the Unicode general category Nd (Number, decimal digit):

$ echo "$a" | perl -C -pe 's/[^\p{Nd}]//g;' ; echo
0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९
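The same equivalence can be sketched in Python 3, assuming the standard unicodedata module as the source of the general category. The helper keep_nd and the name ZEROS are hypothetical, introduced only for this illustration. As with PCRE, the exact set of Nd characters depends on the Unicode version bundled with the interpreter.

```python
import re
import unicodedata

# Zero code points: ASCII, Arabic-Indic, Extended Arabic-Indic, NKo, Devanagari.
ZEROS = [0x0030, 0x0660, 0x06F0, 0x07C0, 0x0966]
sample = " ".join("".join(chr(z + i) for i in range(10)) for z in ZEROS)


def keep_nd(text):
    """Keep only characters whose Unicode general category is Nd."""
    return "".join(ch for ch in text if unicodedata.category(ch) == "Nd")


by_category = keep_nd(sample)
by_regex = re.sub(r"[^\d]", "", sample)  # same substitution as the Perl one-liner

print(by_category == by_regex)  # True
print(len(by_category))         # 50
```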

grep can reproduce this (the PCRE version in use may have a different internal list of numeric code points than Perl):

$ echo "$a" | grep -oP '\p{Nd}+'
0123456789
٠١٢٣٤٥٦٧٨٩
۰۱۲۳۴۵۶۷۸۹
߀߁߂߃߄߅߆߇߈߉
०१२३४५६७८९

Change the pattern to [0-9] and only the ASCII digits are matched:

$ echo "$a" | grep -o '[0-9]\+'
0123456789

POSIX

For POSIX BRE or ERE specifically:
\d is not supported (it is not in POSIX, although GNU grep -P accepts it). [[:digit:]] is required by POSIX to correspond to the digit character class, which in turn is required by ISO C to be the characters 0 through 9 and nothing else. So only in the C locale do [0-9], [0123456789], \d and [[:digit:]] all mean exactly the same thing. [0123456789] has no possible misinterpretation; [[:digit:]] is available in more utilities and commonly means only [0123456789]; \d is supported by few utilities.

As for [0-9], the meaning of range expressions is only defined by POSIX in the C locale; in other locales it might be different (might be codepoint order or collation order or something else).
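Where input must consist of ASCII digits only, an explicit check avoids relying on any of these interpretations. The following is a minimal Python 3 sketch; the function name is_ascii_number is hypothetical. Unlike POSIX range expressions, Python's re module compares [0-9] by code point for str patterns, so the locale does not change the result. The built-in str.isdigit() and int(), however, do accept non-ASCII decimal digits.

```python
import re

ASCII_DIGITS = re.compile(r"[0-9]+")


def is_ascii_number(text):
    """Return True only if text consists solely of the ASCII digits 0-9."""
    return ASCII_DIGITS.fullmatch(text) is not None


# ARABIC-INDIC DIGIT THREE followed by ARABIC-INDIC DIGIT FOUR.
arabic_indic_34 = "\u0663\u0664"

print(arabic_indic_34.isdigit())         # True
print(int(arabic_indic_34))              # 34
print(is_ascii_number(arabic_indic_34))  # False
print(is_ascii_number("34"))             # True
```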

Reposted from: https://www.cnblogs.com/kakaisgood/p/9645277.html
