Introducing Regular Expressions
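Regular expressions originated in Unix command-line tools and are now supported on a growing number of platforms for searching and replacing strings.
This article describes some of their syntax.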
Matching Sets of Characters
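To match any one of several characters, put them inside [ ]. Note that a set matches exactly one character from those listed.
For example, the digit set [1234567890] matches a single digit.
The same set can be written more compactly as a range: [0-9].
To match an uppercase English letter, use [A-Z].
To match a lowercase English letter, use [a-z].
A ^ at the start of a set negates it, so the set matches any character outside the listed ranges. For example, [^a-zA-Z] matches any character that is not an English letter.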
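As a minimal sketch of the sets described above, the following uses Python's standard re module. The sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Room 42B, floor 7"

# [0-9] matches exactly one digit per match.
digits = re.findall(r"[0-9]", text)

# [A-Z] matches one uppercase English letter per match.
uppercase = re.findall(r"[A-Z]", text)

# [^a-zA-Z] matches any single character that is NOT an English letter.
non_letters = re.findall(r"[^a-zA-Z]", text)

print(digits)       # ['4', '2', '7']
print(uppercase)    # ['R', 'B']
print(non_letters)  # [' ', '4', '2', ',', ' ', ' ', '7']
```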
Using Metacharacters
Regular expressions also have metacharacters (escape sequences) with special meanings, as listed below.
Metacharacter | Description
[\b] | Backspace
\f | Form feed
\n | Line feed
\r | Carriage return
\t | Tab
\v | Vertical tab
\d | Any digit (same as [0-9])
\D | Any nondigit (same as [^0-9])
\w | Any alphanumeric character in upper- or lower-case, or underscore (same as [a-zA-Z0-9_])
\W | Any character that is not alphanumeric or underscore (same as [^a-zA-Z0-9_])
\s | Any whitespace character (in most engines, same as [ \f\n\r\t\v])
\S | Any nonwhitespace character (in most engines, same as [^ \f\n\r\t\v])
[:alnum:] | Any letter or digit (same as [a-zA-Z0-9])
[:alpha:] | Any letter (same as [a-zA-Z])
[:blank:] | Space or tab (same as [\t ])
[:cntrl:] | ASCII control characters (ASCII 0 through 31 and 127)
[:digit:] | Any digit (same as [0-9])
[:graph:] | Same as [:print:] but excludes space
[:lower:] | Any lowercase letter (same as [a-z])
[:print:] | Any printable character
[:punct:] | Any punctuation character (in [:graph:] but not in [:alnum:])
[:space:] | Any whitespace character including space (same as [\f\n\r\t\v ])
[:upper:] | Any uppercase letter (same as [A-Z])
[:xdigit:] | Any hexadecimal digit (same as [a-fA-F0-9])
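The shorthand classes in the table can be tried out with a short sketch like the one below. It assumes Python's re module, which does not support the POSIX [:name:] classes, so the explicit set from the table stands in for [:xdigit:]. The re.ASCII flag keeps \d, \w, and \s limited to the ASCII ranges listed above. The sample strings are made up.

```python
import re

# Hypothetical sample: letters, underscore, a digit, a tab, and a newline.
sample = "id_7\tok\n"

# re.ASCII restricts \d, \w and \s to their ASCII definitions.
digits = re.findall(r"\d", sample, flags=re.ASCII)
word_chars = re.findall(r"\w", sample, flags=re.ASCII)
whitespace = re.findall(r"\s", sample, flags=re.ASCII)
non_word = re.findall(r"\W", sample, flags=re.ASCII)

# Python's re has no [:xdigit:], so the equivalent explicit set is used.
is_hex = re.fullmatch(r"[a-fA-F0-9]+", "1F3a") is not None
not_hex = re.fullmatch(r"[a-fA-F0-9]+", "1G3a") is not None

print(digits)      # ['7']
print(word_chars)  # ['i', 'd', '_', '7', 'o', 'k']
print(whitespace)  # ['\t', '\n']
print(is_hex, not_hex)  # True False
```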
Repeating Matches
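To match repeated characters, regular expressions provide the following quantifiers:
+ matches one or more, e.g. [0-9]+
* matches zero or more, e.g. [0-9]*
? matches zero or one, e.g. [0-9]?
{3} matches exactly 3
{1,3} matches from 1 to 3
{,3} matches from 0 to 3 (not every engine accepts the omitted lower bound; {0,3} is the portable form)
{3,} matches 3 or more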
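One way to check each quantifier is to test it against a handful of strings. The sketch below assumes Python's re module; the candidate strings and the matching() helper are hypothetical names introduced only for this illustration.

```python
import re

# Hypothetical candidates: an empty string and digit runs of various lengths.
candidates = ["", "7", "42", "123", "12345"]

def matching(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

one_or_more = matching(r"[0-9]+")      # everything except the empty string
zero_or_more = matching(r"[0-9]*")     # all candidates, including ""
zero_or_one = matching(r"[0-9]?")      # "" and single digits only
exactly_three = matching(r"[0-9]{3}")
one_to_three = matching(r"[0-9]{1,3}")
up_to_three = matching(r"[0-9]{0,3}")  # portable spelling of {,3}
three_or_more = matching(r"[0-9]{3,}")

print(exactly_three)  # ['123']
print(three_or_more)  # ['123', '12345']
```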
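Note in particular that
*?, +?, and {n,}? are the lazy versions of these quantifiers: they match as few characters as possible.
For example:
This offer is not available to customers living in <B>AK</B> and <B>HI</B>.
To extract the bold text from this HTML, the obvious pattern is <[Bb]>.*</[Bb]>
But the result is <B>AK</B> and <B>HI</B>: everything from the first <B> to the last </B> is returned as a single match.
That is clearly not what we want, so how do we fix it?
This is where lazy quantifiers come in.
Using <[Bb]>.*?</[Bb]> returns
<B>AK</B>
<B>HI</B>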
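The difference between the greedy and lazy patterns can be reproduced with a short sketch, here assuming Python's re module and the sample sentence from the example above.

```python
import re

text = ("This offer is not available to customers living in "
        "<B>AK</B> and <B>HI</B>.")

# Greedy: .* grabs as much as possible, so one match spans both tags.
greedy = re.findall(r"<[Bb]>.*</[Bb]>", text)

# Lazy: .*? stops at the first closing tag, giving one match per tag pair.
lazy = re.findall(r"<[Bb]>.*?</[Bb]>", text)

print(greedy)  # ['<B>AK</B> and <B>HI</B>']
print(lazy)    # ['<B>AK</B>', '<B>HI</B>']
```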
Using Subexpressions
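In search-and-replace expressions, we can wrap parts of the pattern in ( ) and refer back to what each part matched.
For example, take US phone numbers such as:
313-555-1234
248-555-9999
810-555-9000
We can search with the regular expression
(\d{3})(-)(\d{3})(-)(\d{4})
and replace with
($1) $3-$5
which gives
(313) 555-1234
(248) 555-9999
(810) 555-9000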
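The same replacement can be sketched in Python, with one difference worth noting: Python's re.sub writes backreferences as \1, \3, \5 in the replacement template rather than $1, $3, $5. The variable names are illustrative.

```python
import re

numbers = ["313-555-1234", "248-555-9999", "810-555-9000"]

# Five groups: area code, dash, prefix, dash, line number.
pattern = re.compile(r"(\d{3})(-)(\d{3})(-)(\d{4})")

# \1, \3 and \5 refer to groups 1, 3 and 5; the dashes in groups 2 and 4 are dropped.
formatted = [pattern.sub(r"(\1) \3-\5", n) for n in numbers]

for line in formatted:
    print(line)
```

Only groups 1, 3, and 5 are reused here, so the two (-) groups could be left ungrouped; they are kept to mirror the pattern in the text.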
Replacement strings can also use the following case-conversion metacharacters:
Metacharacter | Description
\E | Terminate \L or \U conversion
\l | Convert next character to lowercase
\L | Convert all characters up to \E to lowercase
\u | Convert next character to uppercase
\U | Convert all characters up to \E to uppercase
For example, given the following text:
<BODY>
<H1>Welcome to my Homepage</H1>
Content is divided into two sections:<BR>
<H2>ColdFusion</H2>
Information about Macromedia ColdFusion.
<H2>Wireless</H2>
Information about Bluetooth, 802.11, and more.
<H2>This is not valid HTML</H3>
</BODY>
Searching for the pattern
(<[Hh]1>)(.*?)(</[Hh]1>)
and replacing with
$1\U$2\E$3
produces:
<BODY>
<H1>WELCOME TO MY HOMEPAGE</H1>
Content is divided into two sections:<BR>
<H2>ColdFusion</H2>
Information about Macromedia ColdFusion.
<H2>Wireless</H2>
Information about Bluetooth, 802.11, and more.
<H2>This is not valid HTML</H3>
</BODY>
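Not every engine supports \U and \E in replacement strings. Python's re.sub does not, so the sketch below swaps in a different technique: a replacement function that uppercases the second group. The function name upper_heading and the shortened input are assumptions made for illustration.

```python
import re

# Shortened version of the page above, for illustration.
html = "<H1>Welcome to my Homepage</H1>\n<H2>ColdFusion</H2>"

def upper_heading(match):
    """Rebuild the match with the heading text (group 2) in uppercase."""
    return match.group(1) + match.group(2).upper() + match.group(3)

# re.sub calls upper_heading once per match and inserts its return value.
result = re.sub(r"(<[Hh]1>)(.*?)(</[Hh]1>)", upper_heading, html)

print(result)
```

The H2 line is left unchanged because the pattern only matches H1 tags.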
Looking Ahead and Behind
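(?=...) is a lookahead: it checks that the text that follows matches, but does not include that text in the result. For example, given
http://www.forta.com/
the pattern
.+(?=:)
returns
http
(?<=...) is a lookbehind: it checks that the preceding text matches, but does not include it in the result. For example, given
ABC01: $23.45
the pattern
(?<=\$)[0-9.]+
returns
23.45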
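Both examples can be checked with a short sketch, assuming Python's re module. The variable names are illustrative.

```python
import re

# Lookahead: match everything up to, but not including, the colon.
scheme = re.search(r".+(?=:)", "http://www.forta.com/").group()

# Lookbehind: match the number that follows a dollar sign, without the sign.
price = re.search(r"(?<=\$)[0-9.]+", "ABC01: $23.45").group()

print(scheme)  # http
print(price)   # 23.45
```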
Class | Description
(?=) | Positive lookahead
(?!) | Negative lookahead
(?<=) | Positive lookbehind
(?<!) | Negative lookbehind
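The negative forms in the table work the same way but require that the surrounding text does not match. The following is a small sketch in Python; the sample sentence and words are invented for illustration.

```python
import re

# Hypothetical sentence mixing prices and quantities.
text = "I paid $30 for 100 apples and $5 for 50 pears."

# Positive lookbehind: digits that directly follow a dollar sign.
prices = re.findall(r"(?<=\$)[0-9]+", text)

# Negative lookbehind: digit runs NOT preceded by "$" or by another digit
# (excluding a preceding digit prevents matching the tail of a price).
quantities = re.findall(r"(?<![$0-9])[0-9]+", text)

# Negative lookahead: position of the first "q" that is NOT followed by "u".
q_index = re.search(r"q(?!u)", "quit Iraq").start()

print(prices)      # ['30', '5']
print(quantities)  # ['100', '50']
print(q_index)     # 8
```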
Commonly Used Regular Expressions
IP addresses:
(((\d{1,2})|(1\d{2})|(2[0-4]\d)|(25[0-5]))\.){3}((\d{1,2})|(1\d{2})|(2[0-4]\d)|(25[0-5]))
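As a rough check of the IP address pattern, the sketch below wraps it in a helper that requires the whole string to match. It assumes Python's re module; the name is_ipv4 and the test addresses are made up for illustration.

```python
import re

# The octet alternatives cover 0-99, 100-199, 200-249 and 250-255.
IP_PATTERN = re.compile(
    r"(((\d{1,2})|(1\d{2})|(2[0-4]\d)|(25[0-5]))\.){3}"
    r"((\d{1,2})|(1\d{2})|(2[0-4]\d)|(25[0-5]))"
)

def is_ipv4(candidate):
    """Return True if the whole string is a dotted quad with octets 0-255."""
    return IP_PATTERN.fullmatch(candidate) is not None

print(is_ipv4("192.168.0.1"))  # True
print(is_ipv4("256.1.1.1"))    # False
print(is_ipv4("1.2.3"))        # False
```

fullmatch is used rather than search because the short \d{1,2} alternative comes first, so an unanchored search can stop partway through an octet.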
URLs:
https?://[-\w.]+(:\d+)?(/([\w/_.]*)?)?
https?://(\w*:\w*@)?[-\w.]+(:\d+)?(/([\w/_.]*(\?\S+)?)?)?
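The two URL patterns can be tried on sample text as sketched below, assuming Python's re module. The second URL (credentials, port, and query string on example.com) is invented for illustration.

```python
import re

SIMPLE_URL = re.compile(r"https?://[-\w.]+(:\d+)?(/([\w/_.]*)?)?")
FULL_URL = re.compile(
    r"https?://(\w*:\w*@)?[-\w.]+(:\d+)?(/([\w/_.]*(\?\S+)?)?)?"
)

# The simple pattern handles host, optional port and path.
simple_match = SIMPLE_URL.search(
    "See http://www.forta.com/blog for details"
).group()

# The fuller pattern also allows user:password@ and a query string.
full_match = FULL_URL.search(
    "Try https://user:pw@example.com:8080/search/run.cgi?q=regex&page=2 now"
).group()

print(simple_match)
print(full_match)
```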
Email Addresses
(\w+\.)*\w+@(\w+\.)+[A-Za-z]+
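A quick way to exercise the email pattern is sketched below, assuming Python's re module. The helper name looks_like_email and the addresses are hypothetical, and the pattern is a simple shape check, not a full validator of every legal address.

```python
import re

EMAIL_PATTERN = re.compile(r"(\w+\.)*\w+@(\w+\.)+[A-Za-z]+")

def looks_like_email(candidate):
    """Return True if the whole string has the name@domain.tld shape."""
    return EMAIL_PATTERN.fullmatch(candidate) is not None

print(looks_like_email("first.last@example.com"))  # True
print(looks_like_email("user@localhost"))          # False: no dot in the domain
print(looks_like_email("no-at-sign.example.com"))  # False: no @
```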
Hope this helps.