Crucial Concepts Behind Advanced Regular Expressions

By Karthik Viswanathan

Regular expressions (or regex) are a powerful way to traverse large strings in order to find information. They rely on underlying patterns in a string’s structure to work their magic. Unfortunately, simple regular expressions are unable to cope with complex patterns and symbols. To deal with this dilemma, you can use advanced regular expressions.

Below, we present an introduction to advanced regular expressions, with eight commonly used concepts and examples. Each example outlines a simple way to match patterns in complex strings. If you do not yet have experience with basic regular expressions, have a look at this article to get started. The syntax used here matches PHP’s Perl-compatible regular expressions.

1. Greediness/Laziness

By default, all regex repetition operators are greedy: they try to match as much of the string as possible. This is not always the desired effect, and lazy operators solve the problem. They match as little as possible and are written by adding a ‘?’ after the respective greedy operator. Alternatively, the ‘U’ modifier inverts the default, making all repetition operators lazy unless they are followed by a ‘?’. Differentiating between greediness and laziness is key to fully understanding advanced regular expressions.

Greedy Operators

The * operator matches the previous expression 0 or more times. It is a greedy operator. Consider the following expression:

preg_match( '/<h1>.*<\/h1>/', '<h1>This is a heading.</h1><h1>This is another one.</h1>', $matches );

Recall that a . means any character except a new line. The above regular expression is looking for an h1 tag and all of its contents. It uses the . and * operators to constantly match anything inside the tag. This pattern will match:

<h1>This is a heading.</h1><h1>This is another one.</h1>

It returns the whole string. The * operator will continuously match everything — even the middle closing h1 tag — because it is greedy. Matching the whole string is the best it can do.

Lazy Operators

Let’s change the above operator by adding a ‘?’ after it. This will make it lazy:

/<h1>.*?<\/h1>/

The regex now fulfills its duty and matches only the first h1 tag. Another greedy operator with this property is {n,}, which matches the previous expression n or more times. Without a question mark, it looks for the most repetitions possible. With a question mark, it starts from n repetitions and adds more only if the rest of the pattern requires it:

# Set up a String
$str = 'hihi';

# Match it using the greedy {n,} operator
preg_match( '/(hi){1,}/', $str, $matches ); # matches[0] will be 'hihi'

# Match it with the lazy {n,}? operator
preg_match( '/(hi){1,}?/', $str, $matches ); # matches[0] will be 'hi'
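
The three behaviours described in this section can be compared side by side. The following is only a minimal sketch: the variable names ($html, $greedy, $lazy, $ungreedy) are made up for illustration, and the subject string is kept on a single line so that . can run across both headings.

```php
<?php
// Subject string from the article, on one line (no newline between the tags).
$html = '<h1>This is a heading.</h1><h1>This is another one.</h1>';

// Greedy: .* runs on to the last closing tag.
preg_match( '/<h1>.*<\/h1>/', $html, $greedy );

// Lazy: .*? stops at the first closing tag.
preg_match( '/<h1>.*?<\/h1>/', $html, $lazy );

// U modifier: the plain * now behaves lazily without a trailing "?".
preg_match( '/<h1>.*<\/h1>/U', $html, $ungreedy );

echo $greedy[0], "\n";   // the whole string
echo $lazy[0], "\n";     // <h1>This is a heading.</h1>
echo $ungreedy[0], "\n"; // <h1>This is a heading.</h1>
```

The lazy form is usually the safer choice, since the U modifier silently changes the meaning of every quantifier in the pattern.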

2. Back Referencing

What it does

Back referencing is a way to refer to previously matched patterns inside a regular expression. For example, take a look at this simple regex that matches an expression in quotes:

# Set up an array of matches
$matches = array();

# Create a String
$str = "\"This is a 'string'\"";

# Traverse it with regular expressions
preg_match( "/(\"|').*?(\"|')/", $str, $matches );

# Print the whole match
echo $matches[0];

Unfortunately, this will not correctly match the string. Instead, it will print:

"This is a '

This regular expression matches the opening double quote but finds a different type of quote to close it, because it was given the option of picking a double or single quote at the end. To fix this, you can use back referencing. The expressions \1, \2, ..., \9 hold references to previously captured subpatterns. The first matched quote, in this case, is held by \1.

How to Use It

In order to apply this concept to the aforementioned example, use \1 in place of the last quote:

preg_match( '/("|\').*?\1/', $str, $matches );

This will now correctly return:

"This is a 'string'"

Remember that back referencing may also be used by preg_replace. In the replacement string, the $1 ... $n form is preferred over \1 ... \9 (any number of these will work). For example, if you want to replace all paragraph tags with escaped text that represents them, use:

$text = preg_replace( '/<p>(.*?)<\/p>/',
"&lt;p&gt;$1&lt;/p&gt;", $html );

The $1 back reference holds the text inside the paragraph and is being used in the replace pattern itself. This completely valid expression shows an easy way to access matched patterns even while replacing.
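
To see both kinds of back reference at work, here is a small sketch that runs the quote-matching pattern with and without \1, then applies $1 in a replacement. The sample $html value and the names $loose, $strict and $text are assumptions chosen for the example, not part of the original code.

```php
<?php
// The quoted string used earlier in this section.
$str = "\"This is a 'string'\"";

// Without a back reference, the closing quote may be of either kind.
preg_match( '/("|\').*?("|\')/', $str, $loose );

// With \1, the closing quote must equal the opening one.
preg_match( '/("|\').*?\1/', $str, $strict );

echo $loose[0], "\n";  // "This is a '
echo $strict[0], "\n"; // "This is a 'string'"

// In a replacement string, $1 stands for the first captured group.
$html = '<p>First</p><p>Second</p>'; // hypothetical input
$text = preg_replace( '/<p>(.*?)<\/p>/', '&lt;p&gt;$1&lt;/p&gt;', $html );

echo $text, "\n"; // &lt;p&gt;First&lt;/p&gt;&lt;p&gt;Second&lt;/p&gt;
```

Single quotes are used around the replacement here so that PHP never tries to interpolate anything after the dollar sign.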

3. Named Groups

When using multiple back references, a regular expression can quickly become confusing and hard to understand. An alternative way to back reference is by using named groups. A named group is specified by using (?P<name>pattern), where name is the name of the group and pattern is the regular expression in the group itself. The group can then be referred to by (?P=name). For example, consider the following:

/(?P<quote>"|').*?(?P=quote)/

The above expression has the same effect as the previous back reference example, but uses a named group instead. It is also significantly easier to read.

Named groups are also useful when sifting through the array of matches. The name given to a specific pattern is also the key of the corresponding matches array.

preg_match( '/(?P<quote>"|\')/', "'String'", $matches );

# This will print "'"
echo $matches[1];

# This will also print "'", as it is a named group
echo $matches['quote'];

Thus, named groups not only make code easier to read but also organize it.
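
As a rough illustration, the quote-matching pattern can be extended with a second named group so that the text between the quotes is available by name as well. The group name body and the variable $m are invented for this sketch.

```php
<?php
$str = "\"This is a 'string'\"";

// "quote" captures the opening quote; (?P=quote) requires the same quote to close.
// "body" is a hypothetical second group holding the text in between.
preg_match( '/(?P<quote>"|\')(?P<body>.*?)(?P=quote)/', $str, $m );

echo $m['quote'], "\n"; // "
echo $m['body'], "\n";  // This is a 'string'

// Named groups are still numbered, so both keys point at the same text.
var_dump( $m[2] === $m['body'] ); // bool(true)
```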

4. Word Boundaries

Word boundaries are places in a string that come between a word character and a non-word character. The specialty of these boundaries is the fact that they don’t actually match a character. Their length is zero. The \b token matches any word boundary.

Unfortunately, boundaries are so often skimmed over that many do not recognize their real significance. For example, let’s say you want to match the word “import”:

/import/

Watch out! Regular expressions can be tricky. The above expression will also match:

important

You may think it is as simple as adding a space before and after import to prevent these bogus matches:

/ import /

But what about this case?

The trader voted for the import

When import is at the beginning or the end of a string, the modified regex will fail. Thus, splitting this up into cases is required:

/(^import | import | import$)/i

Looking back at our regular expression, it does not take periods or other punctuation into account. Just to match this single word, a regular expression may look like this:

/(^import(:|;|,)? | import(:|;|,)? | import(\.|\?|!)?$)/i

That’s a lot of code to match just a single word. This is why word boundaries are so significant. To accomplish the above statement and many other variations with word boundaries, all that is necessary is:

/\bimport\b/

This will match every case above and more. \b’s flexibility comes from the fact that it matches a zero-length string. All it matches is an imaginary space between two characters. It checks if one of the characters is a non-word character and the other is a word character. If so, it matches it. If the beginning or end of a string is encountered, \b treats it as a non-word character. Because the i in import is still considered a word character, it will match import.

Note that the opposite of \b is \B. This operator matches the position between two word characters or between two non-word characters. Thus, if you would like to match ‘hi’ inside another word, you could use:

/\Bhi\B/
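
The cases discussed in this section can be checked quickly with preg_match, which returns 1 for a match and 0 otherwise. This is only a sketch: the i modifier is added so that a capitalized ‘Import’ is accepted, and the third test sentence and the variable names are made up for the example.

```php
<?php
// \b on both sides: "import" must stand alone as a word.
$word = '/\bimport\b/i';

$subjects = array(
    'important',                       // import is only part of a longer word
    'The trader voted for the import', // import at the very end of the string
    'Import, then export.',            // hypothetical: import followed by a comma
);

foreach ( $subjects as $subject ) {
    echo $subject, ' => ', preg_match( $word, $subject ), "\n";
}

// \B on both sides: "hi" must sit inside a longer word.
$inside = '/\Bhi\B/';
echo 'this => ', preg_match( $inside, 'this' ), "\n";         // 1
echo 'hi there => ', preg_match( $inside, 'hi there' ), "\n"; // 0
```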

5. Atomic Groups

Atomic groups are special regex groups that are non-capturing. They are usually used to increase the efficiency of a regular expression, but may also be applied to eliminate certain matches. An atomic group is specified by using (?>pattern):

/(?>his|this)/

When the regex engine exits an atomic group, it discards all backtracking positions remembered by the tokens inside it. Consider the word ‘smashing’. At each starting position, the engine tries ‘his’ and, if that fails, ‘this’. Neither occurs in ‘smashing’, so no match is found. The atomic group only makes a difference once one alternative has matched: if ‘his’ matches and a later part of the pattern fails, the engine will not step back into the group to try ‘this’ at the same position.

The above example did not have many practical uses. We might as well have used /t?his/ instead. Look at the following:

/\b(engineer|engrave|end)\b/

If the regex engine is given the word ‘engineering’, it will first match ‘engineer’. The word boundary that follows, \b, will not match, so the engine backtracks and moves on to the next alternative: ‘engrave’. It finds that ‘eng’ matches, but the rest does not. Finally, ‘end’ is attempted and also fails. If you look carefully, you will realize that once the engine has matched ‘engineer’ and failed the last word boundary, it cannot possibly match ‘engrave’ or ‘end’ at the same position. The three alternatives are mutually exclusive, so the regex engine should not continue with the other trials.

/\b(?>engineer|engrave|end)\b/

The above is a much better alternative that will save the regex engine time and improve the code’s efficiency.
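
Efficiency gains are hard to show in a few lines, but the other use of atomic groups, eliminating certain matches, is easy to observe. In the sketch below, the patterns /a(bc|b)c/ and /a(?>bc|b)c/ are illustrative assumptions and do not come from the examples above. They differ only in whether the group is atomic.

```php
<?php
// Hypothetical pair of patterns that differ only in the atomic group.
$plain  = '/a(bc|b)c/';
$atomic = '/a(?>bc|b)c/';

// Plain group: "bc" matches, the final "c" fails, the engine backtracks
// into the group, takes "b" instead, and the final "c" then succeeds.
echo preg_match( $plain, 'abc' ), "\n";   // 1

// Atomic group: once "bc" has matched, the alternative "b" is discarded,
// so nothing is left for the final "c" and the match fails.
echo preg_match( $atomic, 'abc' ), "\n";  // 0
echo preg_match( $atomic, 'abcc' ), "\n"; // 1

// The pattern from this section gives the same answers as its non-atomic form.
$words = '/\b(?>engineer|engrave|end)\b/';
echo preg_match( $words, 'engineering' ), "\n"; // 0
echo preg_match( $words, 'the end' ), "\n";     // 1
```

This is why an atomic group is only safe when its alternatives cannot overlap in a way that matters to the rest of the pattern.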

6. Recursion

Recursion in regular expressions can be used to match nested constructs, such as parentheses, (this (that)), and HTML tags, <div></div>. They require the use of (?R), an operator that matches recursive sub-patterns. Consider the regular expression that matches nested parentheses:

/\(((?>[^()]+)|(?R))*\)/

The escaped parentheses at either end of this regular expression match the literal opening and closing parentheses of the nested construct. Between them is a group that matches either a run of non-parenthesis characters, (?>[^()]+), or the whole expression again as a sub-pattern, (?R). This group is repeated as many times as possible to match all nested parentheses.

Another example of recursion at work is the following:

/<([\w]+).*?>((?>[^<>]+)|((?R)))*<\/\1>/

The above expression combines character classes, greedy and lazy operators, back referencing, and atomic groups to match nested tags. The first parenthesized group ([\w]+) captures the tag name for use later in the regular expression. The pattern then proceeds to match the rest of the opening tag. The next parenthesized sub-expression is very similar to the one above: it either matches non-tag characters, (?>[^<>]+), or recurses over another tag, (?R). Finally, the last part of the expression matches the closing tag by referring back to the captured name with \1.
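
The parentheses pattern is the easier of the two to try out. The sketch below runs it against two made-up subject strings (both are assumptions for illustration): one with balanced parentheses and one where the outer parenthesis is never closed.

```php
<?php
// Recursive pattern for nested parentheses, as described in this section.
$nested = '/\(((?>[^()]+)|(?R))*\)/';

// Balanced input: the match spans the outer pair and everything inside it.
preg_match( $nested, 'before (this (that)) after', $balanced );
echo $balanced[0], "\n"; // (this (that))

// Unbalanced input: the outer "(" never closes, so the engine moves on
// and settles for the inner, balanced pair.
preg_match( $nested, 'a (b (c)', $unbalanced );
echo $unbalanced[0], "\n"; // (c)
```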

7. Callbacks

Certain matches in a pattern may require special modifications. In order to apply multiple or complex changes, callbacks can be used. A callback produces dynamic substitution strings in the preg_replace_callback function, which takes a function as a parameter and calls it whenever a match is found. This function receives the match array as a parameter and returns a modified string that is used as the replacement.

As an example, consider a regular expression that capitalizes the first letter of every word in a given string. Unfortunately, PHP does not have a regex operator that changes a character to a different case. To accomplish this task, a callback may be used. First, the expression must match all letters that need to be capitalized:

/\b\w/

The above uses both word boundaries and character classes to work. Now that we have this expression, we can write a callback function:

function upper_case( $matches ) {
	return strtoupper( $matches[0] );
}

upper_case takes in an array of matches and returns the whole matched pattern in uppercase. $matches[0], in this case, represents the letter that needs to be capitalized. All of this can now be put together using the preg_replace_callback function:

preg_replace_callback( '/\b\w/', "upper_case", $str );

That is the power of a simple callback.
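
Put together as one script, the pieces look roughly like this. The input string and the $result variable are assumptions added for the example. The function and the pattern are the ones described above.

```php
<?php
// Callback: receives the match array and returns the replacement text.
function upper_case( $matches ) {
    return strtoupper( $matches[0] );
}

// Hypothetical input string.
$str = 'crucial concepts behind advanced regular expressions';

// \b\w matches the first character of every word, and the callback
// is invoked once per match.
$result = preg_replace_callback( '/\b\w/', 'upper_case', $str );

echo $result, "\n"; // Crucial Concepts Behind Advanced Regular Expressions
```

The callback may also be passed as an anonymous function, which keeps short helpers like this one out of the global namespace.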

8. Commenting

Commenting is not a way to actually match strings, but it is one of the most important parts of regular expressions. As you dive deep into larger, more complex expressions, it becomes hard to decipher what is actually being matched. Using comments in the middle of regular expressions is the perfect way to minimize such confusion.

To place a comment inside a regular expression, use the (?#comment) format. Replace “comment” with the word(s) of your choice:

/(?#digit)\d/

It is especially important to comment regular expressions that you release to the public. Users of your regex will be able to easily understand and modify the pattern to meet their needs. It can even go so far as to help you decode it when revisiting a program.

Consider using the “x” or (?x) modifier for free-spacing mode with comments. This causes a regular expression to ignore white space between tokens and to treat everything from # to the end of the line as a comment. Spaces can still be represented with [ ] or \  (a backslash and a space):

/
\d    #digit
[ ]   #space
\w+   #word
/x

The above is the same as:

/\d(?#digit)[ ](?#space)\w+(?#word)/

Always create well-documented code.
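
In PHP, a free-spacing pattern can be written as a multi-line string. The following sketch checks that the commented and the inline forms behave identically. The subject string and the variable names are made up for the example.

```php
<?php
// Free-spacing form: white space between tokens is ignored and
// everything from # to the end of the line is a comment.
$commented = '/
    \d      # a single digit
    [ ]     # a literal space
    \w+     # one or more word characters
/x';

// The same pattern using inline (?#...) comments.
$inline = '/\d(?#digit)[ ](?#space)\w+(?#word)/';

$subject = 'Room 7 west'; // hypothetical subject string

preg_match( $commented, $subject, $a );
preg_match( $inline, $subject, $b );

echo $a[0], "\n"; // 7 west
echo $b[0], "\n"; // 7 west
```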

Further Resources

Regular-Expressions.info
Comprehensive website on regular expressions
Cheat Sheet
Informative regular expressions cheat sheet
Regex Generator
JavaScript regular expressions generator

About the author

Karthik Viswanathan is a high-school student who loves to program and create websites. You can view Karthik’s work on his blog, Lateral Code, and explore the most popular articles on the Web through his online Twitter application.
