Using Look-ahead and Look-behind

最新推荐文章于 2024-07-16 16:13:34 发布

转载最新推荐文章于 2024-07-16 16:13:34 发布 · 1.5k 阅读

PHP开发专栏收录该内容

40 篇文章

订阅专栏

本文详细介绍了正则表达式中的零宽断言概念及其应用，包括前瞻、后顾断言的语法和常见使用场景，如查找最后一个出现、字符间替换等。

If you are familiar with Perl's regular expressions, you are probably already familiar with zero-width assertions: the ^ indicating the beginning of string and the \b indicating a word boundary are examples. They do not match any characters, but "look around" to see what comes before and/or after the current position.

With the look-ahead and look-behind constructs documented in perlre.html#Extended-Patterns, you can "roll your own" zero-width assertions to fit your needs. You can look forward or backward in the string being processed, and you can require that a pattern match succeed (positive assertion) or fail (negative assertion) there.

Syntax

Every extended pattern is written as a parenthetical group with a question mark as the first character. The notation for the look-arounds is fairly mnemonic, but there are some other, experimental patterns that are similar, so it is important to get all the characters in the right order.

(?= pattern )

is a positive look-ahead assertion

(?! pattern )

is a negative look-ahead assertion

(?<= pattern )

is a positive look-behind assertion

(?<! pattern )

is a negative look-behind assertion

Notice that the = or ! is always last. The directional indicator is only present in the look-behind, and comes before the positive-or-negative indicator.

Common tasks

Finding the last occurrence

There are actually a number of ways to get the last occurrence that don't involve look-around, but if you think of "the last foo" as "foo that isn't followed by a string containing foo", you can express that notion like this:

 
  /foo(?!.*foo)/

 

 
  [download]

The regular expression engine will do its best to match .*foo, starting at the end of the string "foo". If it is able to match that, then the negative look-ahead will fail, which will force the engine to progress through the string to try the next foo.

Substituting before, after, or between characters

Many substitutions match a chunk of text and then replace part or all of it. You can often avoid that by using look-arounds. For example, if you want to put a comma after every foo:

 
  s/(?<=foo)/,/g; # Without lookbehind: s/foo/foo,/g or s/(foo)/$1,/g

 

 
  [download]

or to put the hyphen in look-ahead:

 
  s/(?<=look)(?=ahead)/-/g;

 

 
  [download]

This kind of thing is likely to be the bulk of what you use look-arounds for. It is important to remember that look-behind expressions cannot be of variable length. That means you cannot use quantifiers ( ?, *, +, or {1,5}) or alternation of different-length items inside them.

Matching a pattern that doesn't include another pattern

You might want to capture everything between foo and bar that doesn't include baz. The technique is to have the regex engine look-ahead at every character to ensure that it isn't the beginning of the undesired pattern:

 
  /foo  # Match starting at foo
 (         # Capture
 (?:       # Complex expression:
   (?!baz) #   make sure we're not at the beginning of baz 
   .       #   accept any character
 )*        # any number of times
 )         # End capture
 bar  # and ending at bar
/x;

 

 
  [download]

Nesting

You can put look-arounds inside of other look-arounds. This has been known to induce a flight response in certain readers (me, for example, the first time I saw it), but it's really not such a hard concept. A look-around sub-expression inherits a starting position from the enclosing expression, and can walk all around relative to that position without affecting the position of the enclosing expression. They all have independent (though initially inherited) bookkeeping for where they are in the string.

The concept is pretty simple, but the notation becomes hairy very quickly, so commented regular expressions are recommended. Let's look at the real example of Regex to add space after punctuation sign. The poster wants to put a space after any comma (punctuation, actually, but for simplicity, let's say comma) that is not nestled between two digits. Building up the s/// expression:

 
  s/(?<=,        # after a comma,
    (?!        # but not matching
      (?<=\d,) #   digit-comma before, AND
      (?=\d)   #   digit afterward
    )
  )/ /gx;      # substitute a space

 

 
  [download]

Note that multiple lookarounds can be used to enforce multiple conditions at the same place, like an AND condition that complements the alternation (vertical bar)'s OR. In fact, you can use Boolean algebra ( NOT (a AND b) === (NOT a OR NOT b) ) to convert the expression to use OR:

 
  s/(?<=,        # after a comma, but either
    (?:
      (?<!\d,) #   not matching digit-comma before
      |        #   OR
      (?!\d)   #   not matching digit afterward
    )
  )/ /gx;      # substitute a space

 

 
  [download]

Capturing

It is sometimes useful to use capturing parentheses within a look-around. You might think that you wouldn't be able to do that, since you're just browsing, but you can. But remember: the capturing parentheses must be within the look-around expression; from the enclosing expression's point of view, no actual matching was done by the zero-width look-around.

This is most useful for finding overlapping matches in a global pattern match. You can capture substrings without consuming them, so they are available for further matching later. Probably the simplest example is to get all right-substrings of a string:

 
  print "$1\n" while /(?=(.*))/g;

 

 
  [download]

Note that the pattern technically consumes no characters at all, but Perl knows to advance a character on an empty match, to prevent infinite looping.