Crucial Concepts Behind Advanced Regular Expressions

By Karthik Viswanathan

 

Regular expressions (or regex) are a powerful way to traverse large strings in order to find information. They rely on underlying patterns in a string’s structure to work their magic. Unfortunately, simple regular expressions are unable to cope with complex patterns and symbols. To deal with this dilemma, you can use advanced regular expressions.

Below, we present an introduction to advanced regular expressions, with eight commonly used concepts and examples. Each example outlines a simple way to match patterns in complex strings. If you do not yet have experience with basic regular expressions, have a look at this article to get started. The syntax used here matches PHP’s Perl-compatible regular expressions.

1. Greediness/Laziness

By default, all regex repetition operators are greedy. They try to match as much as possible in a string. Unfortunately, this might not always be the desired effect. Lazy operators solve this problem. They match the smallest possible pattern and are written by adding a ‘?’ after the respective greedy operator. Alternatively, the ‘U’ modifier may be used to make all repetition operators lazy. Differentiating between greediness and laziness is key to fully understanding advanced regular expressions.

Greedy Operators

The * operator matches the previous expression 0 or more times. It is a greedy operator. Consider the following expression:

preg_match( '/<h1>.*<\/h1>/', '<h1>This is a heading.</h1><h1>This is another one.</h1>', $matches );

 

Recall that a . means any character except a new line. The above regular expression is looking for an h1 tag and all of its contents. It uses the . and * operators to constantly match anything inside the tag. This pattern will match:

<h1>This is a heading.</h1><h1>This is another one.</h1>

 

It returns the whole string. The * operator will continuously match everything — even the middle closing h1 tag — because it is greedy. Matching the whole string is the best it can do.

Lazy Operators

Let’s change the above operator by adding a ‘?’ after it. This will make it lazy:

/<h1>.*?<\/h1>/

 

The regex now fulfills its duty and matches only the first h1 tag. Another greedy operator that uses this same property is {n,}. This matches the previous expression n or more times. If it is used without a question mark, it looks for the most repetitions possible. Otherwise, it starts from n repetitions:

# Set up a String
$str = 'hihi';

# Match it using the greedy {n,} operator
preg_match( '/(hi){1,}/', $str, $matches ); # matches[0] will be 'hihi'

# Match it with the lazy {n,}? operator
preg_match( '/(hi){1,}?/', $str, $matches ); # matches[0] will be 'hi'
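The three forms discussed in this section (greedy, lazy, and the ‘U’ modifier) can be compared side by side. The following is a minimal sketch; the variable names $html, $greedy, $lazy and $ungreedy are chosen for illustration and are not part of the original examples.

```php
<?php
// Two headings on a single line, as in the greedy example.
$html = '<h1>This is a heading.</h1><h1>This is another one.</h1>';

// Greedy: .* runs on to the last closing tag.
preg_match( '/<h1>.*<\/h1>/', $html, $greedy );

// Lazy: .*? stops at the first closing tag.
preg_match( '/<h1>.*?<\/h1>/', $html, $lazy );

// The U modifier inverts greediness, so a plain .* behaves lazily.
preg_match( '/<h1>.*<\/h1>/U', $html, $ungreedy );

echo $greedy[0], "\n";   // the whole string
echo $lazy[0], "\n";     // <h1>This is a heading.</h1>
echo $ungreedy[0], "\n"; // <h1>This is a heading.</h1>
```

Note that under the ‘U’ modifier a trailing ‘?’ switches an operator back to greedy, so mixing the two styles in one pattern can be confusing.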

 

2. Back Referencing

What it does

Back referencing is a way to refer to previously matched patterns inside a regular expression. For example, take a look at this simple regex that matches an expression in quotes:

# Set up an array of matches
$matches = array();

# Create a String
$str = "\"This is a 'string'\"";

# Traverse it with regular expressions
preg_match( "/(\"|').*?(\"|')/", $str, $matches );

# Print the whole match
echo $matches[0];

 

Unfortunately, this will not correctly match the string. Instead, it will print:

"This is a '

 

This regular expression matches the opening double quote but finds a different type of quote to close it. This is because it was given the option of picking a double or single quote at the end. In order to fix this, you can use back referencing. The expressions \1, \2, …, \9 hold references to previously captured subpatterns. The first matched quote, in this case, will be held by \1.

How to Use It

In order to apply this concept to the aforementioned example, use \1 in place of the last quote:

preg_match( '/("|\').*?\1/', $str, $matches );

 

This will now correctly return:

"This is a 'string'"

 

Remember that back referencing may also be used by preg_replace. Note that in the replacement string you should use $1 … $9 … $n instead of \1 … \9 (any number of these will work). For example, if you want to replace all paragraph tags with text that represents them, use:

$text = preg_replace( '/<p>(.*?)<\/p>/',
"&lt;p&gt;$1&lt;/p&gt;", $html );

 

The $1 back reference holds the text inside the paragraph and is being used in the replace pattern itself. This completely valid expression shows an easy way to access matched patterns even while replacing.
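To see the replacement at work, here is a small runnable sketch. The sample markup in $html is made up for this illustration; the pattern and replacement string are the ones discussed above.

```php
<?php
// Sample markup, invented for this example.
$html = '<p>First paragraph.</p><p>Second one.</p>';

// $1 in the replacement refers to the text captured by (.*?).
// Single quotes keep PHP from treating $1 as anything special.
$text = preg_replace( '/<p>(.*?)<\/p>/', '&lt;p&gt;$1&lt;/p&gt;', $html );

echo $text, "\n";
// &lt;p&gt;First paragraph.&lt;/p&gt;&lt;p&gt;Second one.&lt;/p&gt;
```

Because the group is lazy, each paragraph is replaced on its own rather than the whole string being treated as one match.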

3. Named Groups

When using multiple back references, a regular expression can quickly become confusing and hard to understand. An alternative way to back reference is by using named groups. A named group is specified by using (?P<name>pattern), where name is the name of the group and pattern is the regular expression in the group itself. The group can then be referred to by (?P=name). For example, consider the following:

/(?P<quote>"|').*?(?P=quote)/

 

The above expression will create the same effect as the previous back reference example, but by instead using named groups. It is also significantly easier to read.

Named groups are also useful when sifting through the array of matches. The name given to a specific pattern is also the key of the corresponding matches array.

preg_match( '/(?P<quote>"|\')/', "'String'", $matches );

# This will print "'"
echo $matches[1];

# This will also print "'", as it is a named group
echo $matches['quote'];

 

Thus, named groups not only make code easier to read but also organize it.
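Both uses of a named group, as a back reference inside the pattern and as a key in the matches array, can be combined in one short sketch. The sample sentence in $str is an assumption made for this example.

```php
<?php
// Sample sentence with a double-quoted phrase containing an apostrophe.
$str = "He said \"it's fine\" and left.";

// (?P<quote>...) captures the opening quote; (?P=quote) requires the same one to close.
preg_match( '/(?P<quote>"|\').*?(?P=quote)/', $str, $matches );

echo $matches[0], "\n";       // "it's fine"
echo $matches['quote'], "\n"; // "
echo $matches[1], "\n";       // " (the same group, by number)
```

The apostrophe in “it's” does not end the match, because the named back reference only accepts the quote character that opened it.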

4. Word Boundaries

Word boundaries are places in a string that come between a word character and a non-word character. The specialty of these boundaries is the fact that they don’t actually match a character. Their length is zero. The \b regular expression matches any word boundary.

Unfortunately, boundaries are so often skimmed over that many do not recognize their real significance. For example, let’s say you want to match the word “import”:

/import/

 

Watch out! Regular expressions can be tricky. The above expression will also match:

important

 

You may think it is as simple as adding a space before and after import to prevent these bogus matches:

/ import /

 

But what about this case?

The trader voted for the import

 

When import is at the beginning or the end of a string, the modified regex will fail. Thus, splitting this up into cases is required:

/(^import | import | import$)/i

 

Looking back at our regular expression, it does not take periods or other punctuation into account. Just to match this single word, a regular expression may look like this:

/(^import(:|;|,)? | import(:|;|,)? | import(\.|\?|!)?$)/i

That’s a lot of code to match just a single word. This is why word boundaries are so significant. To cover the above cases and many other variations with word boundaries, all that is necessary is:

/\bimport\b/

 

This will match every case above and more. The flexibility of \b comes from the fact that it matches a zero-length string. All it matches is an imaginary space between two characters. It checks if one of the characters is a non-word character and the other is a word character. If so, it matches. If the beginning or end of a string is encountered, \b treats it as a non-word character. Because the i in import is still considered a word character, it will match import.
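The cases discussed above can be checked with a short sketch. The sample sentences in $samples are partly invented for illustration, and the ‘i’ modifier is added here so that a capitalized ‘Import’ is found as well.

```php
<?php
// Sample inputs: three that contain the word, one that only contains "important".
$samples = array(
    'The trader voted for the import',
    'Import: goods from abroad',
    'Will they import, or export?',
    'This is important',
);

$results = array();
foreach ( $samples as $sample ) {
    // preg_match returns 1 if the pattern matches and 0 if it does not.
    $results[] = preg_match( '/\bimport\b/i', $sample );
}

print_r( $results ); // 1, 1, 1, 0
```

The last sentence fails because the position between the ‘t’ and the ‘a’ of ‘important’ lies between two word characters, so the second \b cannot match there.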

Note that the opposite of \b is \B. This operator will match the space in-between two word or two non-word characters. Thus, if you would like to match ‘hi’ inside another word, you could use:

/\Bhi\B/
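A brief sketch of \B in action follows; the test words are chosen for illustration only.

```php
<?php
// 'hi' surrounded by word characters on both sides.
$inside = preg_match( '/\Bhi\B/', 'this' );

// 'hi' standing alone: both ends sit on word boundaries, so \B fails.
$alone = preg_match( '/\Bhi\B/', 'hi' );

// 'hi' at the start of a string: the left side is a boundary, so \B fails.
$leading = preg_match( '/\Bhi\B/', 'hi there' );

echo $inside, $alone, $leading, "\n"; // 100
```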

 

5. Atomic Groups

Atomic groups are special regex groups that are non-capturing. They are usually used to increase the efficiency of a regular expression, but may also be applied to eliminate certain matches. An atomic group is specified by using (?>pattern):

/(?>his|this)/

 

When the regex engine leaves an atomic group after matching it, it discards the backtracking positions remembered by all tokens inside it. Consider the word ‘smashing’. Using the above regular expression, the regex engine will try ‘his’ and then ‘this’ at each position in ‘smashing’. It will not find a match, exactly as it would not with the plain group (his|this). The atomic group only makes a difference once one of its alternatives has matched: if a later part of the pattern then fails, the engine will not go back into the group to try the remaining alternatives.

The above example does not have many practical uses. We might as well have used /t?his/ instead. Look at the following:

/\b(engineer|engrave|end)\b/

 

If the regex engine is given the word ‘engineering’, it will first match ‘engineer’. The next word boundary, \b, will not match. Thus, it will move on to the next alternative: engrave. It finds that ‘eng’ matches, but the rest does not. Finally, ‘end’ is attempted and also fails. If you look carefully, you will realize that once the engine matches ‘engineer’ and fails the last word boundary, it cannot possibly match ‘engrave’ or ‘end’. The three alternatives are mutually exclusive: a word that begins with ‘engineer’ cannot also begin with ‘engrave’ or ‘end’, and thus the regex engine should not continue with the other trials.

/\b(?>engineer|engrave|end)\b/

 

The above is a much better alternative that will save the regex engine time and improve the code’s efficiency.
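Atomic groups can also change what a pattern matches, not just how fast it runs. The sketch below uses two patterns built on the alternatives ‘end’ and ‘ending’; these patterns are assumptions made for this illustration and do not appear in the examples above.

```php
<?php
// Plain group: after 'end' matches and \b fails, the engine backtracks and tries 'ending'.
$plain = preg_match( '/\b(end|ending)\b/', 'ending' );

// Atomic group: once 'end' has matched, the alternative 'ending' is thrown away,
// so the failing \b can no longer be rescued.
$atomic = preg_match( '/\b(?>end|ending)\b/', 'ending' );

// With mutually exclusive alternatives the result is the same either way.
$engineering = preg_match( '/\b(?>engineer|engrave|end)\b/', 'engineering' );

echo $plain, $atomic, $engineering, "\n"; // 100
```

This is why atomic groups are safest when, as in the engineer/engrave/end pattern, no alternative is a prefix of another one.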

6. Recursion

Recursion in regular expressions can be used to match nested constructs, such as parentheses, (this (that)), and HTML tags, <div></div>. It requires the use of (?R), an operator that matches the whole pattern again recursively. Consider the regular expression that matches nested parentheses:

/\(((?>[^()]+)|(?R))*\)/

 

The outermost parentheses in this regular expression match the beginning of the nested constructs. Then comes an optional operator, which can either match non-parenthetical characters (?>[^()]+) or the whole expression again in a sub-pattern, (?R). Notice that this operator is repeated as many times as possible to match all nested parentheses.
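The nested-parentheses pattern can be tried out with a short sketch. The input strings are invented for illustration.

```php
<?php
$pattern = '/\(((?>[^()]+)|(?R))*\)/';

// Balanced input: the match spans the outermost pair and everything nested in it.
preg_match( $pattern, 'f(a (b (c)) d) e', $balanced );
echo $balanced[0], "\n"; // (a (b (c)) d)

// Unbalanced input: the first '(' never finds its partner,
// so the engine settles for the balanced inner pair.
preg_match( $pattern, '((a)', $inner );
echo $inner[0], "\n"; // (a)
```

$matches[0] holds the full balanced text; the numbered group only holds the last piece matched by the repeated sub-pattern, so it is rarely useful here.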

Another example of recursion at work is the following:

/<([\w]+).*?>((?>[^<>]+)|((?R)))*<\/\1>/

The above expression combines character classes, greedy operators, back referencing, and atomic groups to match nested tags. The first parenthesized group ([\w]+) matches the tag name for use later in the regular expression. It then proceeds to match the rest of the tag. The next parenthesized sub-expression is very similar to the one above. It either matches non-tag (?>[^<>]+) characters or recurses over another tag (?R). Finally, the last part of the expression matches the close tag.

7. Callbacks

Certain matches in a pattern may require special modifications. In order to apply multiple or complex changes, callbacks can be used. A callback produces dynamic substitution strings in the preg_replace_callback function, which takes a function as a parameter to call whenever a match is found. This function receives the match array as a parameter and returns a modified string that is used as the replacement.

As an example, consider a regular expression that capitalizes the first letter of every word in a given string. Unfortunately, PHP does not have a regex operator that changes a character to a different case. To accomplish this task, a callback may be used. First, the expression must match all letters that need to be capitalized:

/\b\w/

 

The above uses both word boundaries and character classes to work. Now that we have this expression, we can write a callback function:

function upper_case( $matches ) {
	return strtoupper( $matches[0] );
}

 

upper_case takes in an array of matches and returns the whole matched pattern in uppercase. $matches[0], in this case, represents the letter that needs to be capitalized. All of this can now be put together using the preg_replace_callback function:

preg_replace_callback( '/\b\w/', "upper_case", $str );

 

That is the power of a simple callback.
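Put together as one runnable script, the callback example might look like the sketch below. The sample value of $str and the variable $result are assumptions made for this illustration.

```php
<?php
// Callback: receives the match array and returns the replacement string.
function upper_case( $matches ) {
    return strtoupper( $matches[0] );
}

// Sample input, invented for this example.
$str = 'advanced regular expressions';

// /\b\w/ matches the first word character after each word boundary.
$result = preg_replace_callback( '/\b\w/', 'upper_case', $str );

echo $result, "\n"; // Advanced Regular Expressions
```

An anonymous function can be passed in place of the function name when the callback is only needed once.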

8. Commenting

Commenting is not a way to actually match strings, but it is one of the most important parts of regular expressions. As you dive deep into larger, more complex expressions, it becomes hard to decipher what is actually being matched. Using comments in the middle of regular expressions is the perfect way to minimize such confusion.

To place a comment inside a regular expression, use the (?#comment) format. Replace “comment” with the word(s) of your choice:

/(?#digit)\d/

 

It is especially important to comment regular expressions that you release to the public. Users of your regex will be able to easily understand and modify the pattern to meet their needs. It can even go so far as to help you decode it when revisiting a program.

Consider using the “x” or (?x) modifier for free-spacing mode with comments. This causes a regular expression to ignore white space between tokens and lets a # start a comment that runs to the end of the line. All spaces can still be represented with [ ] or \  (a backslash and a space):

/
\d    #digit
[ ]   #space
\w+   #word
/x

 

The above is the same as:

/\d(?#digit)[ ](?#space)\w+(?#word)/
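That the two commenting styles behave identically can be confirmed with a small sketch. The sample string in $str and the variable names are chosen for illustration.

```php
<?php
// Free-spacing version: white space is ignored and # starts a comment.
$freeSpacing = '/
    \d    # digit
    [ ]   # space
    \w+   # word
/x';

// Inline-comment version of the same pattern.
$inline = '/\d(?#digit)[ ](?#space)\w+(?#word)/';

// Sample input, invented for this example.
$str = 'Room 7 apples';

preg_match( $freeSpacing, $str, $a );
preg_match( $inline, $str, $b );

echo $a[0], "\n"; // 7 apples
echo $b[0], "\n"; // 7 apples
```

Single-quoted PHP strings leave \d and \w untouched, which keeps multi-line patterns like this one readable.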

 

Always create well-documented code.

About the author

Karthik Viswanathan is a high-school student who loves to program and create websites. You can view Karthik’s work on his blog, Lateral Code, and explore the most popular articles on the Web through his online Twitter application.
