Regular Expressions (3): Extension Notation

Latest recommended article published 2025-05-16 09:47:44

License: CC 4.0 BY-SA

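This article introduces the extension notation of regular expressions, including the flag that makes the dot match any character and the flag that ignores whitespace to improve readability. It also covers non-capturing groups, named groups, and pattern reuse, as well as positive and negative lookahead assertions.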

Extension notation

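Flag	Description
`re.I`, `re.IGNORECASE`	Case-insensitive matching
`re.L`, `re.LOCALE`	Matching of \w, \W, \b, \B, \s, \S depends on the current locale
`re.M`, `re.MULTILINE`	^ and $ match at the start and end of each line in the target string, rather than only at the start and end of the whole string
`re.S`, `re.DOTALL`	The dot (.) normally matches any single character except \n (newline); with this flag the dot matches every character, including \n
`re.X`, `re.VERBOSE`	Whitespace and # (together with the rest of the text on that line) are ignored unless escaped with a backslash or placed inside a character class; this allows comments and improves readability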

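With the (?iLmsux) family of options, you can specify one or more flags directly inside the regular expression instead of passing them to compile() or other re module functions.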
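
As a minimal sketch of that equivalence, the following compares the inline form with the flag-argument form. The sample strings are reused from this article's examples; the variable names are made up for illustration.

```python
import re

text = "The quickest way is through this tunnel."

# Inline form: the flag is written inside the pattern itself.
inline_matches = re.findall(r"(?i)th\w+", text)

# Flag-argument form: the same flag is passed to re.compile().
compiled = re.compile(r"th\w+", re.IGNORECASE)
flag_matches = compiled.findall(text)

# Several flags can be combined with | (here IGNORECASE and MULTILINE).
multi = re.compile(r"^th[\w ]+", re.I | re.M)
lines = multi.findall("This line is the first,\nanother line,\nthat line, it's the best\n")

print(inline_matches)  # ['The', 'through', 'this']
print(flag_matches)    # ['The', 'through', 'this']
print(lines)           # ['This line is the first', 'that line']
```

The inline form is handy when the pattern is stored as plain text (for example in a config file), while the flag argument keeps the pattern itself shorter.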

re.I/IGNORECASE and re.M/MULTILINE
Example code:

>>> import re
>>> re.findall(r'(?i)yes','yes? Yes. Yes!')
['yes', 'Yes', 'Yes']
>>> re.findall(r'(?i)th\w+', 'The quickest way is through this tunnel.')
['The', 'through', 'this']
>>> re.findall(r'(?im)(^th[\w ]+)', """
... This line is the first,
... another line,
... that line, it's the best
... """)
['This line is the first', 'that line']

re.S/DOTALL
With this flag, the dot (.) also matches \n; without it, the dot matches every character except \n:

>>> re.findall(r'th.+', '''
... The first line
... the second line
... the third line
... ''')
['the second line', 'the third line']
>>> re.findall(r'(?s)th.+','''
... The first line
... the second line
... the third line
... ''')
['the second line\nthe third line\n']
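
A typical place where this matters in crawler code is extracting content that spans several lines. The sketch below uses a made-up HTML fragment (not taken from any real page) and passes re.DOTALL as a flag argument instead of writing (?s) inline.

```python
import re

# Hypothetical HTML fragment: the first paragraph contains a newline.
html = "<p>first\nparagraph</p>\n<p>second</p>"

# Without DOTALL the dot stops at \n, so the two-line paragraph is missed.
default = re.findall(r"<p>(.+?)</p>", html)

# With DOTALL the dot also matches \n, so both paragraphs are captured.
dotall = re.findall(r"<p>(.+?)</p>", html, re.DOTALL)

print(default)  # ['second']
print(dotall)   # ['first\nparagraph', 'second']
```

The non-greedy .+? keeps each match inside a single pair of tags; with a greedy .+ and DOTALL, the match would run to the last closing tag.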

re.X/VERBOSE
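This flag lets you write more readable regular expressions: whitespace in the pattern is ignored, so the pattern can be spread over several lines and commented. Example code:

>>> re.search(r'''(?x)
...     \((\d{3})\)  # area code
...     [ ]           # whitespace
...     (\d{3})      # prefix
...     -             # dash
...     (\d{4})      # ending digits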
... ''', '(800) 555-1212').groups()
('800', '555', '1212')
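
The sketch below illustrates the two ways of keeping a literal space or # in a verbose pattern: escaping it with a backslash, or putting it inside a character class. The "key = value #" input format is a hypothetical example chosen only to exercise both cases.

```python
import re

pattern = re.compile(r"""
    (\w+)    # key
    \ =\     # literal space, equals sign, literal space (spaces escaped)
    (\w+)    # value
    [ ]      # literal space, kept because it is inside a character class
    [#]      # literal hash sign, protected by the character class
""", re.VERBOSE)

groups = pattern.match("color = red #").groups()

# Unescaped spaces are dropped, so this pattern is really (\w+)=(\w+).
unescaped = re.compile(r"(\w+) = (\w+)", re.VERBOSE)
no_match = unescaped.match("color = red")
tight_match = unescaped.match("color=red")

print(groups)                # ('color', 'red')
print(no_match)              # None
print(tight_match.groups())  # ('color', 'red')
```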

The (?:...), (?P<name>), and (?P=name) notations

The (?:...) notation groups part of a regular expression without saving that group for later retrieval or use.

>>> re.findall(r'http://(?:\w+\.)*(\w+\.com)',
...     'http://google.com http://www.google.com http://code.google.com')
['google.com', 'google.com', 'google.com']
>>> re.search(r'\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-(?:\d{4})',
...     '(800) 555-1212').groupdict()
{'areacode': '800', 'prefix': '555'}
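
To see what the non-capturing group changes, this sketch runs the same URL pattern twice, once with a capturing subdomain group and once with (?:...). The variable names are illustrative only.

```python
import re

urls = "http://google.com http://www.google.com http://code.google.com"

# Capturing subdomain group: findall returns one tuple per match,
# with one item for each group in the pattern.
with_capture = re.findall(r"http://(\w+\.)*(\w+\.com)", urls)

# Non-capturing subdomain group: only one group remains,
# so findall returns a flat list of strings.
without_capture = re.findall(r"http://(?:\w+\.)*(\w+\.com)", urls)

print(with_capture)
print(without_capture)  # ['google.com', 'google.com', 'google.com']
```

Using (?:...) for groups that exist only for repetition or alternation keeps the group numbering and the findall() result limited to the parts you actually want.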

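The (?P<name>) and (?P=name) notations can be used together; named groups can also be referenced in a replacement string with \g<name>:

>>> re.sub(r'\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-(?:\d{4})',
...     r'(\g<areacode>) \g<prefix>-xxxx', '(800) 555-1212')
'(800) 555-xxxx'

(?P<name>) saves a match under a name identifier instead of an incrementing number from 1 to N. Matches saved by number can be retrieved with \1, \2, ..., \N; matches saved by name can be retrieved in a similar style with \g<name>.
With (?P=name), you can match again, later in the same regular expression, the text that the named group matched, without writing out the pattern a second time.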

>>> bool(re.match(r'''(?x)
...
...     # match (800) 555-1212, save areacode, prefix, no
...     \((?P<areacode>\d{3})\)[ ](?P<prefix>\d{3})-(?P<number>\d{4})
...
...     # space
...     [ ]
...
...     # match 800-555-1212
...     (?P=areacode)-(?P=prefix)-(?P=number)
...
...     # space
...     [ ]
...
...     # match 18005551212
...     1(?P=areacode)(?P=prefix)(?P=number)
...
... ''', '(800) 555-1212 800-555-1212 18005551212'))
True
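
For comparison, the sketch below puts numbered and named references side by side, both in a replacement string and inside the pattern itself. The doubled-word sentence is a made-up input for illustration.

```python
import re

phone = "(800) 555-1212"

# Numbered groups: \1 and \2 in the replacement refer to groups 1 and 2.
numbered = re.sub(r"\((\d{3})\) (\d{3})-\d{4}", r"\1-\2-xxxx", phone)

# Named groups: \g<name> in the replacement refers to the group with that name.
named = re.sub(r"\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-\d{4}",
               r"\g<areacode>-\g<prefix>-xxxx", phone)

# Inside the pattern, \1 and (?P=name) both mean "the same text again".
sentence = "it is is fine"
doubled_numbered = re.search(r"\b(\w+) \1\b", sentence).group(1)
doubled_named = re.search(r"\b(?P<word>\w+) (?P=word)\b", sentence).group("word")

print(numbered)          # 800-555-xxxx
print(named)             # 800-555-xxxx
print(doubled_numbered)  # is
print(doubled_named)     # is
```

Named references tend to stay correct when groups are added or reordered, which is the main reason to prefer them in longer patterns.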

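(?=...) and (?!...)
The (?=...) and (?!...) notations perform a lookahead match in the target string without consuming the characters they examine. The former is a positive lookahead assertion, the latter a negative lookahead assertion.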

>>> re.findall(r'\w+(?= van Rossum)',
... '''
...     Guido van Rossum
...     Tim Peters
...     Just van Rossum
... ''')
['Guido', 'Just']

>>> re.findall(r'(?m)^\s+(?!noreply|postmaster)(\w+)',
... '''
...     sales@phptr.com
...     postmaster@phptr.com
...     eng@phptr.com
...     noreply@phptr.com
...     admin@phptr.com
... ''')
['sales', 'eng', 'admin']

>>> ['%s@aw.com' % e.group(1) for e in \
... re.finditer(r'(?m)^\s+(?!noreply|postmaster)(\w+)',
... '''
...     sales@phptr.com
...     postmaster@phptr.com
...     eng@phptr.com
...     noreply@phptr.com
...     admin@phptr.com
... ''')]
['sales@aw.com', 'eng@aw.com', 'admin@aw.com']
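
The sketch below shows what "without consuming" means in practice: the match ends before the lookahead text, which remains available for whatever comes next. It reuses the names from the example above; the variable names are illustrative.

```python
import re

text = "Guido van Rossum"

# The lookahead checks that " van Rossum" follows, but the match stops before it.
m = re.search(r"\w+(?= van Rossum)", text)
matched = m.group()
end = m.end()

# Nothing after the first name was consumed, so the rest of the string is intact.
rest = text[end:]

# Negative lookahead: keep only names NOT followed by " van Rossum".
# The \b stops the regex from backtracking to a shorter prefix such as "Guid".
names = ["Guido van Rossum", "Tim Peters", "Just van Rossum"]
others = [n for n in names if re.match(r"\w+\b(?! van Rossum)", n)]

print(matched)     # Guido
print(end)         # 5
print(repr(rest))  # ' van Rossum'
print(others)      # ['Tim Peters']
```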