Python Regular Expressions

KILL_USR

Tags: python openstack

Article link: https://blog.youkuaiyun.com/shajc0504/article/details/39611063

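This article introduces regular expressions in Python, covering grouping with parentheses, extension notation, and the features of Python's re module. It explains how parenthesized groups are used to match floating-point numbers and personal names, what the extension notations that begin with a question mark mean, and how the re module's match, search, and compile functions work. It also walks through practical use of findall, finditer, and related methods.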

Reference: Core Python Applications Programming, Third Edition

1. Symbols and Characters

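The basic syntax of regular expressions is the same across languages. Individual languages may add advanced features, and those features and their syntax can differ.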

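Notation	Description	Example
Symbols
literal	Matches the literal string value	foo
re1|re2	Matches regular expression re1 or re2	foo|bar
.	Matches any single character (except "\n")	b.b
^	Matches the start of the string	^Dear
$	Matches the end of the string	/bin/*sh$
*	Matches 0 or more occurrences of the preceding regex or character	[A-Za-z0-9]*
+	Matches 1 or more occurrences of the preceding regex or character	[a-z]+\.com
?	Matches 0 or 1 occurrence of the preceding regex or character	goo?
{N}	Matches exactly N occurrences of the preceding regex or character	[0-9]{3}
{M,N}	Matches from M to N occurrences of the preceding regex or character	[0-9]{5,9}
[...]	Matches any single character from the character class	[aeiou]
[..x-y..]	Matches any single character in the range x to y	[0-9], [A-Za-z]
[^...]	Matches any single character that is not in the character class	[^aeiou], [^A-Za-z0-9_]
(*|+|?|{})?	Applies the "non-greedy" version of the repetition symbols above (*, +, ?, {})	.*?[a-z]
(...)	Matches the enclosed regex and saves it as a subgroup	([0-9]{3})?, f(oo|u)bar
Special Characters
\d	Matches any decimal digit, same as [0-9] for ASCII text; \D is the inverse	data\d+.txt
\w	Matches any alphanumeric character, same as [A-Za-z0-9_] for ASCII text; \W is the inverse	[A-Za-z_]\w+
\s	Matches any whitespace character, same as [ \n\t\r\v\f]; \S is the inverse	of\sthe
\b	Matches any word boundary; \B is the inverse	\bThe\b
\N	Matches saved subgroup N (see (...) above)	price: \16
\c	Matches the special character c verbatim (that is, strips its special meaning and matches it literally)	\., \\, \*
\A (\Z)	Matches the start (end) of the string (see ^ and $ above)	\ADear
Extension Notation
(?iLmsux)	Embeds one or more special "flags" in the regex itself (rather than passing them via a function/method argument)	(?x), (?im)
(?:...)	A group whose match is not saved for later retrieval or use	(?:\w+\.)*
(?P<name>...)	Like a regular group, but identified by name rather than by numeric ID	(?P<data>)
(?P=name)	Matches the same text previously matched by the (?P<name>) group	(?P=data)
(?#...)	A comment; everything inside is ignored	(?#comment)
(?=...)	Succeeds if ... comes next in the string, without consuming any of the string; called a positive lookahead assertion	(?=.com)
(?!...)	Succeeds if ... does not come next in the string; called a negative lookahead assertion	(?!.net)
(?<=...)	Succeeds if ... appears immediately before the current position; called a positive lookbehind assertion	(?<=800-)
(?<!...)	Succeeds if ... does not appear immediately before the current position; called a negative lookbehind assertion	(?<!192\.168\.)
(?(id/name)Y|N)	Conditionally matches regex Y if the group with the given id or name matched, otherwise N; |N is optional	(?(1)y|x)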
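
To make a few rows of the table concrete, here is a minimal sketch that exercises {N}, \b, and the \N backreference. The sample strings and variable names are made up for illustration and do not come from the book.

```python
import re

# {N}: exactly three digits in a row
three_digits = re.findall(r'[0-9]{3}', 'tel 555 1234')

# \b: match "The" only as a whole word
whole_words = re.findall(r'\bThe\b', 'The Theme of The')

# \N: \1 must repeat whatever subgroup 1 matched
repeated = re.search(r'(\w+) \1', 'hello hello world')

print(three_digits)      # ['555', '123']
print(whole_words)       # ['The', 'The']
print(repeated.group())  # hello hello
```

Note that raw strings (r'...') are used so that backslashes reach the regex engine unchanged.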

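Grouping with Parentheses

Main uses: grouping regular expressions and matching subgroups.

\d+(\.\d*)? represents a simple floating-point number: one or more decimal digits followed by an optional decimal point and any further digits, such as "0.004", "2", and "75.".

(Mr?s?\. )?[A-Z][a-z]* [ A-Za-z-]+ matches a first and last name. The first name must start with an uppercase letter, and the full name may be preceded by an optional title such as "Mr.", "Mrs.", "Ms.", or "M.".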
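
The two patterns above can be checked with a short sketch. It assumes the name pattern is written with no space between [A-Z] and [a-z]*, and it uses fullmatch() (Python 3.4+) so that the whole string has to match. The constant names and test names are illustrative only.

```python
import re

FLOAT_RE = re.compile(r'\d+(\.\d*)?')
NAME_RE = re.compile(r'(Mr?s?\. )?[A-Z][a-z]* [ A-Za-z-]+')

# Every sample from the text should be accepted as a simple float.
floats_ok = [bool(FLOAT_RE.fullmatch(s)) for s in ('0.004', '2', '75.')]

# Subgroup 1 of the float pattern holds the optional fractional part.
fraction = FLOAT_RE.fullmatch('0.004').group(1)

# Subgroup 1 of the name pattern holds the optional title.
title = NAME_RE.fullmatch('Mrs. Mary Smith-Jones').group(1)

# A lowercase first letter is rejected.
rejected = NAME_RE.fullmatch('mary smith')

print(floats_ok)   # [True, True, True]
print(fraction)    # .004
print(repr(title)) # 'Mrs. '
print(rejected)    # None
```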

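Extension Notation

This covers the extension notations that begin with a question mark, (?...). For example, (?P<name>...) defines a named group.

(?:\w+\.)*    Strings ending with a dot, such as "google.", "twitter.", "facebook.", but these matches are not saved for later use or retrieval.
(?#comment)     Just a comment.
(?=.com)        Matches only if ".com" follows; the rest of the string is not consumed.
(?!.net) Matches only if ".net" does not follow.
(?<=800-)   Matches only if the current position is preceded by "800-".
(?<!192\.168\.) Matches only if the current position is not preceded by "192.168.".
(?(1)y|x)       If group 1 matched, match y; otherwise match x.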
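
The lookbehind and conditional forms are easy to misread, so here is a hedged sketch of each. The phone numbers and the (?:(x)|y)(?(1)y|x) pattern are illustrative inputs chosen for this example, not taken from the article.

```python
import re

# (?<=...): keep only the number that is preceded by "800-"
after_800 = re.findall(r'(?<=800-)\d{3}-\d{4}',
                       'call 800-555-1212 or 888-555-1313')

# (?<!...): keep only numbers that are not preceded by a minus sign
unsigned = re.findall(r'(?<!-)\b\d+', '10 -20 30')

# (?(1)y|x): if group 1 (the "x") matched, require "y" next, else require "x"
COND_RE = re.compile(r'(?:(x)|y)(?(1)y|x)')
cond = {s: bool(COND_RE.fullmatch(s)) for s in ('xy', 'yx', 'xx', 'yy')}

print(after_800)  # ['555-1212']
print(unsigned)   # ['10', '30']
print(cond)       # {'xy': True, 'yx': True, 'xx': False, 'yy': False}
```

None of the assertions consume text: the lookbehind only inspects what precedes the current position.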

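2. Python and Regular Expressions

Python supports regular expressions through the re module. The re module supports the more powerful and more regular Perl-style regular expressions, allows multiple threads to share the same compiled regex objects, and supports named subgroups.

The re Module

The re module provides the core regex functions and methods. Many of the functions are also available as methods of compiled regular expression objects (regex objects and match objects). We start with the match() and search() functions and methods, and the compile() function.

Function available only in the re module
compile(pattern, flags=0) Compiles the regex pattern, with optional flags, and returns a regex object.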
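
As a minimal sketch of compile(): the pattern is compiled once with its flags, and the resulting regex object then exposes match(), search(), and findall() as methods. The name WORD_RE and the sample strings are assumptions made for this example.

```python
import re

# Compile once; flags are supplied at compile time.
WORD_RE = re.compile(r'[a-z]+', re.IGNORECASE)

first = WORD_RE.match('Hello world')      # method form of re.match()
found = WORD_RE.search('123 abc')         # method form of re.search()
words = WORD_RE.findall('One two THREE')  # method form of re.findall()

print(first.group())  # Hello
print(found.group())  # abc
print(words)          # ['One', 'two', 'THREE']
```

Precompiling mostly pays off when the same pattern is used many times, for example inside a loop.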

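re module functions and regex object methods

match(pattern, string, flags=0) Tries to match the regex at the start of the string, with optional flags; returns a match object on success, otherwise None.
search(pattern, string, flags=0) Searches the string for the regex (anywhere in the string, first occurrence only), with optional flags; returns a match object on success, otherwise None.

findall(pattern, string[, flags]) Finds all non-overlapping occurrences of the regex in the string and returns them as a list of matched strings (or of subgroup tuples when the pattern contains groups).
finditer(pattern, string[, flags]) Same as findall(), but returns an iterator; for each match, the iterator yields a match object.
split(pattern, string, maxsplit=0) Splits the string into a list using the regex as the delimiter and returns the list of pieces; at most maxsplit splits are made (the default, 0, means split at every match).
sub(pattern, repl, string, count=0) Replaces every match of the regex in the string with repl; if count is given, at most count replacements are made (subn() also returns the number of replacements).
purge() Clears the cache of implicitly compiled regex patterns.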

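Common match object methods
group(num=0) Returns the entire match (or the specific subgroup num).
groups(default=None) Returns a tuple containing all matched subgroups (empty if the pattern has no subgroups).
groupdict(default=None) Returns a dictionary of all matched named subgroups, keyed by name.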

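Common module attributes (flags for most regex functions)
re.I, re.IGNORECASE     Case-insensitive matching.
re.L, re.LOCALE         Matching of \w, \W, \b, \B, \s, \S follows the current locale.
re.M, re.MULTILINE Multiline mode: ^ and $ match at the start and end of each line of the target string, rather than only at the start and end of the whole string.
re.S, re.DOTALL         Normally "." matches any single character except "\n"; with re.S it matches every character.
re.X, re.VERBOSE Whitespace in the pattern is ignored and # starts a comment, except inside a character class or after a backslash (escape); this makes patterns more readable.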

3. Practice

match()

>>> import re
>>> m = re.match('abc', 'abc')
>>> m
<_sre.SRE_Match object at 0x7f72d289d920>
>>> m.group()
'abc'
>>> if m is not None: m.group() # guard against a failed match
... 
'abc'
>>> m = re.match('bc', 'abc')    # match() must match from the start; this fails, so m is None
>>> m
>>> m = re.match('abc', 'abc defg')   # a longer string still matches
>>> m
<_sre.SRE_Match object at 0x7f72d289d4a8>
>>> m.group()
'abc'
>>> re.match('abc', 'abc defg').group()   # the calls can also be chained
'abc'
>>> re.match('bc', 'abc defg').group()    # no match, so group() is called on None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>>

search()

>>> import re
>>> m = re.search('foo', 'seafood') # succeeds; the match need not be at the start
>>> if m is not None: m.group()
...
'foo'
>>>

findall()

>>> re.findall('car', 'car')
['car']
>>> re.findall('car', 'scary')
['car']
>>> re.findall('car', 'carry the barcardi to the car')   # finds every occurrence, wherever it is
['car', 'car', 'car']
>>>

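finditer(): an iterator is one way to access the elements of a collection. Calling next() on an iterator object returns the elements one at a time; when no elements remain, a StopIteration exception is raised.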

>>> m = 'abc'
>>> it = iter(m)  # the iter() built-in
>>> next(it)
'a'
>>> next(it)
'b'
>>> next(it)
'c'
>>> next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> it = re.finditer(r'\w', m)  # using finditer()
>>> next(it).group()
'a'
>>> next(it).group()
'b'
>>> next(it).group()
'c'
>>> next(it).group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>>
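
In practice the iterator returned by finditer() is usually consumed with a for loop, which handles StopIteration automatically. The sketch below reuses the 'car' sentence from the findall() example; the variable names are illustrative.

```python
import re

text = 'carry the barcardi to the car'

spans = []
for m in re.finditer(r'car', text):
    # Each m is a match object, so positions are available as well as text.
    spans.append((m.group(), m.start(), m.end()))

print(spans)  # [('car', 0, 3), ('car', 13, 16), ('car', 26, 29)]
```

This is the main reason to prefer finditer() over findall(): match objects carry positions and subgroups, whereas findall() returns only strings.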

Symbol Matching

>>> test = 'cat|car'
>>> m = re.match(test, 'catcar')
>>> m.group()
'cat'
>>> n = re.search(test, 'abcdcatcar')  # the first occurrence is cat
>>> n.group()
'cat'
>>> n = re.search(test, 'abcdcarcat')  # the first occurrence is car
>>> n.group()
'car'
>>> re.findall(test, 'carry the barcardi to the car,cat is catting file catalog')  # finds all occurrences
['car', 'car', 'car', 'cat', 'cat', 'cat']
>>>

Grouping

>>> m = re.match(r'(\w+)-(\d+)', 'abc-123')
>>> m.group()  # the entire match
'abc-123'
>>> m.groups()  # all subgroups
('abc', '123')
>>> m.group(1)
'abc'
>>> m.group(2)
'123'
>>> m = re.match('ab', 'ab') # no subgroups
>>> m.group()
'ab'
>>> m.groups()
()
>>> m = re.match('(ab)', 'ab') # one subgroup
>>> m.group() 
'ab'
>>> m.group(1)
'ab'
>>> m.groups()
('ab',)
>>>
>>> m = re.match('(a)(b)', 'ab') # two subgroups
>>> m.group()
'ab'
>>> m.group(1)
'a'
>>> m.group(2)
'b'
>>> m.groups()
('a', 'b')
>>>
>>> m = re.match('(a(b))', 'ab') # two subgroups, one nested inside the other
>>> m.group()
'ab'
>>> m.group(1)
'ab'
>>> m.group(2)
'b'
>>> m.groups()
('ab', 'b')

Using sub() and subn()

>>> re.sub('x', 'a', 'xxx x x x')  
'aaa a a a'
>>> re.subn('x', 'a', 'xxx x x x')
('aaa a a a', 6)
>>> re.sub('[xy]', 'A', 'xyz x y z') 
'AAz A A z'
>>> re.subn('[xy]', 'A', 'xyz x y z')
('AAz A A z', 4)
>>> re.sub(r'(\d{1,2})/(\d{1,2})/(\d{2})', r'\2/\1/\3', '2/20/91')  # \N in the replacement refers to subgroup N, like group(N) on the match object
'20/2/91'
>>> re.sub(r'(\d{1,2})/(\d{1,2})/(\d{4})', r'\3/\1/\2', '2/20/1991') 
'1991/2/20'
>>>
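
The count argument and the tuple returned by subn() are described earlier but not exercised above. A minimal sketch, reusing the same 'xxx x x x' input (the variable names are illustrative):

```python
import re

# count limits how many replacements are made (left to right).
limited = re.sub('x', 'a', 'xxx x x x', count=2)

# subn() returns a (new_string, number_of_replacements) tuple.
new_text, n_subs = re.subn('x', 'a', 'xxx x x x')

print(limited)   # aax x x x
print(new_text)  # aaa a a a
print(n_subs)    # 6
```

Passing count by keyword keeps the call unambiguous, since the positional slot after the string is easy to confuse with flags.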

split()

>>> re.split('/', 'a/b/c:d')
['a', 'b', 'c:d']
>>> re.split('/|:', 'a/b/c:d')
['a', 'b', 'c', 'd']
>>>
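
Two further behaviours of split() are worth a quick sketch: limiting the number of splits with maxsplit, and keeping the delimiters by wrapping the pattern in a capturing group. The inputs reuse the 'a/b/c:d' string from above; the variable names are illustrative.

```python
import re

# maxsplit=1: split only at the first delimiter.
head_tail = re.split(r'[/:]', 'a/b/c:d', maxsplit=1)

# A capturing group in the pattern keeps the delimiters in the result.
with_delims = re.split(r'([/:])', 'a/b/c:d')

print(head_tail)    # ['a', 'b/c:d']
print(with_delims)  # ['a', '/', 'b', '/', 'c', ':', 'd']
```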

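Extension Notation

(?iLmsux) is a set of flag letters; one or more flags can be embedded directly in the pattern (the equivalent of passing, for example, re.I|re.M).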

>>> re.findall(r'(?i)yes', 'yes? Yes. YES!!')        # ignore case
['yes', 'Yes', 'YES']
>>> re.findall(r'(^th[\w ]+)', '''This line,\nanother line,\nthat line, it's the best\n''',re.I|re.M)  # ignore case, multiline mode
['This line', 'that line']
>>> re.findall(r'th.+', '''The first line\nthe second line\nthe third line''')  # by default "." does not match "\n"
['the second line', 'the third line']
>>> re.findall(r'(?s)th.+', '''The first line\nthe second line\nthe third line''') # with re.S, "." also matches "\n"
['the second line\nthe third line']
>>> re.search(r'''(?x)   # re.X allows comments inside the pattern
...     (\d{3}|\d{4}) # area code   
...     -             # dash        
...     (\d+)         # phone number
... ''', '021-54321678').groups()   
('021', '54321678')
>>> re.findall(r'(\w+\.)*(\w+\.org)', 'www.openstack.org docs.openstack.org wiki.openstack.org')  
[('www.', 'openstack.org'), ('docs.', 'openstack.org'), ('wiki.', 'openstack.org')]
>>> re.findall(r'(?:\w+\.)*(\w+\.org)', 'www.openstack.org docs.openstack.org wiki.openstack.org')  # the first part is not captured, so it is not returned
['openstack.org', 'openstack.org', 'openstack.org']
>>> re.findall(r'(\w+\.)*(?:\w+\.org)', 'www.openstack.org docs.openstack.org wiki.openstack.org')  
['www.', 'docs.', 'wiki.']
>>> re.findall(r'(?:\w+\.)*(?:\w+\.org)', 'www.openstack.org docs.openstack.org wiki.openstack.org') # no capturing groups, so findall() returns each whole match
['www.openstack.org', 'docs.openstack.org', 'wiki.openstack.org']
>>>
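
The four findall() results above follow one rule: with no capturing groups, findall() returns whole matches; with one group, it returns that group's text; with several groups, it returns tuples of group texts. A compact sketch of the three cases, on a shortened version of the same input (the variable names are illustrative):

```python
import re

text = 'www.openstack.org docs.openstack.org'

# No capturing groups: each item is the entire match.
no_groups = re.findall(r'(?:\w+\.)*(?:\w+\.org)', text)

# One capturing group: each item is that group's text.
one_group = re.findall(r'(?:\w+\.)*(\w+\.org)', text)

# Two capturing groups: each item is a tuple of both groups.
two_groups = re.findall(r'(\w+\.)*(\w+\.org)', text)

print(no_groups)   # ['www.openstack.org', 'docs.openstack.org']
print(one_group)   # ['openstack.org', 'openstack.org']
print(two_groups)  # [('www.', 'openstack.org'), ('docs.', 'openstack.org')]
```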

Using (?P<name>) together with (?P=name)

>>> re.search(r'\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-(?:\d{4})',
...     '(800) 555-1212').groupdict()                                
{'areacode': '800', 'prefix': '555'}
>>> re.sub(r'\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-(?:\d{4})',
...     r'(\g<areacode>) \g<prefix>-xxxx', '(800) 555-1212')
'(800) 555-xxxx'
>>>
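
The session above uses (?P<name>) with groupdict() and \g<name>, but not (?P=name) itself. As a hedged sketch, (?P=name) is a backreference inside the pattern: it must match the same text the named group matched. The doubled-word pattern and sample sentence are made up for this example.

```python
import re

# (?P=word) must repeat exactly what (?P<word>...) matched.
DOUBLED_RE = re.compile(r'\b(?P<word>\w+) (?P=word)\b')

m = DOUBLED_RE.search('it was the the best of times')

print(m.group())        # the the
print(m.group('word'))  # the
```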

(?=...) and (?!...) are the positive lookahead and negative lookahead assertions, respectively.

>>> re.findall(r'\w+(?= Tom)',
... '''
...     Small Tom
...     Old Tom
...     Small Jack
... ''')
['Small', 'Old']
>>> re.findall(r'(?m)^\s+(?!noreply|postmaster)(\w+)',
... '''
...     sales@test.com
...     postmaster@test.com
...     noreply@test.com
...     admin@test.com
... ''')
['sales', 'admin']
>>>

Optional and Repeated Groups

>>> re.match(r'\w+\.(\w+\.)?\w+\.com', 'www.google.com').group()  # here ? makes the group optional (0 or 1 times); it is not the non-greedy qualifier
'www.google.com'
>>> re.match(r'\w+\.(\w+\.)?\w+\.com', 'www.hk.google.com').group()
'www.hk.google.com'
>>> re.match(r'\w+\.(\w+\.)?\w+\.com', 'www.ch.sh.google.com').group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> re.match(r'\w+\.(\w+\.)*\w+\.com', 'www.ch.sh.pd.google.com').group()
'www.ch.sh.pd.google.com'
>>>
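
The symbol table lists non-greedy versions of the repetition symbols (a ? placed after *, +, ?, or {}). A minimal sketch of the difference between greedy and non-greedy matching, on a made-up HTML-like string:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ grabs as much as possible, up to the last '>'.
greedy = re.findall(r'<.+>', html)

# Non-greedy: .+? stops at the first '>' that lets the match succeed.
lazy = re.findall(r'<.+?>', html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```

The trailing ? changes only how much the repetition consumes, not whether the pattern can match.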