Learning Python Regular Expressions

sfdssdf123

Last modified 2025-02-05 16:34:21

License: CC 4.0 BY-SA

Tags: learning regular expressions, python

First published 2025-01-24 16:04:29

Article link: https://blog.youkuaiyun.com/sfdssdf123/article/details/145342503

I. Summary of rules

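1. Regular expression syntax
   1.1. ? matches the preceding item (a character or a group) 0 or 1 times; the upper bound is 1
   1.2. * matches the preceding item 0 or more times; there is no upper bound
   1.3. + matches the preceding item 1 or more times; there is no upper bound
   1.4. {n} matches the preceding item exactly n times
   1.5. {n,} matches the preceding item n or more times
   1.6. {,m} matches the preceding item at most m times (0 to m)
   1.7. {n,m} matches the preceding item at least n and at most m times
   1.8. {n,m}? matches the preceding item non-greedily
   1.9. *? matches the preceding item non-greedily
   1.10. +? matches the preceding item non-greedily
   1.11. ^spam means the string must start with spam
   1.12. spam$ means the string must end with spam
   1.13. The dot . matches any character except the newline \n
   1.14. \d matches a digit; \D matches any character that is not a digit
   1.15. \w matches a word character (letter, digit, or underscore); \W matches any other character
   1.16. \s matches a whitespace character (space, tab, or newline); \S matches any non-whitespace character
   1.17. [abc] matches any one character inside the brackets (here a, b, or c)
   1.18. [^abc] matches any one character not inside the brackets
   1.19. | matches either the expression on its left or the one on its right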

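1.20. re.compile('foo', re.IGNORECASE): the re.IGNORECASE flag makes matching case-insensitive

1.21. re.compile('foo', re.DOTALL): the re.DOTALL flag makes the dot . match every character, including the newline \n

1.22. re.compile('foo', re.VERBOSE): the re.VERBOSE flag allows whitespace and # comments inside the pattern; both are ignored when matching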

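Note: why does ? mean 0 or 1? A question mark asks whether something exists, and the answer is either yes (1) or no (0), so ? stands for 0 or 1 occurrences. (This is a mnemonic I came up with myself.)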
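
A few of the rules above can be checked quickly with a minimal sketch. It uses re.fullmatch so that the whole string must match; the sample strings are made up for illustration.

```python
import re

# ? allows zero or one occurrence of the preceding item.
optional_u = [bool(re.fullmatch(r'colou?r', s)) for s in ('color', 'colour', 'colouur')]
print(optional_u)  # [True, True, False]

# \w covers letters, digits and the underscore, not letters alone.
word_chars = re.findall(r'\w', 'a_1 -')
print(word_chars)  # ['a', '_', '1']

# \s covers spaces, tabs and newlines.
whitespace = re.findall(r'\s', 'a b\tc\nd')
print(whitespace)  # [' ', '\t', '\n']

# {n,m} bounds the repetition count on both sides.
bounded = [bool(re.fullmatch(r'\d{2,3}', s)) for s in ('1', '12', '123', '1234')]
print(bounded)  # [False, True, True, False]
```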

2. Using regular expressions in Python code

2.0. Remember to import re before using regular expressions.

import re

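   2.1. re.search() returns a Match object, or None if nothing matches. When the pattern contains parentheses it has groups; call groups() on the Match object to get a tuple holding the text captured by each group.
   2.2. When the pattern has no parentheses it has no groups; call group() to get the whole matched text as a single string. group() also works when groups are present and still returns the whole match.
   2.3. re.findall() finds every match and returns a list directly, so no group() call is needed.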
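
The difference between group(), groups(), and findall() can be sketched as follows. The sample text and patterns are chosen only for illustration.

```python
import re

text = 'Cell:415-555-9999 Work:222-555-0000'  # sample text for illustration

# With groups: groups() returns a tuple, group() still returns the whole match.
grouped = re.search(r'(\d{3})-(\d{3}-\d{4})', text)
print(grouped.groups())  # ('415', '555-9999')
print(grouped.group())   # 415-555-9999

# Without groups: group() returns the whole match and groups() is empty.
plain = re.search(r'\d{3}-\d{3}-\d{4}', text)
print(plain.group())   # 415-555-9999
print(plain.groups())  # ()

# findall returns a list directly.
all_numbers = re.findall(r'\d{3}-\d{3}-\d{4}', text)
print(all_numbers)  # ['415-555-9999', '222-555-0000']
```
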
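3. Greedy and non-greedy matching
   By default, whenever a quantifier could match different amounts of text, it matches greedily (as much as possible).
   To match non-greedily (as little as possible), append ? to the quantifier.
4. Matching newlines
   By default the dot . does not match \n; pass re.DOTALL to make it match (see examples 19 and 20).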

II. Examples

1. Using the alternation operator |

heroRegex=re.compile(r'Batman|Tina Fey')
mo1=heroRegex.search('Batman and Tina Fey Batman')
print(mo1.group())

# Output:
# Batman

2. Using the alternation operator | (the leftmost match in the string wins)

heroRegex=re.compile(r'Batman|Tina Fey')
mo1=heroRegex.search('Watman and Tina Fey Batman')
print(mo1.group())

# Output
# Tina Fey

3. Using | inside a group

batRegex=re.compile(r'Bat(man|mobile|copter|bat)')
mo1=batRegex.search('Batcopter lost a Batbat')
print(mo1.group())

# Output
# Batcopter

4. Using | inside a group

batRegex=re.compile(r'Bat(man|mobile|copter|bat)')
mo1=batRegex.search('Batmanmobile lost a Batbat')
print(mo1.group())
# Output
# Batman

5. Using | inside a group and retrieving only the text captured by the group

batRegex=re.compile(r'Bat(man|mobile|copter|bat)')
mo1=batRegex.search('Batmanmobile lost a Batbat')
print(mo1.group(1))
# Output
# man
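
The results in examples 4 and 5 follow from how | is evaluated: at each position the alternatives are tried from left to right, and the first one that lets the overall pattern succeed is used, even if a later alternative would match more text. A minimal sketch (the patterns in the first part are made up for illustration):

```python
import re

# Both alternatives can match at position 0; the one listed first is chosen.
short_first = re.search(r'Bat|Batman', 'Batman').group()
long_first = re.search(r'Batman|Bat', 'Batman').group()
print(short_first)  # Bat
print(long_first)   # Batman

# In 'Batmanmobile', man is tried before mobile at the same position.
in_group = re.search(r'Bat(man|mobile|copter|bat)', 'Batmanmobile').group(1)
print(in_group)  # man
```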

6. ? matches exactly 0 or 1 occurrence; more than 1 occurrence does not match

batRegex=re.compile(r'Bat(wo)?man')
mo=batRegex.search('The Adventures of Batman')
print(mo.group())
mo2=batRegex.search('The Adventures of Batwoman')
print(mo2.group())
mo3=batRegex.search('The Adventures of Batwowoman')
print(mo3==None)

# Output
# Batman
# Batwoman
# True

phoneRegex=re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo6=phoneRegex.search('My number is 415-555-4242')
print(mo6.group())
mo7=phoneRegex.search('My number is 999-4242')
print(mo7.group())

# Output
# 415-555-4242
# 999-4242
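
Because search() returns None when nothing matches, calling group() on the result without a check raises AttributeError. One way to guard against that is sketched below; find_phone is a hypothetical helper name introduced only for this illustration.

```python
import re

phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

def find_phone(text):
    """Return the first phone number in text, or None when there is none."""
    mo = phoneRegex.search(text)
    # search() returns None on failure, so check before calling group().
    return mo.group() if mo is not None else None

found = find_phone('My number is 415-555-4242')
missing = find_phone('No number here')
print(found)    # 415-555-4242
print(missing)  # None
```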

7. * matches zero or more occurrences

batRegex2=re.compile(r'Bat(wo)*man')
mo8=batRegex2.search('The Adventures of Batwowoman')
print(mo8.group())

# Output
# Batwowoman

batRegex2=re.compile(r'Bat(wo)*man')
mo9=batRegex2.search('The Adventures of Batman')
print(mo9.group())
# Batman

8. + matches one or more occurrences

batRegex3=re.compile(r'Bat(wo)+man')
mo2=batRegex3.search('The Adventures of Batman')
print(mo2==None)
# True

batRegex3=re.compile(r'Bat(wo)+man')
mo3=batRegex3.search('The Adventures of Batwoman')
print(mo3.group())
# Batwoman

batRegex3=re.compile(r'Bat(wo)+man')
mo4=batRegex3.search('The Adventures of Batwowoman')
print(mo4.group())
# Batwowoman

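9. {n,m} matches n to m occurrences, with both bounds inclusive. An omitted lower bound defaults to 0; an omitted upper bound means there is no limit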

batRegex4=re.compile(r'Bat(wo){,2}man')
mo10=batRegex4.search('The Adventures of Batman')
print(mo10.group())
# Batman

batRegex4=re.compile(r'Bat(wo){,2}man')
mo10=batRegex4.search('The Adventures of Batwowoman')
print(mo10.group())
# Batwowoman

batRegex4=re.compile(r'Bat(wo){,2}man')
mo10=batRegex4.search('The Adventures of Batwowowoman')
print(mo10==None)
# True
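
The two open-ended forms behave differently, which a short sketch makes visible. The {2,} pattern and the test strings are assumptions added for illustration.

```python
import re

# {2,} has no upper bound: two or more repetitions are accepted.
at_least_two = re.compile(r'Bat(wo){2,}man')
results_open = [at_least_two.search(s) is not None
                for s in ('Batwoman', 'Batwowoman', 'Batwowowowoman')]
print(results_open)  # [False, True, True]

# {,2} has an implicit lower bound of 0: zero to two repetitions.
at_most_two = re.compile(r'Bat(wo){,2}man')
results_capped = [at_most_two.search(s) is not None
                  for s in ('Batman', 'Batwowoman', 'Batwowowoman')]
print(results_capped)  # [True, True, False]
```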

10. ? for non-greedy matching

Match as few repetitions as possible

batRegex5=re.compile(r'(wo){2,4}?')
mo11=batRegex5.search('The Adventures of wowowowo')
print(mo11.group())
# wowo

Compare with the greedy match below

Match as many repetitions as possible

batRegex5=re.compile(r'(wo){2,4}')
mo11=batRegex5.search('The Adventures of wowowowo')
print(mo11.group())
# wowowowo

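11. When non-greedy matching makes no difference

Although the pattern below is non-greedy, the result is the same as with greedy matching: the literal man must follow the repeated group, so only one repetition count lets the whole pattern match.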

batRegex6=re.compile(r'Bat(wo){1,4}?man')
mo12=batRegex6.search('The Adventures of Batwowoman')
print(mo12.group())
#Batwowoman

The following uses greedy matching

batRegex6=re.compile(r'Bat(wo){1,4}man')
mo12=batRegex6.search('The Adventures of Batwowoman')
print(mo12.group())
# Batwowoman
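
To see when the trailing ? does change the outcome, it helps to drop the literal man from the end of the pattern. This is a minimal sketch; the shortened patterns are assumptions added for comparison.

```python
import re

text = 'The Adventures of Batwowoman'

# Nothing follows the repeated group, so lazy and greedy results differ.
lazy_open = re.search(r'Bat(wo){1,4}?', text).group()
greedy_open = re.search(r'Bat(wo){1,4}', text).group()
print(lazy_open)    # Batwo
print(greedy_open)  # Batwowo

# With man required afterwards, only two repetitions can succeed.
lazy_closed = re.search(r'Bat(wo){1,4}?man', text).group()
greedy_closed = re.search(r'Bat(wo){1,4}man', text).group()
print(lazy_closed == greedy_closed)  # True
```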

12. search finds only the first match and returns a Match object; findall finds all matches and returns a list

batRegex8=re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # no groups
mo14=batRegex8.search('Cell:415-555-9999 Work:222-555-0000')
print(mo14.group())
# 415-555-9999

batRegex8=re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # no groups
mo14=batRegex8.findall('Cell:415-555-9999 Work:222-555-0000')
print(mo14)
# ['415-555-9999', '222-555-0000']

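When the pattern contains groups, findall returns tuples holding only the text captured by each group. The hyphens sit outside the parentheses, which is why they are not extracted (this puzzled me at first).

batRegex71=re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # three groups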
mo131=batRegex71.findall('Cell:415-555-9999 Work:222-555-0000')
print(mo131)
# [('415', '555', '9999'), ('222', '555', '0000')]
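
If the whole number, hyphens included, is wanted while keeping the parentheses, two options are sketched here: non-capturing groups written as (?:...), or re.finditer, which yields Match objects. Both are standard parts of the re module; the variable names are made up for illustration.

```python
import re

text = 'Cell:415-555-9999 Work:222-555-0000'

# Non-capturing groups (?:...) group without capturing, so findall
# returns whole matches again.
whole_matches = re.findall(r'(?:\d\d\d)-(?:\d\d\d)-(?:\d\d\d\d)', text)
print(whole_matches)  # ['415-555-9999', '222-555-0000']

# finditer yields Match objects, giving both the whole match and the groups.
pairs = [(mo.group(), mo.groups())
         for mo in re.finditer(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)', text)]
print(pairs[0])  # ('415-555-9999', ('415', '555', '9999'))
```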

13. Using findall

Extract: one or more digits, then one whitespace character, then one or more word characters

batRegex14=re.compile(r'\d+\s\w+') 
mo15=batRegex14.findall('12 drummers,11 pipers,10 lords,9 ladies,8 maids,7 swans,6 geese')
print(mo15)
# ['12 drummers', '11 pipers', '10 lords', '9 ladies', '8 maids', '7 swans', '6 geese']

14. Using findall with a character class

batRegex15=re.compile(r'[aeiouAEIOU]') 
mo16=batRegex15.findall('RoboCop eats baby food. BABY FOOD.')
print(mo16)
# ['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

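15. Using findall with a negated character class

I was unsure about this regular expression at first.

Does ^ apply only to 0-5, or to everything after it as well? A ^ placed directly after the opening [ negates the entire class, so the pattern below matches any single character that is not a digit from 0 to 5, not a letter, not a period, and not whitespace.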

batRegex17=re.compile(r'[^0-5A-Za-z.\s]') 
mo18=batRegex17.findall('3RoboCop 4 eats 6baby 1food. 8BABY FOOD.')
print(mo18)
# ['6', '8']
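
The scope of ^ inside brackets can be checked directly. The sketch below compares the negated class with its positive counterpart and shows that ^ is an ordinary character when it is not the first one in the brackets; the last pattern is made up for illustration.

```python
import re

text = '3RoboCop 4 eats 6baby 1food. 8BABY FOOD.'

# ^ directly after [ negates the entire class.
negated = re.findall(r'[^0-5A-Za-z.\s]', text)
print(negated)  # ['6', '8']

# Without ^ the same class matches everything the negated one skipped.
positive = re.findall(r'[0-5A-Za-z.\s]', text)
print(len(negated) + len(positive) == len(text))  # True

# ^ anywhere else inside the brackets is an ordinary character.
literal_caret = re.findall(r'[0-5^]', '3^6')
print(literal_caret)  # ['3', '^']
```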

16. Using findall with the dot

atRegex = re.compile(r'.at')
mo22=atRegex.findall('The cat in the hat sat on the flat mat.')
print(mo22)
# ['cat', 'hat', 'sat', 'lat', 'mat']

17. Using search with .* and groups

nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo23 = nameRegex.search('First Name: Al Last Name: Sweigart')
print(mo23.group())
print(mo23.group(1))
print(mo23.group(2))
# First Name: Al Last Name: Sweigart
# Al
# Sweigart

18. Another example of greedy versus non-greedy matching

Non-greedy

nongreedyRegex = re.compile(r'<.*?>')
mo24 = nongreedyRegex.search('<To serve man> for dinner.>')
print(mo24.group())
# <To serve man>

Greedy

greedyRegex = re.compile(r'<.*>')
mo24 = greedyRegex.search('<To serve man> for dinner.>')
print(mo24.group())
# <To serve man> for dinner.>

19. .* matches everything except a newline

noNewlineRegex = re.compile('.*')
mo26=noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.')
print(mo26.group())
# Output
#  Serve the public trust.



# \n denotes a line break
print('Serve the public trust.\nProtect the innocent.\nUphold the law.')
# Output
#  Serve the public trust.
#  Protect the innocent.
#  Uphold the law.

20. Adding the re.DOTALL flag so that . also matches newlines

newlineRegex = re.compile('.*', re.DOTALL)
mo27=newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.')
print(mo27.group())
# Serve the public trust.
# Protect the innocent.
# Uphold the law.
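
Another way to see the effect of re.DOTALL is to count what findall returns with and without the flag. This is a minimal sketch that reuses the sample text above; the .+ pattern is an assumption chosen so that empty matches do not appear in the result.

```python
import re

text = 'Serve the public trust.\nProtect the innocent.\nUphold the law.'

# Without DOTALL, .+ stops at each newline, so findall returns one item per line.
per_line = re.findall(r'.+', text)
print(len(per_line))  # 3

# With DOTALL, .+ runs across the newlines and matches the text in one piece.
whole = re.findall(r'.+', text, re.DOTALL)
print(len(whole))        # 1
print(whole[0] == text)  # True
```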

21. Ignoring letter case when matching

batRegex18=re.compile(r'good',re.IGNORECASE) 
mo=batRegex18.search('Hello GOOD world')
print(mo.group())
#GOOD

batRegex18=re.compile(r'good',re.I) 
mo=batRegex18.search('Hello GOOD world')
print(mo.group())
#GOOD

22. Matching and replacing with a regular expression

namesRegex = re.compile(r'Agent \w+')
mo=namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob ,great Agent .')
print(mo)
# CENSORED gave the secret documents to CENSORED ,great Agent .
# The last "Agent" is not replaced: \w+ needs at least one word character after the space.
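
To check how many replacements actually happen, subn() can be used in place of sub(); it returns the new string together with a count. In this sketch the final "Agent ." stays untouched because \w+ requires at least one word character after the space.

```python
import re

namesRegex = re.compile(r'Agent \w+')
text = 'Agent Alice gave the secret documents to Agent Bob ,great Agent .'

# subn() works like sub() but also reports how many replacements were made.
censored, count = namesRegex.subn('CENSORED', text)
print(censored)  # CENSORED gave the secret documents to CENSORED ,great Agent .
print(count)     # 2
```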

22. Matching and replacing, reusing text captured by groups

agentNamesRegex = re.compile(r'Agent (\w)(\w)\w*')
mo=agentNamesRegex.sub(r'\1****','Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
print(mo)
#A**** told C**** that E**** knew B**** was a double agent.

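Note:

1. The replacement string can reuse text captured by the regular expression.

2. In the replacement string, \1 and \2 refer to groups 1 and 2; the pattern above has two pairs of parentheses, hence two groups.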

agentNamesRegex = re.compile(r'Agent (\w)(\w)\w*')
mo=agentNamesRegex.sub(r'\2****','Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
print(mo)
#l**** told a**** that v**** knew o**** was a double agent.
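
When a group reference is followed directly by a digit, \1 becomes ambiguous (\10 would be read as a reference to group 10). The re module also accepts the \g<1> spelling for this case. A minimal sketch with a shortened sample sentence:

```python
import re

agentNamesRegex = re.compile(r'Agent (\w)(\w)\w*')
text = 'Agent Alice told Agent Carol'

# \g<1> is the unambiguous spelling of \1; here a digit follows the reference.
numbered = agentNamesRegex.sub(r'\g<1>0', text)
print(numbered)  # A0 told C0

# Both groups can be used in the same replacement.
both = agentNamesRegex.sub(r'\1\2***', text)
print(both)  # Al*** told Ca***
```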

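23. Writing complex regular expressions

Use the re.VERBOSE flag so that whitespace and comments inside the pattern are ignored.

Purpose: this makes complex regular expressions easier to write, because each small piece of the pattern can go on its own line with a comment.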

phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))? 			 # area code
(\s|-|\.)? 					 # separator
\d{3} 						 # first 3 digits
(\s|-|\.) 					 # separator
\d{4} 						 # last 4 digits
(\s*(ext|x|ext\.)\s*\d{2,5})? # extension
)''', re.VERBOSE)

The triple quotes (''') delimit a multi-line string.
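
A usage sketch for the verbose pattern follows. The sample text is made up, and the dot in "ext." is escaped here so that it matches a literal period. Because the pattern contains groups, findall would return tuples, so finditer with group() is used to collect the whole numbers.

```python
import re

# Same phone pattern, with the dot in "ext." escaped.
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})? # extension
    )''', re.VERBOSE)

text = 'Call (415) 555-4242 x99 or 415.555.9999.'  # made-up sample text

# finditer yields Match objects; group() returns each whole match.
numbers = [mo.group() for mo in phoneRegex.finditer(text)]
print(numbers)  # ['(415) 555-4242 x99', '415.555.9999']
```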

24. Combining flags with |: matching newlines (re.DOTALL), ignoring case (re.IGNORECASE), and allowing comments (re.VERBOSE) at the same time

someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)
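
A minimal sketch of the three flags working together; the begin/end pattern and the sample text are assumptions made up for illustration.

```python
import re

someRegexValue = re.compile(r'''
    begin   # literal word; IGNORECASE lets it match BEGIN
    .*      # DOTALL lets the dot cross the newline
    end     # closing word
    ''', re.IGNORECASE | re.DOTALL | re.VERBOSE)

mo = someRegexValue.search('BEGIN first line\nsecond line END')
matched = mo.group()
print(matched == 'BEGIN first line\nsecond line END')  # True
```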