day14 Regular Expressions


License: CC 4.0 BY-SA


Regular expressions

A tool that makes complex string-processing problems simple.

1. The re module

from re import fullmatch, match, search, findall, finditer, sub, split

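fullmatch(pattern, string): matches the entire string (checks whether the whole string fits the rule the pattern describes). Returns a match object on success and None on failure.

match(pattern, string): matches at the start of the string (checks whether the beginning of the string fits the rule). Returns a match object on success and None on failure.

search(pattern, string): finds the first substring that satisfies the pattern. Returns a match object on success and None on failure.

findall(pattern, string): gets every substring that satisfies the pattern. Returns a list whose elements are the matched strings.

finditer(pattern, string): gets every substring that satisfies the pattern. Returns an iterator whose elements are match objects.

split(pattern, string): splits the string, using every substring that satisfies the pattern as a cut point.

sub(pattern, string1, string2): replaces every substring of string2 that satisfies the pattern with string1.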

result = fullmatch(r'abc', 'abc')
print(result)       # <re.Match object; span=(0, 3), match='abc'>

print(match(r'ab', 'abaaaab'))          # <re.Match object; span=(0, 2), match='ab'>

print(search(r'\d{3}', 'sgg789几十739块'))
# <re.Match object; span=(3, 6), match='789'>

str1 = '123jsk5678换手机 789,89ksk890 数 899'
result = findall(r'\d+', str1)
print(result)       # ['123', '5678', '789', '89', '890', '899']

result = finditer(r'\d+', str1)
print(result)       # <callable_iterator object at 0x0000018C1B0A2A30>
print(next(result)) # <re.Match object; span=(0, 3), match='123'>

result = split(r'\d+', str1)
print(result)       # ['', 'jsk', '换手机 ', ',', 'ksk', ' 数 ', '']

result = sub(r'\d+', 'A', str1)
print(result)   # AjskA换手机 A,AkskA 数 A
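
The difference between fullmatch, match and search is easiest to see when the same pattern is run against the same string. The snippet below is a small illustrative sketch; the sample text and the variable names are made up for this example.

```python
from re import fullmatch, match, search, finditer

text = 'ab12cd345'
pattern = r'\d+'

whole = fullmatch(pattern, text)    # None: the string is not made of digits only
at_start = match(pattern, text)     # None: the string does not start with a digit
first = search(pattern, text)       # match object for the first run of digits

print(whole, at_start)
print(first.group(), first.span())

# finditer yields one match object per hit; span() gives (start, end) in text
spans = [(m.group(), m.span()) for m in finditer(pattern, text)]
print(spans)
```

Only search finds anything here, because the digits sit in the middle of the string.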

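2. Matching symbols

(1) The fullmatch function

from re import fullmatch

fullmatch(pattern, string): checks whether the whole string satisfies the rule the pattern describes

match(pattern, string): matches at the start of the string

Whatever string problem you solve with a regular expression, writing the pattern always comes down to describing the rule the string must follow.

(2) All matching symbols

1) Ordinary characters: a character that stands for itself in a regular expression is an ordinary character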

result = fullmatch(r'abc', 'abc')
print(result)       # <re.Match object; span=(0, 3), match='abc'>
result1 = fullmatch(r'abc', '1bc')
print(result1)      # None

2) . (dot): matches any single character (by default, any character except a newline)

result1 = fullmatch(r'.bc', '1bc')  # any 3 characters ending in bc
print(result1)      # <re.Match object; span=(0, 3), match='1bc'>
result2 = fullmatch(r'.bc', 'hbc')
print(result2)      # <re.Match object; span=(0, 3), match='hbc'>

result = fullmatch(r'.a.', 'haj')   # any three characters whose second character is a
print(result)       # <re.Match object; span=(0, 3), match='haj'>
result = fullmatch(r'.a.', '+a-')
print(result)       # <re.Match object; span=(0, 3), match='+a-'>

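3) \d: matches any single digit character

result = fullmatch(r'a\d\db', 'a23b')   # four characters: a, then any 2 digits, then b
print(result)   # <re.Match object; span=(0, 4), match='a23b'>

4) \s: matches any single whitespace character (space, \t, \n, and also \r, \f, \v)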

result = fullmatch(r'a\sb', 'a b')
print(result)   # <re.Match object; span=(0, 3), match='a b'>
result = fullmatch(r'a\sb', 'a\tb')
print(result)   # <re.Match object; span=(0, 3), match='a\tb'>
result = fullmatch(r'a\sb', 'a\nb')
print(result)   # <re.Match object; span=(0, 3), match='a\nb'>

5) \w: matches any single digit, letter, underscore, or Chinese character (for str patterns, any Unicode word character)

result = fullmatch(r'a\wb', 'a1b')
print(result)   # <re.Match object; span=(0, 3), match='a1b'>
result = fullmatch(r'a\wb', 'afb')
print(result)   # <re.Match object; span=(0, 3), match='afb'>
result = fullmatch(r'a\wb', 'a_b')
print(result)   # <re.Match object; span=(0, 3), match='a_b'>
result = fullmatch(r'a\wb', 'a我b')
print(result)   # <re.Match object; span=(0, 3), match='a我b'>
result = fullmatch(r'a\wb', 'a@b')
print(result)   # None
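
That \w accepts Chinese characters is a consequence of Python matching str patterns in Unicode mode by default. The sketch below shows how the standard re.ASCII flag narrows \w to [a-zA-Z0-9_]; the variable names are only for illustration.

```python
import re

# Default (Unicode) matching: \w accepts the Chinese character.
unicode_result = re.fullmatch(r'a\wb', 'a我b')

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the same string fails.
ascii_result = re.fullmatch(r'a\wb', 'a我b', flags=re.ASCII)

# Letters from other scripts also count as \w by default (\u00e9 is e with an acute accent).
accented_result = re.fullmatch(r'a\wb', 'a\u00e9b')

print(unicode_result)
print(ascii_result)
print(accented_result)
```

If a rule should accept ASCII letters only, writing the set out as [a-zA-Z0-9_] is an equally valid choice.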

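6) \D, \S, \W

A backslash followed by an uppercase letter does exactly the opposite of the same sequence written with the lowercase letter

\D: matches any single non-digit character
\S: matches any single non-whitespace character (anything except space, \t, \n, and so on)
\W: matches any single character that is not a digit, letter, underscore, or Chinese character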

result = fullmatch(r'\Da\Sb\W', '@a_b^')
print(result)   # <re.Match object; span=(0, 5), match='@a_b^'>
result = fullmatch(r'\Da\Sb\W', '1a_b^')
print(result)   # None
result = fullmatch(r'\Da\Sb\W', '@a b^')
print(result)   # None
result = fullmatch(r'\Da\Sb\W', '@a_b我')
print(result)   # None

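7) [character set]: matches any single character that is in the set

[amx]: matches any one of a, m, x
[\dmn]: matches any single digit, or m, or n (special sequences starting with \ keep their meaning inside brackets)
[3-9]: matches any single digit from 3 to 9
[a-z]: matches any single lowercase letter
[A-Z]: matches any single uppercase letter
[a-zA-Z]: matches any single letter
[\u4e00-\u9fa5]: matches any single Chinese character (this range covers the commonly used CJK ideographs)

Note: a - inside brackets means "from ... to ..." only when it sits between two characters. At the very start or end of the brackets, or outside brackets altogether, it is just a literal hyphen.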

result = fullmatch(r'a[\dsj]b', 'asb')
print(result)   # <re.Match object; span=(0, 3), match='asb'>
result = fullmatch(r'a[\dsj]b', 'a4b')
print(result)   # <re.Match object; span=(0, 3), match='a4b'>
result = fullmatch(r'a[\dsj]b', 'acb')
print(result)   # None
print('~~~~~~~~~~~~~~~~divider~~~~~~~~~~~~~~~~')

result = fullmatch(r'a[4-6]b', 'a5b')
print(result)   # <re.Match object; span=(0, 3), match='a5b'>
result = fullmatch(r'a[4-6]b', 'a3b')
print(result)   # None
print('~~~~~~~~~~~~~~~~divider~~~~~~~~~~~~~~~~')

result = fullmatch(r'a[a-z]b', 'ahb')
print(result)   # <re.Match object; span=(0, 3), match='ahb'>
result = fullmatch(r'a[A-Z]b', 'aHb')
print(result)   # <re.Match object; span=(0, 3), match='aHb'>
result = fullmatch(r'a[A-Z]b', 'ahb')
print(result)   # None
result = fullmatch(r'a[a-zA-Z]b', 'ahb')
print(result)   # <re.Match object; span=(0, 3), match='ahb'>
result = fullmatch(r'a[a-zA-Z]b', 'aHb')
print(result)   # <re.Match object; span=(0, 3), match='aHb'>
result = fullmatch(r'a[a-zA-Z]b', 'a6b')
print(result)   # None
print('~~~~~~~~~~~~~~~~divider~~~~~~~~~~~~~~~~')

result = fullmatch(r'a[\u4e00-\u9fa5]b', 'a哈b')
print(result)   # <re.Match object; span=(0, 3), match='a哈b'>

result = fullmatch(r'a[-AZ]b', 'a-b')
print(result)   # <re.Match object; span=(0, 3), match='a-b'>
result = fullmatch(r'A-Z', 'A-Z')
print(result)   # <re.Match object; span=(0, 3), match='A-Z'>
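
Besides moving the hyphen to the start or end of the brackets, it can also be escaped with a backslash to keep it literal in the middle of a set. A minimal sketch of the difference, with variable names chosen only for this example:

```python
from re import fullmatch

# Escaped hyphen in the middle: the set is just 'a', '-' and 'z'.
escaped_hyphen = fullmatch(r'[a\-z]', '-')
escaped_letter = fullmatch(r'[a\-z]', 'm')

# Unescaped hyphen in the middle: a range, so 'm' matches and '-' does not.
range_letter = fullmatch(r'[a-z]', 'm')
range_hyphen = fullmatch(r'[a-z]', '-')

print(escaped_hyphen)   # a match object
print(escaped_letter)   # None
print(range_letter)     # a match object
print(range_hyphen)     # None
```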

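8) [^character set]: matches any single character that is not in the set

Inside [], ^ has its special meaning only when it comes first; anywhere else it is just a literal ^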

result = fullmatch(r'a[^A-Z]b', 'ahb')
print(result)   # <re.Match object; span=(0, 3), match='ahb'>
result = fullmatch(r'a[^A-Z]b', 'a*b')
print(result)   # <re.Match object; span=(0, 3), match='a*b'>
result = fullmatch(r'a[^A-Z]b', 'aHb')
print(result)   # None

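3. Quantifiers (symbols that control the number of matches)

1) *: 0 or more times (any number of times)

X* (a matching symbol followed by *): 0 or more occurrences of whatever X matches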

print(fullmatch(r'a*b', 'aaaaab'))              # <re.Match object; span=(0, 6), match='aaaaab'>
print(fullmatch(r'a*b', 'b'))                   # <re.Match object; span=(0, 1), match='b'>
print(fullmatch(r'\d*b', '1264b'))              # <re.Match object; span=(0, 5), match='1264b'>
print(fullmatch(r'.*b', 'mk264b'))              # <re.Match object; span=(0, 6), match='mk264b'>
print(fullmatch(r'[A-Z]*b', 'ASJDHDb'))         # <re.Match object; span=(0, 7), match='ASJDHDb'>

2) +: 1 or more times

print(fullmatch(r'a+b', 'aaaaab'))      # <re.Match object; span=(0, 6), match='aaaaab'>
print(fullmatch(r'a+b', 'ab'))          # <re.Match object; span=(0, 2), match='ab'>
print(fullmatch(r'a+b', 'b'))           # None

3) ?: 0 or 1 time

print(fullmatch(r'a?b', 'ab'))          # <re.Match object; span=(0, 2), match='ab'>
print(fullmatch(r'a?b', 'b'))           # <re.Match object; span=(0, 1), match='b'>
print(fullmatch(r'a?b', 'aaaaab'))      # None

4) {}

{N}: exactly N times
{M,N}: M to N times
{M,}: at least M times
{,N}: at most N times

print(fullmatch(r'a{5}b', 'aaaaab'))        # <re.Match object; span=(0, 6), match='aaaaab'>
print(fullmatch(r'a{5}b', 'aaab'))          # None
print(fullmatch(r'a{3,6}b', 'aaaaab'))      # <re.Match object; span=(0, 6), match='aaaaab'>
print(fullmatch(r'a{3,6}b', 'aab'))         # None
print(fullmatch(r'a{3,}b', 'aaaaab'))       # <re.Match object; span=(0, 6), match='aaaaab'>
print(fullmatch(r'a{3,}b', 'ab'))           # None
print(fullmatch(r'a{,3}b', 'b'))            # <re.Match object; span=(0, 1), match='b'>
print(fullmatch(r'a{,3}b', 'aaaaab'))       # None
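
Quantifiers are most useful when combined with character sets to describe a length rule. The sketch below checks a made-up username rule (one letter or underscore, followed by 2 to 15 letters, digits or underscores); the function name and the rule itself are assumptions for illustration only.

```python
from re import fullmatch


def is_valid_username(name):
    """Hypothetical rule: a letter or underscore, then 2 to 15 letters, digits or underscores."""
    return fullmatch(r'[a-zA-Z_][a-zA-Z0-9_]{2,15}', name) is not None


for candidate in ['abc', 'ab', '1abc', 'user_01', 'a' * 17]:
    print(candidate, is_valid_username(candidate))
```

Because fullmatch must cover the entire string, the {2,15} bound limits the total length to between 3 and 16 characters.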

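5) Greedy and non-greedy

When the number of repetitions is not fixed, matching is either greedy or non-greedy; greedy is the default.

Greedy: when several repetition counts would let the match succeed, take the largest one (+, *, ?, {M,N}, {M,}, {,N})

Non-greedy: when several repetition counts would let the match succeed, take the smallest one (+?, *?, ??, {M,N}?, {M,}?, {,N}?)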

print(match(r'ab', 'abaaaab'))          # <re.Match object; span=(0, 2), match='ab'>

print(match(r'a.+b', 'a2bdhkb是b'))      # <re.Match object; span=(0, 9), match='a2bdhkb是b'>
print(match(r'a.+?b', 'a2bdhkb是b'))     # <re.Match object; span=(0, 3), match='a2b'>
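
A common place where the greedy default surprises people is pulling delimited pieces out of a longer string. The following sketch runs both modes over a made-up HTML-like sample:

```python
from re import findall

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs as far as it can, so the match spans from the first < to the last >.
greedy = findall(r'<.+>', text)

# Non-greedy: .+? stops at the nearest >, so each tag is matched separately.
lazy = findall(r'<.+?>', text)

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
```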

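4. Groups and alternation

(1) Groups

(): parentheses wrap part of a regular expression so that it is treated as a single unit; each () is one group

1) Treating a part as a unit: apply a symbol (such as a quantifier) to a whole part of the pattern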

print(fullmatch(r'([a-z]{2}\d{3})+', 'nm123hj456ty789'))
# <re.Match object; span=(0, 15), match='nm123hj456ty789'>

2) Repetition: inside a pattern, \N repeats whatever the Nth group before it matched

print(fullmatch(r'(\d{3})-\1', '234-234'))
# <re.Match object; span=(0, 7), match='234-234'>
print(fullmatch(r'(\d{3})-\1', '234-784'))
# None
print(fullmatch(r'([a-z]{2})(\d{3})=\2\1{2}\2', 'nm891=891nmnm891'))
# <re.Match object; span=(0, 16), match='nm891=891nmnm891'>
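
One practical use of \N is spotting a word that was typed twice in a row. The sketch below assumes words are separated by a single space and uses a made-up sample sentence; calling group() on a match object returns the matched text.

```python
from re import finditer

text = 'this is is a test test of of regex'

# (\w+) captures a word, then " \1" requires a space followed by the same word again.
# The \b on both sides keeps the match from starting or ending inside a longer word.
doubled = [m.group() for m in finditer(r'\b(\w+) \1\b', text)]

print(doubled)   # ['is is', 'test test', 'of of']
```

Without the leading \b, the "is" at the end of "this" could start a match.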

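3) Capturing: getting a specific part of the match result

Capturing is either automatic or manual. findall captures automatically (when the pattern contains a group, it returns the group's content instead of the whole match); with a match object you capture manually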

from re import findall
message = '年龄: 18岁, 身高: 170, 体重: 120斤, 月薪: 3000元, 房租: 1000元'

result = findall(r'\d+', message)
print(result)   # ['18', '170', '120', '3000', '1000']

result = findall(r'\d+元', message)
print(result)   # ['3000元', '1000元']

# Automatic capture
result = findall(r'(\d+)元', message)
print(result)   # ['3000', '1000']

# Manual capture
result = fullmatch(r'(\d+)元', '7892元')
print(result)   # <re.Match object; span=(0, 5), match='7892元'>

# Get the text matched by the whole pattern: match_object.group()
print(result.group())   # 7892元

# Get the text matched by one group: match_object.group(group number)
print(result.group(1))  # 7892

result = fullmatch(r'([a-z]{2})(\d{3})=\2\1{2}\2', 'nm891=891nmnm891')
print(result)           # <re.Match object; span=(0, 16), match='nm891=891nmnm891'>
print(result.group())   # nm891=891nmnm891
print(result.group(1))  # nm
print(result.group(2))  # 891

result1 = findall(r'(\d+)', result.group())
print(result1)      # ['891', '891', '891']
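
When a pattern has more than one group, findall returns a list of tuples, and a match object exposes all groups at once through groups(). Groups can also be given names with the (?P<name>...) syntax. A short sketch on made-up sample data:

```python
from re import findall, search

text = 'Tom: 18, Amy: 21'

# Two groups in the pattern: each findall item is a (name, age) tuple.
pairs = findall(r'([A-Za-z]+): (\d+)', text)
print(pairs)            # [('Tom', '18'), ('Amy', '21')]

# Named groups can be read by name as well as by number.
m = search(r'(?P<name>[A-Za-z]+): (?P<age>\d+)', text)
print(m.groups())       # ('Tom', '18')
print(m.group('name'), m.group('age'))   # Tom 18
```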

(2) Alternation

|: alternation

pattern1|pattern2: try pattern1 first; if it matches, the whole expression matches; if it fails, try pattern2

print(fullmatch(r'\d{2}([a-z]{3}|[\u4e00-\u9fa5]{3})', '12hjk'))
# <re.Match object; span=(0, 5), match='12hjk'>
print(fullmatch(r'\d{2}([a-z]{3}|[\u4e00-\u9fa5]{3})', '12好几度'))
# <re.Match object; span=(0, 5), match='12好几度'>
print(fullmatch(r'\d{2}([a-z]{3}|[\u4e00-\u9fa5]{3})', '12h几个'))
# None
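
The | operator has the lowest precedence, so without parentheses it splits the entire pattern in two. That is why the examples above wrap the alternatives in a group. A minimal sketch of the difference:

```python
from re import fullmatch

# No group: the alternatives are the whole of 'ab' and the whole of 'cd'.
plain_ab = fullmatch(r'ab|cd', 'ab')
plain_abd = fullmatch(r'ab|cd', 'abd')

# With a group: only the middle character alternates between b and c.
grouped_abd = fullmatch(r'a(b|c)d', 'abd')
grouped_acd = fullmatch(r'a(b|c)d', 'acd')
grouped_ab = fullmatch(r'a(b|c)d', 'ab')

print(plain_ab, plain_abd)                    # match, None
print(grouped_abd, grouped_acd, grouped_ab)   # match, match, None
```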

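5. Position checks (zero-width assertions)

Once the characters match, these symbols check whether a given position meets a requirement; they consume no characters, so they do not change the length of the match

1) \b: checks whether the position of \b is a word boundary

Word boundary: a position between a word character (\w) and something that is not one: whitespace, punctuation, the start of the string, or the end of the string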

print(fullmatch(r'\d{2}\ba', '23a'))    # None
print(fullmatch(r'\d{2},\ba', '23,a'))  # <re.Match object; span=(0, 4), match='23,a'>

str1 = '123jsk5678换手机 789,89ksk890 数 899'
print(findall(r'\d+', str1))            # ['123', '5678', '789', '89', '890', '899']
print(findall(r'\b\d+', str1))          # ['123', '789', '89', '899']
print(findall(r'\d+\b', str1))          # ['789', '890', '899']
print(findall(r'\b\d+\b', str1))        # ['789', '899']
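
A typical use of \b is replacing a whole word without touching longer words that contain it. The sample sentence below is made up for illustration:

```python
from re import sub

text = 'cat concat cat.'

# Without \b the "cat" inside "concat" is replaced too.
loose = sub(r'cat', 'dog', text)

# With \b on both sides only the standalone word is replaced.
whole_word = sub(r'\bcat\b', 'dog', text)

print(loose)        # dog condog dog.
print(whole_word)   # dog concat dog.
```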

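2) ^: checks whether the position is the start of the string

3) $: checks whether the position is the end of the string

search(pattern, string): gets the first substring that satisfies the pattern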

print(search(r'\d{3}', 'sgg789几十739块'))
# <re.Match object; span=(3, 6), match='789'>
print(search(r'^1[3-9]\d{9}$', '13678992310'))
# <re.Match object; span=(0, 11), match='13678992310'>
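
The anchors matter because search alone is satisfied by a hit anywhere in the string. The sketch below reuses the phone-number pattern on a made-up noisy input, and also shows one detail worth knowing: by default $ also matches just before a newline at the very end of the string, while fullmatch does not tolerate it.

```python
from re import search, fullmatch

phone = r'1[3-9]\d{9}'
noisy = 'id:9913678992310xx'

# Without anchors, search finds 11 suitable digits inside the longer string.
loose = search(phone, noisy)
# With ^ and $, the whole string has to be the phone number.
strict = search(r'^' + phone + r'$', noisy)

# $ accepts a single trailing newline; fullmatch requires an exact fit.
with_newline = search(r'^\d+$', '123\n')
exact = fullmatch(r'\d+', '123\n')

print(loose.group())    # 13678992310
print(strict)           # None
print(with_newline)     # a match object for '123'
print(exact)            # None
```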

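4) Escaping: put a \ in front of a symbol that has a special meaning in regular expressions to turn it into an ordinary character

# e.g. 个.456 (one Chinese character, a literal dot, then digits)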
print(fullmatch(r'[\u4e00-\u9fa5]\.\d+', '和.233'))
# <re.Match object; span=(0, 5), match='和.233'>

# 2*4
print(fullmatch(r'\d\*\d', '2*3'))  # <re.Match object; span=(0, 3), match='2*3'>
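
When the text to be matched literally comes from a variable, escaping by hand is error-prone. The standard library function re.escape adds the backslashes automatically. A small sketch, with variable names chosen for this example:

```python
from re import escape, fullmatch

literal = '2*3'

# Used directly as a pattern, '2*3' means "zero or more 2s, then a 3".
as_pattern_literal = fullmatch(literal, '2*3')
as_pattern_digits = fullmatch(literal, '223')

# After escape(), the * is treated as an ordinary character.
safe_pattern = escape(literal)
escaped_literal = fullmatch(safe_pattern, '2*3')
escaped_digits = fullmatch(safe_pattern, '223')

print(as_pattern_literal, as_pattern_digits)   # None, match
print(escaped_literal, escaped_digits)         # match, None
```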