A Practical Guide to Regular Expressions

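This article covers the principles and techniques of using regular expressions, including the uniqueness and accuracy requirements. It walks through the functions of Python's re module (compile, findall, split, and others) and uses examples to show how to match, split, and replace strings with regular expressions.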

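Requirements for using regular expressions
1. Uniqueness: the pattern should match only strings of the target category and nothing else.
2. Accuracy: account for the features of the target strings as completely as possible, so that no valid target is missed.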
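As a rough illustration of the two requirements, the sketch below compares a loose pattern with a stricter one for dates written as YYYY-MM-DD. The sample text and both patterns are made up for this example and are not from the original article.

```python
import re

text = "Order 12-3-4 shipped on 2024-01-15 and arrived 2024-01-20."

# Too loose: also matches "12-3-4", which is not a date (fails uniqueness).
loose = re.findall(r"\d+-\d+-\d+", text)

# Stricter: four-digit year, two-digit month and day, delimited by word boundaries.
strict = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)

print(loose)   # ['12-3-4', '2024-01-15', '2024-01-20']
print(strict)  # ['2024-01-15', '2024-01-20']
```

The stricter pattern still accepts impossible dates such as 2024-99-99, so how far to tighten a pattern depends on the data you expect.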

The re module

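re.compile(pattern, flags=0)
Purpose: create a regular expression object
Parameters: pattern  the regular expression
flags  flag bits that enable additional matching behavior
Returns: a regular expression object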

The object returned by compile has methods that overlap with the module-level functions of re:

In [82]: re.findall(r'abc', 'abcdef')
Out[82]: ['abc']

In [83]: regex = re.compile(r'abc')

In [84]: regex.findall('abcdef')
Out[84]: ['abc']

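re.findall(pattern, string, flags=0)
Purpose: find everything the regular expression matches
Parameters: pattern  the regular expression
string  the target string
flags  flag bits
Returns: a list containing the matched content
If the pattern contains groups, only the content matched by the groups is returned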

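import re

# No groups: the whole match is returned
pattern = r'ab+'
l = re.findall(pattern, 'abbacajabbbb')
print(l)

# One group: only the group's content is returned
pattern = r'(ab)+'
l = re.findall(pattern, 'abbacajabbbb')
print(l)

# Several groups: each match becomes a tuple of group contents
pattern = r'(ab)cd(ef)+'
l = re.findall(pattern, 'abcdef==abcdef')
print(l)

Output:

['abb', 'abbbb']
['ab', 'ab']
[('ab', 'ef'), ('ab', 'ef')]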
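Because findall returns only group contents once a pattern contains groups, it can help to see one way of getting the full match back. The sketch below uses a non-capturing group, (?:...), which the article does not cover. The sample string is made up for illustration.

```python
import re

text = "ababcab"

# Capturing group: findall returns only the last repetition of the group.
captured = re.findall(r"(ab)+", text)

# Non-capturing group: the pattern has no groups, so whole matches come back.
whole = re.findall(r"(?:ab)+", text)

print(captured)  # ['ab', 'ab']
print(whole)     # ['abab', 'ab']
```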

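regex.findall(string[, pos[, endpos]])
Purpose: find everything the regular expression matches
Parameters: string  the target string
pos  index in the target string where matching starts (default 0)
endpos  index in the target string where matching stops (default: the end of the string)
Returns: a list containing the matched content
If the pattern contains groups, only the content matched by the groups is returned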

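import re

pattern = r'(ab)cd(ef)+'

# Call findall through the re module
l = re.findall(pattern, 'abcdef==abcdef')
print(l)

# Call findall on a compiled regular expression object,
# limiting the search to indexes 0 through 5
regex = re.compile(pattern)
l = regex.findall('abcdef==abcdef', 0, 6)
print(l)

Output:

[('ab', 'ef'), ('ab', 'ef')]
[('ab', 'ef')]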

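re.split(pattern, string, maxsplit=0, flags=0)
Purpose: split a string wherever the regular expression matches
Parameters: pattern  the regular expression
string  the target string
maxsplit  maximum number of splits (0, the default, means no limit)
Returns: a list of the resulting substrings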
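A minimal sketch of re.split with more than one delimiter, plus the effect of maxsplit and of a capturing group in the pattern. The sample string is made up for illustration.

```python
import re

text = "a, b;c,d"

# Split on a comma or semicolon followed by optional whitespace.
parts = re.split(r"[,;]\s*", text)

# maxsplit limits the number of splits; the remainder stays in one piece.
limited = re.split(r"[,;]\s*", text, maxsplit=2)

# A capturing group in the pattern keeps the delimiters in the result.
with_seps = re.split(r"([,;])\s*", text)

print(parts)      # ['a', 'b', 'c', 'd']
print(limited)    # ['a', 'b', 'c,d']
print(with_seps)  # ['a', ',', 'b', ';', 'c', ',', 'd']
```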

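re.sub(pattern, repl, string, count=0, flags=0)
Purpose: replace the content matched by the regular expression with the replacement string
Parameters: pattern  the regular expression
repl  the replacement content
string  the target string to search
count  maximum number of replacements (0, the default, replaces all matches)
Returns: the string after replacement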

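re.subn(pattern, repl, string, count=0, flags=0)
Purpose: replace the content matched by the regular expression with the replacement string
Parameters: pattern  the regular expression
repl  the replacement content
string  the target string to search
count  maximum number of replacements (0, the default, replaces all matches)
Returns: a tuple of the string after replacement and the number of replacements actually made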
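The replacement string can also refer back to groups captured by the pattern, using \1, \2 and so on. The sketch below reorders a date this way. The sample text and pattern are made up for illustration.

```python
import re

text = "start 2024-01-15, end 2024-02-01"
date = r"(\d{4})-(\d{2})-(\d{2})"

# \3/\2/\1 rebuilds each match as day/month/year from the captured groups.
swapped = re.sub(date, r"\3/\2/\1", text)

# subn returns the new string together with the number of replacements made.
first_only, n = re.subn(date, r"\3/\2/\1", text, count=1)

print(swapped)        # start 15/01/2024, end 01/02/2024
print(first_only, n)  # start 15/01/2024, end 2024-02-01 1
```

Passing count by keyword keeps the call unambiguous, since the positional slot after string is easy to confuse with flags.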

import re

pattern = r'(ab)cd(ef)+'

# Call findall through the re module
l = re.findall(pattern, 'abcdef==abcdef')
print(l)

# Call findall on a compiled regular expression object
regex = re.compile(pattern)
l = regex.findall('abcdef==abcdef', 0, 6)
print(l)

# Split a string at the matched content
l = re.split(r'\s+', 'hello world\nnihao china')
print(l)

# Replace matched content in the target string
s = re.sub(r'\s+', '##', 'hello world')
print(s)
s = re.sub(r'\s+', '##', 'hello world nihao')
print(s)
s = re.sub(r'\s+', '##', 'hello world nihao', count=1)
print(s)

# Return the replaced string and the number of replacements made
s = re.subn(r'\s+', '##', 'hello world nihao')
print(s)

Output:

[('ab', 'ef'), ('ab', 'ef')]
[('ab', 'ef')]
['hello', 'world', 'nihao', 'china']
hello##world
hello##world##nihao
hello##world nihao
('hello##world##nihao', 2)

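re.finditer(pattern, string, flags=0)
Purpose: like findall, find everything the regular expression matches
Parameters: pattern  the regular expression
string  the target string
Returns: an iterator over the matches; each item is a match object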
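A short sketch of iterating over finditer results. Unlike findall, each item is a match object, so positions are available as well as text. The sample string is made up for illustration.

```python
import re

text = "a12b345"

# Collect the matched text and its (start, end) span for every match.
found = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]

print(found)  # [('12', (1, 3)), ('345', (4, 7))]
```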

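re.fullmatch(pattern, string, flags=0)
Purpose: match the entire target string
The whole target string must be matched by the regular expression
Parameters: pattern  the regular expression
string  the target string
Returns: a match object, or None if there is no match

re.match(pattern, string, flags=0)
Purpose: match at the beginning of the target string
Parameters: pattern  the regular expression
string  the target string
Returns: a match object, or None if there is no match

re.search(pattern, string, flags=0)
Purpose: match anywhere in the target string
Like match, but the match may start at any position; only the first match is returned
Parameters: pattern  the regular expression
string  the target string
Returns: a match object, or None if there is no match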

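* fullmatch, match, and search return None when nothing matches, and None has none of the match object's attributes, so you usually need to check the result (or handle the resulting exception) before using it.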
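The difference between the three functions, and the None check mentioned above, can be sketched as follows. The helper first_number is a hypothetical name introduced only for this example.

```python
import re

text = "abc123"

m_match = re.match(r"\d+", text)               # None: text does not start with a digit
m_search = re.search(r"\d+", text)             # finds '123' further along
m_full = re.fullmatch(r"\d+", text)            # None: the whole string is not digits
m_full_ok = re.fullmatch(r"[a-z]+\d+", text)   # the pattern covers the whole string


def first_number(s):
    """Return the first run of digits in s, or None if there is none."""
    m = re.search(r"\d+", s)
    if m is None:
        # Calling m.group() here would raise AttributeError.
        return None
    return m.group()


print(m_match, m_full)             # None None
print(m_search.group())            # 123
print(m_full_ok.group())           # abc123
print(first_number("no digits"))   # None
```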

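Other attributes of the regular expression object

flags  the flag bits
pattern  the regular expression
groups  the number of groups in the pattern
groupindex  a dictionary of the named groups, with the group name as key and the group number as value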

In [85]: regex = re.compile(r'abc')

In [86]: regex.flags
Out[86]: 32

In [87]: regex.pattern
Out[87]: 'abc'

In [88]: regex = re.compile(r'(ab)cd(ef)')

In [89]: regex.groups
Out[89]: 2

In [90]: regex.groupindex
Out[90]: mappingproxy({})

In [91]: regex = re.compile(r'(?P<dog>ab)cd(ef)')

In [92]: regex.groupindex
Out[92]: mappingproxy({'dog': 1})

Exercise: read a text file and match every word in it that starts with an uppercase letter.

Create a file test.txt containing:
Hello world
China
This is a test File.

In [1]: import re

In [2]: with open('./test.txt', 'r') as f:
   ...:     text = f.read()

In [3]: re.findall(r'\b[A-Z]\w*', text)
Out[3]: ['Hello', 'China', 'This', 'File']

Practice the metacharacters until you can use them fluently.

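Review:
What regular expressions are for
Using metacharacters
Escaping, greedy matching, and grouping in regular expressions
Working with regular expressions through the re module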

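Attributes of the match object

pos  start position of the search in the target string
endpos  end position of the search in the target string
re  the regular expression object
string  the target string
lastgroup  name of the last matched group
lastindex  number of the last matched group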

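Methods

start()  start position of the matched content
end()  end position of the matched content
span()  start and end positions of the matched content

group(n=0)
Purpose: get the content matched by the match object
Parameters: the default 0 returns the match for the whole regular expression
1, 2, 3, ... returns the content matched by that group
Returns: the matched string

groups()  the content matched by all groups
groupdict()  a dictionary of the content matched by all named groups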

Demonstrating the match object

import re

regex = re.compile(r'(ab)cd(ef)(?P<dog>)')
match_obj = regex.search('abcdefghij')

# Start and end positions of the search in the target string
print(match_obj.pos)
print(match_obj.endpos)

0
10

# The original regular expression object and the target string
print(match_obj.re)
print(match_obj.string)

re.compile('(ab)cd(ef)(?P<dog>)')
abcdefghij

# Information about the last group
print(match_obj.lastgroup)
print(match_obj.lastindex)

dog
3

# Start and end positions of the matched content
print(match_obj.start())
print(match_obj.end())
print(match_obj.span())

0
6
(0, 6)

# The matched content
print(match_obj.group())
print(match_obj.group(2))

abcdef
ef

# The group contents
print(match_obj.groups())
print(match_obj.groupdict())

('ab', 'ef', '')
{'dog': ''}
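The named group in the demo above is empty, so groupdict() has little to show. As a complementary sketch, the example below uses named groups that actually capture text. The pattern, group names, and sample string are made up for illustration.

```python
import re

regex = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = regex.search("released 2024-05")

year = m.group("year")     # a group can be fetched by name as well as by number
info = m.groupdict()       # all named groups as a dictionary
where = m.span()           # position of the whole match in the target string

print(year)                        # 2024
print(info)                        # {'year': '2024', 'month': '05'}
print(where)                       # (9, 16)
print(m.lastgroup, m.lastindex)    # month 2
```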

The flags parameter

Functions that accept flags:

re.compile
re.findall
re.search
re.match
re.finditer
re.fullmatch
re.sub
re.subn
re.split

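Purpose: assist the regular expression and extend what it can match

A = ASCII  character classes such as \w, \d and \s match ASCII characters only
S = DOTALL  the metacharacter . also matches \n
I = IGNORECASE  ignore case
L = LOCALE  locale-dependent matching (bytes patterns only)
M = MULTILINE  the metacharacters ^ and $ match at the start and end of every line
T = TEMPLATE  undocumented; deprecated since Python 3.11
U = UNICODE  Unicode matching (the default for str patterns)
X = VERBOSE  allows whitespace and comments inside the pattern
To use several flags at once, combine them with |
re.I|re.S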
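A minimal sketch of the most commonly used flags, one at a time and then combined with |. All sample strings and patterns are made up for illustration.

```python
import re

# re.I: ignore case
ignore_case = re.findall(r"hello", "Hello HELLO hello", re.I)

# re.M: ^ matches at the start of every line, not only the start of the string
line_starts = re.findall(r"^\w+", "one two\nthree four", re.M)

# re.S: . also matches a newline
without_dotall = re.search(r"a.b", "a\nb")
with_dotall = re.search(r"a.b", "a\nb", re.S)

# re.X: whitespace in the pattern is ignored and # starts a comment
phone = re.compile(r"""
    \d{3}   # three-digit prefix
    -       # separator
    \d{4}   # four-digit line number
""", re.X)
phones = phone.findall("call 555-1234 or 555-9876")

# Several flags combined with |
combined = re.findall(r"^hello.world$", "HELLO\nWORLD", re.I | re.S)

print(ignore_case)     # ['Hello', 'HELLO', 'hello']
print(line_starts)     # ['one', 'three']
print(without_dotall)  # None
print(phones)          # ['555-1234', '555-9876']
print(combined)        # ['HELLO\nWORLD']
```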