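Python regular expression module: re
Regular expression (pattern) object: returned by re.compile()
Matched string (match object): returned by group() or group(0); group(N) (N >= 1) returns the text of the N-th capturing group; groups() returns a tuple with the text of all groups
Start position of the match: returned by start()
End position of the match: returned by end()
Start and end positions of the match: returned by span() as a tuple (start, end)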
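A minimal sketch of the match object accessors listed above. The pattern and the sample string are made up for illustration only.

```python
import re

# Compile once; the pattern has two capturing groups.
pattern = re.compile(r"(\d+)-(\d+)")
m = pattern.search("range 10-25 ok")

whole = m.group()              # same as m.group(0): the entire matched text
first = m.group(1)             # text of capturing group 1
both = m.groups()              # tuple with the text of every group
bounds = (m.start(), m.end())  # the same pair that m.span() returns

print(whole, first, both, bounds, m.span())
```

Note that groups() never includes the whole match; only group() / group(0) returns it.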
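Regular expression pattern syntax:
^: matches the start of the string
$: matches the end of the string (or just before a trailing newline)
.: matches any character except a newline, unless re.DOTALL (re.S) is set
[...]: a set of characters, any one of which may match; [123] matches 1, 2, or 3
[^...]: any character not in the set; [^123] matches any character other than 1, 2, 3
re*: matches 0 or more repetitions of the preceding expression; greedy
re+: matches 1 or more repetitions of the preceding expression; greedy
re?: matches 0 or 1 occurrence of the preceding expression; also greedy (append ? to a quantifier, as in re*?, re+?, re??, to make it non-greedy)
re{n}: matches exactly n repetitions of the preceding expression; 1{2} matches the "11" in 2113
re{n,}: matches n or more repetitions; 1{2,} matches the "1111" in 211113; 1{1,} is equivalent to 1+, and 1{0,} to 1*
re{n,m}: matches n to m repetitions of the preceding expression; greedy
a|b: matches a or b
(re): matches the expression inside the parentheses and captures it as a group
(?ims): inline flags i, m, s; in Python these apply to the whole pattern and should appear at its start
(?-ims): turns the flags off in other regex flavors; Python's re only accepts the scoped form (?-ims:re)
(?:re): like (...), but does not create a group
(?ims:re): turns the flags on only inside the parentheses (Python 3.6+)
(?-ims:re): turns the flags off only inside the parentheses (Python 3.6+)
(?#...): comment
(?=re): positive lookahead assertion; succeeds if re matches at the current position, without consuming any text
(?!re): negative lookahead assertion, the opposite of the above; succeeds if re does not match at the current position
(?>re): atomic group; matches independently and gives up its backtracking positions (Python 3.11+)
\w: matches letters, digits, and underscore
\W: matches any character that is not a letter, digit, or underscore
\s: matches any whitespace character; equivalent to [ \t\n\r\f\v] for ASCII
\S: matches any non-whitespace character
\d: matches any digit; equivalent to [0-9] for ASCII
\D: matches any non-digit
\A: matches only at the start of the string
\Z: matches only at the end of the string in Python (in Perl-style flavors it also matches before a final newline)
\z: matches only at the very end of the string in Perl-style flavors; older Python versions do not accept it, so use \Z
\G: matches where the previous match ended in other flavors; not supported by Python's re
\b: matches the empty string, but only at the start or end of a word
\B: matches the empty string, but only when not at the start or end of a word
\n: matches a newline
\t: matches a tab
\1...\9: matches the content of the n-th group (backreference)
\10: matches the content of group 10; in Python's re an escape is read as an octal character only when it starts with 0 or is three octal digits long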
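A short sketch of a few entries from the table above that are easy to misread: \b, the \1 backreference, and the difference between $ and \Z in Python. The sample strings are invented for illustration.

```python
import re

# \b: whole-word match; the "cat" inside "concat" is skipped.
words = re.findall(r"\bcat\b", "cat concat cat.")

# \1: backreference to group 1; finds an immediately repeated word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

# $ may match just before a trailing newline; \Z matches only at the very end.
dollar = re.search(r"end$", "the end\n")
strict = re.search(r"end\Z", "the end\n")

print(words, doubled, dollar is not None, strict is None)
```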
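Regular expression flags (optional):
re.I: ignore case
re.L: locale-aware matching (bytes patterns only in Python 3)
re.M: multi-line mode; affects ^ and $
re.S: makes . match every character, including a newline
re.U: interprets character classes by the Unicode character set; affects \w, \W, \b, \B (already the default for str patterns in Python 3)
re.X: verbose mode; whitespace in the pattern is ignored and # starts a comment, so a pattern can be laid out in a more readable way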
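A small sketch of how re.X and re.M behave, since those two flags are the least obvious in the list above. The date pattern and the sample text are assumptions chosen for the demonstration.

```python
import re

# re.X (re.VERBOSE): whitespace and # comments inside the pattern are ignored.
date = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.X)
parts = date.search("released 2020-03-15").groups()

# re.M: ^ and $ also match at the start and end of every line.
text = "one\ntwo\nthree"
first_letters_default = re.findall(r"^\w", text)
first_letters_multiline = re.findall(r"^\w", text, re.M)

# Flags are combined with the | operator.
mixed = re.findall(r"^t\w+", text.upper(), re.I | re.M)

print(parts, first_letters_default, first_letters_multiline, mixed)
```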
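Operator precedence (highest to lowest):
1. \  escape
2. (), []  parentheses and brackets
3. *, +, ?, {n}, {n,}, {n,m}  quantifiers
4. ^, $, metacharacters, any literal character  anchors and sequences
5. |  alternation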
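The precedence list above can be checked with a quick sketch: alternation binds loosest, and a quantifier binds only to the single item before it. The test strings are illustrative.

```python
import re

# | binds loosest: 'ab|cd' means (ab) or (cd), not a(b|c)d.
loose = re.findall(r"ab|cd", "ab cd abd acd")
grouped = re.findall(r"a(?:b|c)d", "ab cd abd acd")

# A quantifier binds to the single item before it: 'ab*' is a(b*), not (ab)*.
single = re.search(r"ab*", "ababab").group()
repeated = re.search(r"(?:ab)*", "ababab").group()

print(loose, grouped, single, repeated)
```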
Pattern tests:
# search() returns the first match, scanning from the left; use findall() to get all matches as a list
# when nothing matches, search() returns None
import re
st = "11243455671\n2"
print(st)
# ^ anchors the match at the start of the string; the characters to match follow it
print(re.search('^1', "1231", re.S).group(), re.search('^1', "1231", re.S).span())
# $ anchors the match at the end of the string; the characters to match precede it
print(re.search('.$', "re_$ is test", re.S).group(), re.search('.$', "re_$ is test", re.S).span())
# . matches any single character; with re.S (re.DOTALL) it also matches the newline \n
print(re.search('hon.', "python\nneed", re.S).group(), re.search('hon.', "python\nneed", re.S).span())
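print(re.search('hon.', "python\nneed"))  # without re.S there is no match and the result is None; calling group() or span() on it raises AttributeError
# [...] matches a set of characters
# - joins two characters into a range: [a-z] matches any lowercase ASCII letter
# [a-zA-Z] matches any uppercase or lowercase ASCII letter
# [0-9][0-9] matches any two-digit number from 00 to 99
# [0-9A-Fa-f] matches any hexadecimal digit
# an escaped \-, or a - placed first or last in the set ([-x] or [x-]), is a literal hyphen
# special characters lose their special meaning inside a set: [(+*)] matches only the literal characters '(', '+', '*', ')'
# a leading ^ negates the set: [^a] matches anything except 'a'; [^^] matches anything except '^'; a ^ that is not first is a literal character
# to match a literal ']', escape it or place it first in the set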
print(re.search('[a-z]', "12345abc").group(), re.search('[a-z]', "12345abc").span())
print(re.findall('[a-z]', "12345abc"))
print(re.search('[0-9][0-9]', "1a23123c").group(), re.search('[a-z]', "1a23123c").span())  # note: span() here comes from '[a-z]' ('a' at (1, 2)); the span of '23' itself is (2, 4)
print(re.findall('[0-9][0-9]', "1a23123c"))
# print(re.findall('[0-9A-Fa-f]', "8c4f5d"))
print(re.findall(r'[a\-c]', "12345abc"))
print(re.findall('[-c]', "12345abc-"))
print(re.findall('[a-]', "12345abc-"))
print(re.findall('[(a*?)]', "(12345abc-*+)"))
print(re.findall('[^^ab12]', "12345^abc"))
print(re.findall(r'[{[()\]}]', "{[(abc123)]}"))
print(re.findall('[]{[()}]', "{[(abc123)]}"))
# [^xyz] matches any single character other than x, y, z
print(re.search('[^13ab]', "12345abc").group(), re.search('[^13ab]', "12345abc").span())
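# re* matches 0 or more repetitions of the preceding expression [ab* matches 'a', 'ab', or 'a' followed by any number of 'b's, per the official docs]; the 'a' must be present in the string
# a bare pattern such as a* looked confusing in tests: it can match the empty string at position 0, so search() returns '' whenever the string does not start with 'a'; the examples here follow the form used in the official docs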
print(re.search('45*', "abc434556", re.S).group(), re.search('45*', "abc434556", re.S).span())
print(re.search('45*', "abc345d", re.S).group(), re.search('45*', "abc345d", re.S).span())
print(re.search('45*', "abc345555d", re.S).group(), re.search('45*', "abc345555d", re.S).span())
print(re.findall('45*', "abc434556"))
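# re+ matches 1 or more repetitions of the preceding expression [ab+ matches 'a' followed by one or more 'b's; it does not match a bare 'a', per the official docs]; the 'a' must be present in the string
# a bare pattern such as a+ also works here, since it never matches the empty string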
print(re.search('45+', "abc434556", re.S).group(), re.search('45+', "abc434556", re.S).span())
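# re? matches 0 or 1 repetition of the preceding expression [ab? matches 'a' or 'ab', per the official docs]; the 'a' must be present in the string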
print(re.search('45?', "abc434556").group())
print(re.search('45?', "abc4556").group())
# *?, +?, ??: the '*', '+', and '?' quantifiers are all greedy; adding ? after a quantifier makes it match in non-greedy (minimal) fashion
print(re.search('<.*>', "<aa> <b> <c>").group())
print(re.search('<.*?>', "<aa> <b> <c>").group())
print(re.search('34*', "123445").group())
print(re.search('34*?', "123445").group())
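# re{m} matches exactly m repetitions [a{2} matches exactly two 'a's]; the quantifier applies only to the single character or set before it, so repeating a multi-character string needs a group, e.g. (ab){2}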
print(re.search("a{2}", '11aAa1aa1', re.I).group(), re.search("a{2}", '11aAa1aa1', re.I).span())
print(re.search("[abc]{2}", '11aAa1aa1', re.I).group(), re.search("[abc]{2}", '11aAa1aa1', re.I).span())
# re{m,n} matches from m to n repetitions, as many as possible [a{3,5} matches 3 to 5 'a's]; it also applies only to the single character or set before it
print(re.search("a{3,5}", '11aAa1aa1', re.I).group(), re.search("a{3,5}", '11aAa1aa1', re.I).span())
print(re.search('[abc]{3,5}', "11aAa1aa1", re.I).group(), re.search('[abc]{3,5}', "11aAa1aa1", re.I).span())
# re{m,} matches m or more repetitions [a{2,} matches two or more 'a's]; it also applies only to the single character or set before it
print(re.search("a{2,}", '11aAa1aa1', re.I).group(), re.search("a{2,}", '11aAa1aa1', re.I).span())
print(re.search('[abc]{2,}', "11aAa1aa1", re.I).group(), re.search('[abc]{2,}', "11aAa1aa1", re.I).span())
# \ escapes a character; alternatively use a raw string with the r prefix
print("\n3")
print("\\n3")
print(r"\n3")
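# |: A|B matches A or B, where A and B can be arbitrary regular expressions; alternatives are tried left to right, and once A matches, B is not tried; to match a literal '|', escape it as \| or put it in a set [|]
print(re.findall('arty|123', "abc123456"))  # A and B do not overlap; only B occurs in the string: matches B, ['123']
print(re.findall('abc123|123', "abc123456"))  # B is contained in A; A occurs in the string: matches A, ['abc123']
print(re.findall('abc|abc123', "abc123456"))  # A is a prefix of B; A occurs in the string: matches A, ['abc']
print(re.findall('abc|23', "abc123456"))  # A and B do not overlap; both occur in the string: matches both, ['abc', '23']
print(re.findall('abc|bc23', "abc123456"))  # A and B overlap; only A occurs in the string: matches A, ['abc']
print(re.findall(r'ab\|', "ab|c"))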
print(re.findall('[|]', "ab|c"))
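# (...) matches the regular expression inside the parentheses and marks the start and end of a group; after the match, the group's content can be retrieved, and it can be matched again later in the pattern with the \number escape
# to match a literal '(' or ')', use \( or \), or put it inside a set: [(], [)]
print(re.findall('(abc[(].*)', "abc(123456)"))
print(re.findall(r'(abc\(.*)', "abc(123456)"))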
print(re.findall('(abc.*5)', "abc(123456)"))
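# (?<=...) positive lookbehind assertion: matches at the current position only if the text just before it matches ...; the lookbehind text is not part of the result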
print(re.findall('(?<=abc)123456', 'abc123456'))
# (?=...) lookahead assertion: the preceding part matches only if ... matches next; the lookahead text is not consumed
print(re.findall('abc(?=123456)', 'abc123456'))
# (?!...) negative lookahead assertion: the preceding part matches only if ... does not match next
print(re.findall('abc(?!123456)', 'abc1234'))
print(re.findall('(.*)(?#23)(56)', "abc123456"))
D:\mypy\venv\Scripts\python.exe D:/mypy/test_re.py
11243455671
2
1 (0, 1)
t (11, 12)
hon
(3, 7)
None
a (5, 6)
['a', 'b', 'c']
23 (1, 2)
['23', '12']
['a', 'c']
['c', '-']
['a', '-']
['(', 'a', '*', ')']
['3', '4', '5', 'c']
['{', '[', '(', ')', ']', '}']
['{', '[', '(', ')', ']', '}']
2 (1, 2)
4 (3, 4)
45 (4, 6)
45555 (4, 9)
['4', '455']
455 (5, 8)
4
45
<aa> <b> <c>
<aa>
344
3
aA (2, 4)
aA (2, 4)
aAa (2, 5)
aAa (2, 5)
aAa (2, 5)
aAa (2, 5)
3
\n3
\n3
['123']
['abc123']
['abc']
['abc', '23']
['abc']
['ab|']
['|']
['abc(123456)']
['abc(123456)']
['abc(12345']
['123456']
['abc']
['abc']
[('abc1234', '56')]
Process finished with exit code 0
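The test comments above mention two points of confusion: what a bare a* matches, and whether {m} can repeat a whole string. A minimal sketch of both, using made-up sample strings:

```python
import re

# 'a*' can match zero characters, so search() succeeds with an empty match at position 0.
empty = re.search(r"a*", "bbaa")
# 'a+' needs at least one 'a', so it skips ahead to the first one.
plus = re.search(r"a+", "bbaa")

# {m} applies to the item right before it; wrap a string in a group to repeat it.
last_char_only = re.search(r"ab{2}", "abab abb").group()
whole_string = re.search(r"(?:ab){2}", "abab abb").group()

print(repr(empty.group()), empty.span(), plus.group(), plus.span())
print(last_char_only, whole_string)
```

So the quantifiers do work on strings, provided the string is wrapped in (...) or (?:...) first.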
Functions:
re.split(pattern, string, maxsplit=0, flags=0): like the string method str.split(sep), but splits the string on matches of a regular expression; returns a list
def split(pattern, string, maxsplit=0, flags=0):
    """Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list."""
    return _compile(pattern, flags).split(string, maxsplit)
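The docstring above mentions capturing parentheses and maxsplit; a brief sketch of both behaviors, with an illustrative input string:

```python
import re

# Plain split: the separators (runs of digits) are dropped.
plain = re.split(r"\d+", "a1b22c333d")

# With a capturing group, the separator text is kept in the result list.
with_seps = re.split(r"(\d+)", "a1b22c333d")

# maxsplit=1: split once, the remainder is returned as the last element.
limited = re.split(r"\d+", "a1b22c333d", maxsplit=1)

print(plain, with_seps, limited)
```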
re.match(pattern, string, flags=0)
def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)
re.search(pattern, string, flags=0)
def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)
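A short sketch contrasting match() and search(), plus one way to guard against the None result. The helper first_match is hypothetical and not part of the re module.

```python
import re

text = "wqaae"
at_start = re.match(r"aa", text)   # match() only tries position 0
anywhere = re.search(r"aa", text)  # search() scans the whole string


def first_match(pattern, string):
    """Hypothetical helper: return the matched text, or None if nothing matches."""
    m = re.search(pattern, string)
    return m.group() if m else None


print(at_start, anywhere.span(), first_match(r"\d+", "abc"), first_match(r"\d+", "abc42"))
```

Checking the result before calling group() avoids the AttributeError raised on None.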
re.sub(pattern, repl, string, count=0, flags=0)
def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)
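A sketch of the repl forms the docstring describes: backslash group references in a string, named references, the count argument, and a callable. All inputs are invented for illustration.

```python
import re

# \1 and \2 in the replacement string refer to the captured groups.
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

# Named groups are referenced with \g<name>.
named = re.sub(r"(?P<key>\w+)=(?P<val>\w+)", r"\g<val>=\g<key>", "a=1 b=2")

# count=1 replaces only the leftmost occurrence.
once = re.sub(r"\d", "#", "a1b2c3", count=1)

# A callable receives the match object and returns the replacement text.
upper = re.sub(r"[a-z]+", lambda m: m.group().upper(), "ab1cd")

print(swapped, named, once, upper)
```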
re.compile(pattern, flags=0)
def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)
re.findall(pattern, string, flags=0)
def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.
    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.
    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)
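The shape of the findall() result depends on how many capturing groups the pattern has, as the docstring says. A minimal sketch with an illustrative string:

```python
import re

text = "x=1, y=22"

no_group = re.findall(r"\w=\d+", text)           # whole matches
one_group = re.findall(r"\w=(\d+)", text)        # only the group text
two_groups = re.findall(r"(\w)=(\d+)", text)     # tuples, one item per group
non_capturing = re.findall(r"(?:\w)=\d+", text)  # (?:...) does not count as a group

print(no_group, one_group, two_groups, non_capturing)
```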
re.finditer(pattern, string, flags=0)
def finditer(pattern, string, flags=0):
    """Return an iterator over all non-overlapping matches in the
    string.  For each match, the iterator returns a match object.
    Empty matches are included in the result."""
    return _compile(pattern, flags).finditer(string)
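Because finditer() yields match objects, positions are available as well as text. The sketch below calls it on a compiled pattern, which works the same way as the module-level function; the variable names are illustrative.

```python
import re

number = re.compile(r"\d+")

# Each item is a match object, so both the text and the span can be read.
found = [(m.group(), m.span()) for m in number.finditer("aa12dd34ff56gg")]

print(found)
```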
Function tests:
import re
a = re.match('aa', "wqaae")  # the pattern is not at the start of the string: returns None
print(a)
a = re.match('aa', "aawqxe")  # the pattern is at the start of the string: returns a match object
print(a)
a = re.match('aa', "aawqxe").span()  # span of the match as a tuple (start, end)
print(a)
a = re.match('aa', "aaaqxe").start()  # start position of the match
print(a)
a = re.match('aa', "aaaqxe").end()  # end position of the match
print(a)
b = re.search('bb', 'ccbbvdxb').group()  # search() scans the whole string; group() returns the matched text
print(b)
c = re.sub('#.*$', ".", "python re.sub is testing #测试code")  # returns the new string directly
print(c)
def changed(matched):
    value = int(matched.group('value'))
    return str(value * 2)
c = re.sub(r'(?P<value>\d+)', changed, "python1is2so3func4")  # repl can also be a function
print(c)
patt = re.compile('([a-z,_]+) ([a-z]+)', re.I)  # builds a reusable pattern object with its own match(), search(), findall() methods
d = patt.match('compile_to_patt test going')
print(d)
print(d.group(0))
print(d.group(1))
print(d.group(2))
dd = patt.search('compile_to_patt test going', 5, 50)
print(dd.group(2))
ddd = patt.findall('compile_to_patt test going')  # returns a list of the matched strings, or an empty list if nothing matches
print(ddd)
e = re.findall('([a-z,_]+) ([a-z]+)', 're_findall test going')  # returns a list of the matched strings
print(e)
f = re.finditer(r'\d+', 'aa12dd34ff56gg')  # returns an iterator of match objects; the module-level function requires the pattern argument, while a compiled pattern offers patt.finditer(string)
print(type(f))
for x in f:
    print(x.group(0))
Output:
D:\mypy\venv\Scripts\python.exe D:/mypy/myre.py
None
<_sre.SRE_Match object; span=(0, 2), match='aa'>
(0, 2)
0
2
bb
python re.sub is testing .
python2is4so6func8
<_sre.SRE_Match object; span=(0, 20), match='compile_to_patt test'>
compile_to_patt test
compile_to_patt
test
test
[('compile_to_patt', 'test')]
[('re_findall', 'test')]
<class 'callable_iterator'>
12
34
56
Process finished with exit code 0
Practical application:
import re
g = re.split(r'\d+', "a1b2c3d4")  # split the string on every match of the pattern
print(g)
print("1 2 3 4".split())
st = "life is short i learn python 人生苦短 我学派森 1 2 1"
st_to_list = st.split(" ")
print(st_to_list)
re_list = re.findall(r'[a-zA-Z]+|\d+', st)  # match runs of letters and runs of digits
for x in re_list:
    st_to_list.remove(x)
list_to_st = ','.join(st_to_list)
print(re_list)
print(list_to_st)
Output:
D:\mypy\venv\Scripts\python.exe D:/mypy/tmp.py
['a', 'b', 'c', 'd', '']
['1', '2', '3', '4']
['life', 'is', 'short', 'i', 'learn', 'python', '人生苦短', '我学派森', '1', '2', '1']
['life', 'is', 'short', 'i', 'learn', 'python', '1', '2', '1']
人生苦短,我学派森
Process finished with exit code 0
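The same separation can be written without removing items from the list one by one. This is an alternative sketch, not the approach used above: it assumes Python 3.4+ for fullmatch(), and the names ascii_token, matched, and rest are illustrative.

```python
import re

st = "life is short i learn python 人生苦短 我学派森 1 2 1"
tokens = st.split()

# A token counts as "ASCII" if it consists entirely of letters or entirely of digits.
ascii_token = re.compile(r"[a-zA-Z]+|\d+")

matched = [t for t in tokens if ascii_token.fullmatch(t)]
rest = [t for t in tokens if not ascii_token.fullmatch(t)]

print(matched)
print(",".join(rest))
```

Filtering with fullmatch() tests each token as a whole, so a mixed token such as "abc人生" would stay in rest instead of being partially matched.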