Python Regular Expressions

Latest recommended article published on 2024-11-13 23:26:55

数据科学家修炼之道

Tags: Python re

Special characters

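Character	Meaning
.	Dot: matches any single character except \n
^	Caret: matches the start of the string; in multiline mode (re.M) it also matches immediately after each newline
$	Matches the end of the string, or just before a newline at the end of the string; in multiline mode (re.M) it also matches before each newline
*	0 or more repetitions of the preceding expression
+	1 or more repetitions of the preceding expression
?	0 or 1 repetition of the preceding expression
{m}	Exactly m repetitions
{m,n}	From m to n repetitions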

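\	Either escapes a special character so that it matches literally, or begins a special sequence
[]	Matches any one of the characters inside the brackets. Special characters lose their special meaning inside [] and are treated literally, although classes such as \w and \s still work. A ^ as the first character negates the set. To match a literal ] inside [], escape it with a backslash or place it first in the set; [ needs no escaping.
A|B	Matches either A or B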
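
The character-set rules can be checked with a short sketch. The sample strings here are made up for illustration and do not come from the original article:

```python
import re

# Inside [], most metacharacters are literal: this set matches '.', '*' or '+'.
literals = re.findall(r'[.*+]', 'a.b*c+d')
print(literals)   # ['.', '*', '+']

# A leading ^ negates the set: every character that is not a digit.
non_digits = re.findall(r'[^0-9]', 'a1b2')
print(non_digits)   # ['a', 'b']

# A literal ']' is either escaped or placed first in the set.
escaped = re.findall(r'[\]x]', 'x]y')
placed_first = re.findall(r'[]x]', 'x]y')
print(escaped, placed_first)   # ['x', ']'] ['x', ']']

# \w keeps its meaning inside a set.
word_or_dash = re.findall(r'[\w-]+', 'foo-bar baz')
print(word_or_dash)   # ['foo-bar', 'baz']
```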

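The quantifiers *, +, ? and {m,n} are all greedy: they match as much text as possible. Appending a ? to any of them (*?, +?, ??, {m,n}?) makes the quantifier non-greedy, so it matches as little text as possible.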
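
A minimal sketch of the difference between greedy and non-greedy matching. The HTML-like sample string is an assumption chosen for illustration:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last '>' in the string, so there is one long match.
greedy = re.findall(r'<.+>', html)
print(greedy)   # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .+? stops at the first '>' it can, so each tag matches separately.
lazy = re.findall(r'<.+?>', html)
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
```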

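import re
pattern = re.compile(r'123')
# findall returns every non-overlapping occurrence of '123'
print(pattern.findall('123 123456 1237789 123123123'))  # ['123', '123', '123', '123', '123', '123']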

Special sequences

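Sequence	Meaning
\number	Matches the contents of the group with that number, counting from 1 (for example, \8 matches whatever the 8th parenthesized group matched)
\A	Matches only at the start of the string
\b	Matches the empty string at a word boundary: between a \w and a \W character, or at the start or end of the string next to a word character
\B	Matches the empty string wherever \b does not, i.e. not at a word boundary (this is not [^\b]; inside [] \b means the backspace character)
\d	A digit
\D	A non-digit
\s	A whitespace character such as \n, \t or a space
\S	A non-whitespace character
\w	A word character: [A-Za-z0-9_] in ASCII mode; for str patterns in Python 3 it also matches Unicode word characters unless re.ASCII is set
\W	A non-word character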
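
The following sketch exercises a few of the sequences from the table. The sample text is invented for illustration:

```python
import re

text = 'cat concat cat9 the the end'

# \b needs a word boundary on both sides, so 'concat' and 'cat9' are skipped.
whole_words = re.findall(r'\bcat\b', text)
print(whole_words)   # ['cat']

# \B is the opposite: 'cat' preceded by another word character.
inner = re.findall(r'\Bcat', text)
print(inner)         # ['cat'] (the one inside 'concat')

# \d+ matches runs of digits.
digits = re.findall(r'\d+', text)
print(digits)        # ['9']

# \1 refers back to group 1, which finds a word repeated twice in a row.
repeated = re.search(r'\b(\w+) \1\b', text)
print(repeated.group(0), '|', repeated.group(1))   # the the | the
```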

For details, see

Using the re module

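Python supports regular expressions through the re module. The usual workflow is to compile the regular expression string into a Pattern instance, use that Pattern to process text and obtain a result (a Match instance, or None when nothing matches), and finally query the Match instance for information and carry out any further operations.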
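
The workflow described above can be sketched roughly as follows. The date pattern and the sample sentence are assumptions made for this example:

```python
import re

# Step 1: compile the pattern string into a Pattern instance.
pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

# Step 2: apply the Pattern to some text; the result is a Match or None.
match = pattern.search('released on 2024-11-13, patched later')

# Step 3: read information from the Match (always check for None first).
if match:
    print(match.group(0))   # 2024-11-13, the whole match
    print(match.groups())   # ('2024', '11', '13'), the captured groups
    print(match.span())     # (12, 22), start and end offsets
```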

Use re.compile to turn the string form of a regular expression into a Pattern instance

import re
pattern = re.compile('abc', re.IGNORECASE)
re.compile('a', re.I)  # re.I is the short alias of re.IGNORECASE

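The compile function compiles a pattern string into a Pattern object. Usage:
re.compile(pattern, flags=0)

pattern: the regular expression string to compile (e.g. 'abc')
flags: optional; one or more flags (e.g. re.I), combined with |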

The flags available in re include:

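Flag	Meaning
re.I	Ignore case
re.IGNORECASE	Same as above
re.S	Makes the dot match any character, including a newline
re.DOTALL	Same as above
re.L	Makes \w \W \b \B and case-insensitive matching depend on the current locale; in Python 3 it can only be used with bytes patterns
re.LOCALE	Same as above
re.U	Makes \w \W \b \B \s \S \d \D follow the Unicode character properties; in Python 3 this is already the default for str patterns
re.UNICODE	Same as above
re.X	Verbose mode: whitespace in the pattern is ignored, so the pattern can span several lines and contain # comments
re.VERBOSE	Same as above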

# The following two patterns are equivalent
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits
                   """, re.X)
b = re.compile(r"\d+\.\d*")
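
Flags can also be combined with the | operator. The sketch below uses an invented two-line string to show re.I and re.S working together, plus the equivalent inline form (?is):

```python
import re

text = 'Hello\nWORLD'

# Without flags: the case differs and '.' does not match the newline.
no_flags = re.search(r'hello.world', text)
print(no_flags)   # None

# re.I ignores case and re.S lets '.' match '\n'; combine them with |.
both = re.search(r'hello.world', text, re.I | re.S)
print(repr(both.group()))   # 'Hello\nWORLD'

# The same flags written inline at the start of the pattern.
inline = re.search(r'(?is)hello.world', text)
print(inline is not None)   # True
```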

Search and match functions in re

re.search(pattern, string, flags=0)

import re
pattern = re.compile('abc')
search1 = re.search(pattern, 'dfghabcsdfgabcdsfgabcsdf')  # contains 'abc'
search2 = re.search(pattern, 'reytyqertihjoaeipurhg')     # does not contain 'abc'
print('Type returned when there is a match: %s' % type(search1))
print('Type returned when there is no match: %s' % type(search2))
print('Start position of the match: %s' % search1.start())
print('End position of the match: %s' % search1.end())

re.match(pattern, string, flags=0)

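"""Even in multiline mode, match only matches at the start of the string, not at the start of each line"""
import re
match1 = re.match('123', '890237458907123345u8989123')  # '123' is not at the start
match2 = re.match('123', '1232123sdjihfghusd2398sdae')  # '123' is at the start
print('Type returned when the start matches: %s' % type(match2))
print('Type returned when the start does not match: %s' % type(match1))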
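
To illustrate the note about multiline mode, the sketch below compares re.match with re.search on a made-up two-line string:

```python
import re

text = 'first line\n123 second line'

# match is anchored to the very start of the string, even with re.M.
m = re.match(r'123', text, re.M)
print(m)   # None

# search with ^ and re.M does find '123' at the start of the second line.
s = re.search(r'^123', text, re.M)
print(s.span())   # (11, 14)
```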

re.split(pattern, string, maxsplit=0, flags=0)

import re
print(re.split('[!@#$%]', 'heh2!p;hjikd&hkjh#$h;h$;hh!;hj'))
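
Two further behaviours of re.split are worth a quick sketch: the maxsplit argument limits the number of splits, and a capturing group in the pattern keeps the separators in the result. The input strings are invented for illustration:

```python
import re

text = 'a, b;c ,d'

# Split on ',' or ';' together with any surrounding whitespace.
parts = re.split(r'\s*[,;]\s*', text)
print(parts)       # ['a', 'b', 'c', 'd']

# maxsplit=1 stops after the first split; the rest is returned unsplit.
first_only = re.split(r'\s*[,;]\s*', text, maxsplit=1)
print(first_only)  # ['a', 'b;c ,d']

# With a capturing group, the separators are included in the list.
with_seps = re.split(r'([,;])', 'a,b;c')
print(with_seps)   # ['a', ',', 'b', ';', 'c']
```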

re.findall(pattern, string, flags=0)

Finds all non-overlapping matches in the string and returns them as a list
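
What findall puts in the list depends on the capturing groups in the pattern. A small sketch with an invented key=value string:

```python
import re

text = 'x=1, y=22, z=333'

# No groups: each item is the whole match.
whole = re.findall(r'\w=\d+', text)
print(whole)    # ['x=1', 'y=22', 'z=333']

# One group: each item is the content of that group.
values = re.findall(r'\w=(\d+)', text)
print(values)   # ['1', '22', '333']

# Several groups: each item is a tuple of the groups.
pairs = re.findall(r'(\w)=(\d+)', text)
print(pairs)    # [('x', '1'), ('y', '22'), ('z', '333')]
```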

re.finditer(pattern, string, flags=0)

Finds all non-overlapping matches in the string and returns an iterator of Match objects
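
Because finditer yields Match objects, positions are available as well as the matched text. A minimal sketch with a made-up input string:

```python
import re

text = 'abc 123 def 4567'

# Collect the matched text and its (start, end) span for every run of digits.
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]
print(found)   # [('123', (4, 7)), ('4567', (12, 16))]
```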

re.sub(pattern, repl, string, count=0, flags=0)

Replaces the substrings of string that the regular expression matches

Batch string replacement in Python

import re

# re.sub accepts the pattern string directly; remove newlines, spaces and tabs
print(re.sub(r'\n| |\t', '', '123 \t456\n 789'))  # 123456789
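
Beyond a fixed replacement string, re.sub also accepts group references such as \1 in repl, a function as repl, and a count limit. The following is a sketch with invented inputs:

```python
import re

# Group references in repl: reorder a date from YYYY-MM-DD to DD/MM/YYYY.
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', '2024-11-13')
print(reordered)   # 13/11/2024

# A function as repl receives each Match and returns the replacement text.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a1 b22')
print(doubled)     # a2 b44

# count limits how many matches are replaced.
once = re.sub(r'\s+', '', 'a b c', count=1)
print(once)        # ab c
```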