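Regular Expressions with re
Basic Functions
match
re.match tries to match a pattern at the very start of a string. If the pattern does not match there, it returns None, even if the pattern occurs later in the string (for example, a pattern starting with abc only matches a string that really begins with abc).
It is typically used to check whether a string conforms to a regular expression.
re.match(pattern, string, flags=0)
pattern: the regular expression to match
string: the string to match against
flags: modifier flags that control how matching is done, such as re.I (case-insensitive) or re.M (multi-line, which changes the behavior of ^ and $)
On success it returns a match object.
Match object (re.Match; shown as SRE_Match before Python 3.7): the group method returns the matched text
Call group on the match object to retrieve a specific match.
group(num=0): returns the string matched by the whole expression. group() also accepts several group numbers at once, in which case it returns a tuple with the values of those groups.
(?P<name>...) combined with group('name') gives a group a custom name.
groups(): returns a tuple containing the strings of all subgroups, from group 1 up to the last group.
It is normally used together with ().
For example, with www.baidu.com:
re.match(r'(w)\w{2}(\.)', 'www.baidu.com').groups() returns ('w', '.')
Match object span method: returns the matched range
It returns a tuple of positions such as (0, 3), meaning from index 0 up to, but not including, index 3.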
Example:
import re
content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^Hello\s(\d+)\sWorld', content)
print(result)
print(result.group())
print(result.group(1))  # the first match enclosed in ()
print(result.span())
Output (Python 3.7+ prints re.Match instead of _sre.SRE_Match):
<_sre.SRE_Match object; span=(0, 19), match='Hello 1234567 World'>
Hello 1234567 World
1234567 # the text captured by ()
(0, 19) # 'Hello 1234567 World' runs from index 0 up to, but not including, index 19
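The group, groups, and span methods described above can be tried together in a short sketch. The group names greeting and number are made up for this illustration and are not part of the original example.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'
# Two named groups followed by one unnamed group (names are illustrative).
m = re.match(r'^(?P<greeting>\w+)\s(?P<number>\d+)\s(\w+)', content)

whole = m.group()               # the whole match, same as group(0)
pair = m.group(1, 2)            # several group numbers -> a tuple
named = m.group('number')       # look a group up by its name
all_groups = m.groups()         # every subgroup, starting at group 1
number_span = m.span('number')  # (start, end) of one group

print(whole)
print(pair)
print(named)
print(all_groups)
print(number_span)
```

Named groups still get a number, so group('number') and group(2) return the same text here.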
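search
re.search scans the whole string and returns the first successful match.
Syntax
re.search(pattern, string, flags=0)
pattern: the regular expression to match
string: the string to match against
flags: modifier flags, the same as for match
On success re.search returns a match object; otherwise it returns None.
Use the match object methods group(num) or groups() to retrieve the matched text.
group(num=0): returns the string matched by the whole expression. group() also accepts several group numbers at once, in which case it returns a tuple with the values of those groups.
groups(): returns a tuple containing the strings of all subgroups, from group 1 up to the last group, e.g. ('123456', 'World_this')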
Example:
import re
# Adjacent string literals are joined with no separator, so 'Hello' starts at index 10
content = 'first line' \
          'Hello 123456 World_this is a Regex Demo'
result = re.search(r'Hello\s(\d+)\s(\w+)', content)
print(result)
print(result.group())
print(result.group(1))
print(result.groups())
Output:
<_sre.SRE_Match object; span=(10, 33), match='Hello 123456 World_this'>
Hello 123456 World_this
123456
('123456', 'World_this')
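Example 2:
import re
line = 'Cats are smarter than dogs'
searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)  # flags: multi-line; case-insensitive
if searchObj:
    print(searchObj.groups())
    print(searchObj.group())
else:
    print('Nothing found!')
Output:
('Cats', 'smarter')
Cats are smarter than dogs
If .*? is replaced with .*, the second group becomes greedy and captures 'smarter than'.
Difference between re.search and re.match:
re.match only matches at the beginning of the string; if the beginning does not fit the regular expression, the match fails and the function returns None. re.search scans the whole string until it finds a match.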
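The difference is easy to see by running the same pattern through both functions. This is a minimal sketch; the sample text is made up for illustration.

```python
import re

text = 'first line Hello 123456'

at_start = re.match(r'Hello', text)   # 'Hello' is not at index 0
anywhere = re.search(r'Hello', text)  # scans forward until it matches

print(at_start)         # None
print(anywhere.span())  # (11, 16)
```

Because both functions may return None, it is safer to test the result before calling group() on it.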
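Search and replace: sub
sub replaces the matches of a pattern in a string; it is often used to clean up data before further regex matching.
Syntax:
re.sub(pattern, repl, string, count=0, flags=0)
pattern: the regular expression pattern string
repl: the replacement string; it may also be a function
string: the original string to search and replace in
count: the maximum number of replacements to make; the default 0 replaces every match
flags: the matching mode, given as flag constants (integer values) such as re.I
The first three parameters are required; the last two are optional.
Example: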
import re
phone = '2004-959-559 # phone number'
# Remove the trailing comment
num = re.sub(r'#.*$', '', phone)
print(num)
# Remove everything that is not a digit
num = re.sub(r'\D', '', phone)
print(num)
Output:
2004-959-559
2004959559
When repl is a function:
import re
# Double each matched number
def double(matched):
    value = int(matched.group('value'))
    return str(value * 2)
s = '1A2B3C32G12F'
print(re.sub(r'(?P<value>\d+)', double, s))  # prints 2A4B6C64G24F
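Two other features of sub are worth a quick sketch: backreferences in the replacement string and the count parameter. The date string and the other sample values are illustrative, not from the original notes.

```python
import re

# \1, \2, \3 in the replacement refer to the captured groups.
date = '2024-01-31'
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', date)

# count=1 stops after the first replacement.
first_only = re.sub(r'\d', 'X', 'a1b2c3', count=1)

print(swapped)     # 31/01/2024
print(first_only)  # aXb2c3
```

Passing count as a keyword argument keeps it from being confused with flags, which is the next positional parameter.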
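compile
The compile function compiles a regular expression into a Pattern object, whose methods such as match() and search() can then be called.
Syntax:
re.compile(pattern[, flags])
pattern: a regular expression in string form
flags: optional matching mode
re.I ignore case
re.L make \w, \W, \b, \B, \s, \S depend on the current locale (in Python 3 it is only allowed with bytes patterns)
re.M multi-line mode
re.S make '.' match any character including a newline (by default '.' does not match a newline)
re.U make \w, \W, \b, \B, \d, \D, \s, \S follow the Unicode character property database (already the default for str patterns in Python 3)
re.X verbose mode: for readability, ignore whitespace in the pattern and comments after '#'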
Example:
import re
# Matches one or more digits
pattern = re.compile(r'\d+')
m = pattern.match('1abc12d3ef45')  # matches '1' at the start of the string
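Pattern object methods:
match(string[, pos[, endpos]]):
Matches at the start of the string, or, when pos and endpos are given, only within the range from index pos up to (not including) index endpos.
Methods of the returned match object:
group([group1, …]) returns the strings matched by one or more groups; to get the whole matched substring, call group() or group(0);
start([group]) returns the start position of the group's match within the whole string (the index of its first character); the argument defaults to 0;
end([group]) returns the end position of the group's match within the whole string (the index of its last character + 1); the argument defaults to 0;
span([group]) returns (start(group), end(group)).
search(string[, pos[, endpos]])
sub(repl, string, count=0)
findall(string[, pos[, endpos]])
finditer(string[, pos[, endpos]])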
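A short sketch of how pos and endpos change what a compiled pattern sees. The sample string is made up for illustration.

```python
import re

pattern = re.compile(r'\d+')
s = 'one12twothree34four'

no_pos = pattern.match(s)          # index 0 is 'o', so no match
with_pos = pattern.match(s, 3)     # matching starts at index 3
found = pattern.search(s)          # first match anywhere
cut = pattern.findall(s, 0, 14)    # endpos=14 hides the '4' at index 14
replaced = pattern.sub('#', s)     # sub takes repl first, then the string

print(no_pos)                      # None
print(with_pos.group(), with_pos.start(), with_pos.end())
print(found.span())
print(cut)
print(replaced)
```

Positions reported by start(), end(), and span() are always relative to the whole string, even when pos is given.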
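findall
Finds every substring of the string that the regular expression matches and returns them as a list; if there is no match, it returns an empty list.
Note: match and search return a single match, while findall returns all of them.
Format:
re.findall(pattern, string, flags=0)
pattern.findall(string[, pos[, endpos]]) (method of a compiled pattern)
string: the string to match against
pos: optional; the index at which to start searching, default 0
endpos: optional; the index at which to stop searching, default the length of the string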
Example:
import re
pattern = re.compile(r'\d+')
result1 = pattern.findall('abc 123 bcd 456')
result2 = pattern.findall('a123bb456c5c789', 0, 10)
print(result1)
print(result2)
Output:
['123', '456']
['123', '456']
Pattern object:
findall(string, pos, endpos): the search includes the pos index and excludes the endpos index
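One detail the notes above do not cover: when the pattern contains capturing groups, findall returns the groups rather than the whole match. A minimal sketch with made-up key=value data:

```python
import re

text = 'a=1, b=22, c=333'

no_group = re.findall(r'\w+=\d+', text)       # whole matches
one_group = re.findall(r'(\w+)=\d+', text)    # only the captured group
two_groups = re.findall(r'(\w+)=(\d+)', text) # a tuple per match

print(no_group)
print(one_group)
print(two_groups)
```

To group without capturing, use (?:...), which leaves the findall result as whole matches.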
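finditer
Similar to findall: it finds every substring of the string that the regular expression matches, but returns them as an iterator that can be looped over.
Syntax:
re.finditer(pattern, string, flags=0)
pattern: the regular expression to match
string: the string to match against
flags: modifier flags
Example: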
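import re
matches = re.finditer(r'\d+', '12a23b45c')  # avoid naming this iter, which shadows the built-in
for m in matches:
    print(m.group())
Output (the iterator yields match objects; group() returns each matched string):
12
23
45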
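Because finditer yields match objects rather than strings, positions are available too. A small sketch reusing the string from the example above:

```python
import re

# Collect the matched text together with its (start, end) position.
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', '12a23b45c')]

print(found)
```

This is the main reason to prefer finditer over findall when the location of each match matters.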
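split
The split method splits a string at every substring the pattern matches and returns the pieces as a list.
Syntax:
re.split(pattern, string[, maxsplit=0, flags=0])
pattern: the regular expression to match
string: the string to split
maxsplit: the maximum number of splits; maxsplit=1 splits once; the default 0 means no limit
flags: modifier flags
Example: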
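import re
text = 'abc,abc, abc..'
arr = re.split(r'\W+', text)
print(arr)
Output: the string is split on \W+, i.e. on runs of characters that are not letters, digits, or underscores. The trailing '..' also matches, which produces the final empty string. If the pattern matches nowhere in a string, split returns it unsplit.
['abc', 'abc', 'abc', '']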
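The maxsplit parameter and the no-match case can be checked with a few lines. The capturing-group behavior in the last call is an addition that the notes above do not mention.

```python
import re

text = 'abc,abc, abc..'

once = re.split(r'\W+', text, maxsplit=1)  # stop after one split
unsplit = re.split(r'\d+', 'abc')          # pattern never matches
kept = re.split(r'(,)', 'a,b,c')           # a capturing group keeps the separators

print(once)
print(unsplit)
print(kept)
```

Passing maxsplit by keyword is the safer habit, since it is easy to mistake it for flags when given positionally.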
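Flags
re.I ignore case
re.L make \w, \W, \b, \B, \s, \S depend on the current locale (in Python 3 it is only allowed with bytes patterns)
re.M multi-line mode; changes the behavior of ^ and $
re.S make '.' match every character including a newline (by default '.' does not match a newline)
re.U make \w, \W, \b, \B, \d, \D, \s, \S follow the Unicode character property database (already the default for str patterns in Python 3)
re.X verbose mode: for readability, ignore whitespace in the pattern and comments after '#'
re.I and re.S are the flags most often used when matching web pages.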
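A compact sketch of what the most common flags change. The sample text and the number pattern are made up for illustration.

```python
import re

text = 'one\nTwo\nthree'

plain = re.findall(r'^\w+', text)              # ^ only at the string start
multi = re.findall(r'^\w+', text, re.M)        # ^ at the start of every line
nocase = re.findall(r't\w+', text, re.I)       # 't' also matches 'T'
dot_plain = re.search(r'one.Two', text)        # '.' does not match '\n'
dot_all = re.search(r'one.Two', text, re.S)    # with re.S it does

# re.X lets the pattern carry whitespace and comments.
number = re.compile(r"""
    \d+         # integer part
    (\.\d+)?    # optional fractional part
""", re.X)
is_number = number.fullmatch('3.14') is not None

print(plain, multi, nocase)
print(dot_plain, dot_all is not None, is_number)
```

Flags can be combined with |, for example re.I | re.S.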
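Regex Objects
RegexObject
re.compile() returns a RegexObject (its type is named re.Pattern in Python 3.7+).
MatchObject (its type is named re.Match in Python 3.7+)
group() returns the string matched by the RE.
start() returns the position where the match starts
end() returns the position where the match ends
span() returns a tuple containing the (start, end) positions of the match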
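Regular Expression Syntax
Because regular expressions usually contain backslashes, it is best to write them as raw strings. In a raw string such as r'\t' (equivalent to '\\t'), the backslash reaches the regex engine unchanged, and the pattern element \t then matches the corresponding special character (a tab).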
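The effect of the r prefix can be checked directly. This is a small sketch; the sample strings are made up for illustration.

```python
import re

raw_len = len(r'\t')      # backslash + 't' -> 2 characters
plain_len = len('\t')     # a single tab character -> 1
same = r'\t' == '\\t'     # the raw form equals the doubled-backslash form

# Both spellings match a tab, because the regex engine also understands \t.
tab_match = re.fullmatch(r'\t', '\t') is not None

# \b is where the prefix matters: in a normal string it is a backspace.
with_raw = re.findall(r'\bcat\b', 'cat concat cat')
without_raw = re.findall('\bcat\b', 'cat concat cat')

# Matching one literal backslash needs r'\\'.
slash_at = re.search(r'\\', 'C:\\temp').start()

print(raw_len, plain_len, same, tab_match)
print(with_raw, without_raw, slash_at)
```

Escapes such as \d or \s in a non-raw string also trigger invalid-escape warnings in recent Python versions, which is another reason to use raw strings.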
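Common Matching Rules
Non-printing characters
The \cX control-escape forms listed here come from other regex flavors; Python's re does not accept them.
| Pattern | Description |
|---|---|
| \f | Matches a form feed. Equivalent to \x0c and \cL. |
| \n | Matches a newline. Equivalent to \x0a and \cJ. |
| \r | Matches a carriage return. Equivalent to \x0d and \cM. |
| \s | Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v] (for str patterns in Python 3 it also matches other Unicode whitespace unless re.ASCII is set) |
| \S | Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v] or [^\s] |
| \t | Matches a tab. Equivalent to \x09 and \cI |
| \v | Matches a vertical tab. Equivalent to \x0b and \cK |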
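Special characters
| Pattern | Description |
|---|---|
| $ | Matches the end of the input string |
| () | Marks the start and end of a subexpression, i.e. matches the expression inside the parentheses; also defines a group |
| * | Matches the preceding subexpression zero or more times |
| + | Matches the preceding subexpression one or more times |
| . | Matches any single character except the newline \n |
| [ | Marks the start of a bracket expression |
| ? | Matches the preceding subexpression zero or one time, or marks a quantifier as non-greedy |
| \ | Marks the next character as a special character, a literal character, a backreference, or an octal escape |
| ^ | Matches the start of the input string, unless it is used inside a bracket expression, where it negates the character set |
| { | Marks the start of a quantifier expression |
| \| | Alternation between two items (logical or), e.g. a\|b |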
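Quantifiers
| Pattern | Description |
|---|---|
| * | Any number of times (zero or more) |
| + | At least once |
| ? | Zero or one time |
| {n} | Matches the preceding expression exactly n times; o{2} matches the two o's in food |
| {n,} | Matches at least n times |
| {n,m} | Matches the preceding expression between n and m times, greedily |
| […] | A set of characters, listed individually; [ab] matches a or b, not the sequence ab |
| [^…] | Any character not inside the brackets; [^ab] matches any single character other than a and b |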
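The rows of the table above can be checked one by one with findall. The sample strings, apart from food, are made up for illustration.

```python
import re

exact = re.findall(r'o{2}', 'food')               # exactly two o's
at_least = re.findall(r'\d{2,}', '1 22 333')      # two or more digits
ranged = re.findall(r'\d{2,3}', '12345')          # greedy: takes 3 first
optional = re.findall(r'colou?r', 'color colour') # u is optional
in_set = re.findall(r'[ab]', 'cab')               # a or b, one at a time
not_in_set = re.findall(r'[^ab]', 'cab')          # anything except a and b

print(exact, at_least, ranged)
print(optional, in_set, not_in_set)
```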
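Metacharacters
| Pattern | Description |
|---|---|
| \w | Matches letters, digits, and the underscore |
| \W | Matches any character that is not a letter, digit, or underscore |
| \d | Matches any digit; equivalent to [0-9] for ASCII text (for str patterns in Python 3 it also matches other Unicode digits unless re.ASCII is set) |
| \D | Matches any non-digit character |
| \A | Matches the start of the string |
| \Z | Matches the end of the string. In Perl-style engines it also matches just before a final newline; in Python's re it matches only at the very end |
| \z | Matches only at the very end of the string, newline included (Perl-style syntax; older versions of Python's re do not accept it) |
| \G | Matches at the position where the previous match ended (Perl-style; not supported by Python's re) |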
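A sketch of how these classes and anchors behave in Python's re specifically, since the table mixes in behavior from other engines. The sample strings are made up for illustration.

```python
import re

words = re.findall(r'\w+', 'a_1 b-2')      # '-' and ' ' are not \w
non_digits = re.findall(r'\D+', 'ab12cd')

two_lines = 'abc\nabc'
caret_m = re.findall(r'^abc', two_lines, re.M)    # ^ matches at each line
anchor_a = re.findall(r'\Aabc', two_lines, re.M)  # \A ignores re.M

# With a trailing newline, $ still matches before it; \Z does not in Python.
dollar = re.search(r'c$', 'abc\n')
big_z = re.search(r'c\Z', 'abc\n')

print(words, non_digits)
print(caret_m, anchor_a)
print(dollar is not None, big_z)
```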
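Unicode category matching
Note: Python's built-in re module does not support \p{...}. The material here applies to engines such as PCRE, Java, and .NET, and the English text appears to be quoted from the first page linked in this section.
Example: invisible control characters and unused code points
Abbreviation: C; long form: Other
\p{C} or \p{Other}: invisible control characters and unused code points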
https://www.regular-expressions.info/unicode.html#prop
http://www.unicode.org/reports/tr18/
In addition to complications, Unicode also brings new possibilities. One is that each Unicode character belongs to a certain category. You can match a single character belonging to the “letter” category with \p{L}. You can match a single character not belonging to that category with \P{L}.
Again, “character” really means “Unicode code point”. \p{L} matches a single code point in the category “letter”. If your input string is à encoded as U+0061 U+0300, it matches a without the accent. If the input is à encoded as U+00E0, it matches à with the accent. The reason is that both the code points U+0061 (a) and U+00E0 (à) are in the category “letter”, while U+0300 is in the category “mark”.
You should now understand why \P{M}\p{M}*+ is the equivalent of \X. \P{M} matches a code point that is not a combining mark, while \p{M}*+ matches zero or more code points that are combining marks. To match a letter including any diacritics, use \p{L}\p{M}*+. This last regex will always match à, regardless of how it is encoded. The possessive quantifier makes sure that backtracking doesn’t cause \P{M}\p{M}*+ to match a non-mark without the combining marks that follow it, which \X would never do.
PCRE, PHP, and .NET are case sensitive when it checks the part between curly braces of a \p token. \p{Zs} will match any kind of space character, while \p{zs} will throw an error. All other regex engines described in this tutorial will match the space in both cases, ignoring the case of the category between the curly braces. Still, I recommend you make a habit of using the same uppercase and lowercase combination as I did in the list of properties below. This will make your regular expressions work with all Unicode regex engines.
In addition to the standard notation, \p{L}, Java, Perl, PCRE, the JGsoft engine, and XRegExp 3 allow you to use the shorthand \pL. The shorthand only works with single-letter Unicode properties. \pLl is not the equivalent of \p{Ll}. It is the equivalent of \p{L}l which matches Al or àl or any Unicode letter followed by a literal l.
Perl, XRegExp, and the JGsoft engine also support the longhand \p{Letter}. You can find a complete list of all Unicode properties below. You may omit the underscores or use hyphens or spaces instead.
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
···
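Since Python's built-in re does not accept \p{...}, one alternative is to check categories with the standard unicodedata module instead of a regex. This is a swapped-in technique, not something the quoted text describes, and the helper name in_category is hypothetical.

```python
import unicodedata

def in_category(ch, prefix):
    """Return True if the code point's general category starts with prefix.

    Hypothetical helper: 'L' behaves roughly like \\p{L}, 'Ll' like \\p{Ll}.
    """
    return unicodedata.category(ch).startswith(prefix)

decomposed = 'a\u0300'   # 'a' followed by a combining grave accent
precomposed = '\u00e0'   # the single code point for a-grave

decomposed_cats = [unicodedata.category(c) for c in decomposed]
precomposed_cat = unicodedata.category(precomposed)
letters = [c for c in 'a1_ \u00e9' if in_category(c, 'L')]

print(decomposed_cats)   # letter followed by a non-spacing mark
print(precomposed_cat)
print(letters)
```

This mirrors the à example in the quoted text: the base letter and the precomposed letter are both in a letter category, while U+0300 is a mark.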
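Greedy vs. Non-Greedy
.*: matches any number of characters (greedy)
.*?: matches any number of characters, non-greedily
Quantifiers in Python's re are greedy by default: they always try to match as many characters as possible. Non-greedy matching does the opposite and tries to match as few characters as possible. Adding ? after "*", "?", "+", or "{m,n}" turns a greedy quantifier into a non-greedy one.
Example: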
import re
content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group())
print(result.groups())
print(result.span())
Output:
<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
Hello 1234567 World_This is a Regex Demo
('7',)
(0, 40)
Only 7 is captured.
Under greedy matching, .* matches as many characters as possible. In the pattern it is followed by \d+, which needs at least one digit, so .* consumes as much as it can, including 123456, and leaves only the 7 for \d+.
Non-greedy matching uses .*?, with one extra ?
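content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result.group())
print(result.groups())
print(result.span())
Output:
Hello 1234567 World_This is a Regex Demo
('1234567',)
(0, 40)
This time 1234567 is captured.
So greedy matching takes as many characters as possible, while non-greedy matching takes as few as possible.
For that reason, prefer non-greedy matching for the middle of a string.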
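The same contrast shows up with the other quantifiers. A minimal sketch on made-up tag-like text:

```python
import re

tags = '<a><b>'

greedy = re.findall(r'<.*>', tags)    # .* runs to the last '>'
lazy = re.findall(r'<.*?>', tags)     # .*? stops at the first '>'

greedy_digits = re.match(r'\d+', '12345').group()
lazy_digits = re.match(r'\d+?', '12345').group()  # one digit is enough

print(greedy, lazy)
print(greedy_digits, lazy_digits)
```

Note that a non-greedy quantifier at the very end of a pattern matches as little as it can, which may be nothing at all.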
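Common Patterns
Match a URL:
[a-zA-Z]*://[^\s]*
[a-zA-Z]* matches any number of letters
[^\s]* matches any number of non-whitespace characters
Match an email address:
[\w!#$%&'*+/=?^_`{|}~-]+(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*@(?:[\w](?:[\w-]*[\w])?\.)+[\w](?:[\w-]*[\w])?
Match strings of the form 1,232,123:
(\d+\,?)+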
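A quick way to try these patterns is to compile them and run them on sample input. The sample text and addresses are made up for illustration, and the URL pattern is written with A-Z as the upper bound of its character range.

```python
import re

url_re = re.compile(r'[a-zA-Z]*://[^\s]*')
email_re = re.compile(
    r"[\w!#$%&'*+/=?^_`{|}~-]+(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[\w](?:[\w-]*[\w])?\.)+[\w](?:[\w-]*[\w])?"
)
number_re = re.compile(r'(\d+,?)+')

text = 'Docs at https://docs.python.org/3/library/re.html and ftp://example.com/file'
urls = url_re.findall(text)

email_ok = email_re.fullmatch('user.name@example.com') is not None
email_bad = email_re.fullmatch('user.name@') is not None
number_ok = number_re.fullmatch('1,232,123') is not None
number_bad = number_re.fullmatch('1,,2') is not None

print(urls)
print(email_ok, email_bad, number_ok, number_bad)
```

These patterns are loose by design: the number pattern, for example, also accepts 12,3 because it does not check the size of each digit group.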