Notes on Python 3's regular expression module re

License: CC 4.0 BY-SA


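This article walks through practical regular expression techniques: the basic functions of the re module, its flags, objects and syntax, with examples of matching, searching, replacing and splitting strings. It is aimed at beginners through intermediate readers.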

Regular expressions with re

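Basic functions

match

re.match tries to match a pattern at the start of the string. If the pattern does not match at position 0, it returns None (for example, a pattern for abc only succeeds when the string begins with abc).
It is typically used to check whether a string conforms to a regular expression. Note that it only anchors the start; to require that the whole string conforms, use re.fullmatch or end the pattern with $.

re.match(pattern, string, flags=0)
    pattern: the regular expression to match
    string: the string to match against
    flags: flags that control how matching is done, such as re.I for case-insensitive matching or re.M for multi-line matching (which affects ^ and $)
On success it returns a match object.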

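The match object (re.Match, shown as SRE_Match before Python 3.7): reading matched text with group

Call group on the match object to retrieve a specific part of the match.
group(num=0): returns the string matched by the whole expression. group() also accepts several group numbers at once, in which case it returns a tuple of the corresponding values.
    (?P<name>...) combined with group('name') gives a group a custom name
groups(): returns a tuple of all subgroup strings, from group 1 up to the last group
Both are normally used together with parentheses ().
For example, with www.baidu.com:
re.match(r'(w)\w{2}(\.)', 'www.baidu.com').groups() returns ('w', '.')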
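
The group and groups calls described above can be sketched as follows. The sample address and the group names user and domain are made up for illustration and are not from the original notes.

```python
import re

# Hypothetical sample input; "user" and "domain" are illustrative group names.
m = re.match(r'(?P<user>\w+)@(?P<domain>\w+)\.com', 'alice@example.com')

whole = m.group()            # entire match, same as m.group(0)
first = m.group(1)           # first parenthesised group
pair = m.group(1, 2)         # several group numbers give a tuple
by_name = m.group('domain')  # look a group up by its name
all_groups = m.groups()      # every group, starting from group 1
named = m.groupdict()        # named groups as a dict

print(whole, first, pair, by_name, all_groups, named)
```

groupdict() is not mentioned in the notes above; it is included because it pairs naturally with (?P<name>...).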

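The match object: reading the matched range with span

Returns the matched position as a tuple such as (0, 3): the match starts at index 0 and ends just before index 3.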

Example:

import re
content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^Hello\s(\d+)\sWorld', content)
print(result)
print(result.group())
print(result.group(1)) # the first group enclosed in ()
print(result.span())
Output (Python 3.7+; older versions show _sre.SRE_Match instead of re.Match):
<re.Match object; span=(0, 19), match='Hello 1234567 World'>
Hello 1234567 World
1234567  # the part captured by ()
(0, 19)  # 'Hello 1234567 World' runs from index 0 up to, but not including, index 19

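search

re.search scans the whole string and returns the first successful match.
Syntax:

re.search(pattern, string, flags=0)
    pattern: the regular expression to match
    string: the string to search
    flags: flags, same as for match

On success re.search returns a match object; otherwise it returns None.
Use the match object's group(num) or groups() methods to read what was matched.

group(num=0): returns the string matched by the whole expression. group() also accepts several group numbers at once, in which case it returns a tuple of the corresponding values.
groups(): returns a tuple of all subgroup strings, from group 1 up to the last group, e.g. ('123456', 'World_this')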
Example:

import re
# the two adjacent string literals are joined without a newline
content = 'first line' \
          'Hello 123456 World_this is a Regex Demo'
result = re.search(r'Hello\s(\d+)\s(\w+)', content)
print(result)
print(result.group())
print(result.group(1))
print(result.groups())
Output:
<re.Match object; span=(10, 33), match='Hello 123456 World_this'>
Hello 123456 World_this
123456
('123456', 'World_this')

Example 2:

import re
line = 'Cats are smarter than dogs'
searchObj = re.search(r'(.*) are (.*?) .*', line, re.M|re.I) # flags: multi-line; case-insensitive
if searchObj:
    print(searchObj.groups())
    print(searchObj.group())
else:
    print('Nothing found!')
Output:
('Cats', 'smarter')
Cats are smarter than dogs

If .*? is replaced with .*, the second group becomes greedy and captures 'smarter than'.

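Difference between re.search and re.match:
re.match only matches at the beginning of the string; if the beginning does not fit the regular expression, the match fails and the function returns None. re.search scans the whole string until it finds a match.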
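
A small sketch of that difference, using a made-up input string in which the pattern only occurs after a prefix:

```python
import re

text = 'say Hello 123'        # hypothetical input
pattern = r'Hello\s\d+'

at_start = re.match(pattern, text)         # None: text does not start with "Hello"
anywhere = re.search(pattern, text)        # finds the first occurrence
anchored = re.search('^' + pattern, text)  # with '^', search is restricted like match

print(at_start)
print(anywhere.group(), anywhere.span())
print(anchored)
```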

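Search and replace with sub

sub replaces the matches found in a string; it is often used to clean up data before further regex matching.
Syntax:

re.sub(pattern, repl, string, count=0, flags=0)
    pattern: the regular expression pattern
    repl: the replacement string, or a function
    string: the original string to search and replace in
    count: maximum number of replacements; the default 0 replaces every match
    flags: matching flags such as re.I or re.M (integer constants internally)
    The first three arguments are required; the last two are optional.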

Example:

import re

phone = '2004-959-559 # phone number'
# remove the trailing comment
num = re.sub(r'#.*$', '', phone)
print(num)
# remove everything that is not a digit
num = re.sub(r'\D', '', phone)
print(num)
Output:
2004-959-559 
2004959559

When repl is a function:

import re

# double every matched number
def double(matched):
    value = int(matched.group('value'))
    return str(value * 2)

s = '1A2B3C32G12F'
print(re.sub(r'(?P<value>\d+)', double, s))  # prints 2A4B6C64G24F
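
The count parameter and group references inside repl are described above but not demonstrated. A minimal sketch with a made-up date string; re.subn is a related standard function that the notes do not cover:

```python
import re

s = '2024-09-03 and 2023-01-15'   # hypothetical input

# count=1 replaces only the first match
once = re.sub(r'\d{4}', 'YYYY', s, count=1)

# \1, \2, \3 in repl refer to the captured groups
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', s)

# subn returns (new_string, number_of_replacements)
masked, n = re.subn(r'\d+', '#', s)

print(once)
print(swapped)
print(masked, n)
```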

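compile

compile compiles a regular expression into a Pattern object, whose methods such as match() and search() can then be reused.
Syntax:

re.compile(pattern[, flags])
    pattern: a regular expression given as a string
    flags: optional matching flags
        re.I ignore case
        re.L make \w, \W, \b, \B and case-insensitive matching depend on the current locale (bytes patterns only in Python 3)
        re.M multi-line mode
        re.S make '.' match any character including newline (by default '.' does not match newline)
        re.U make \w, \W, \b, \B, \d, \D, \s, \S follow the Unicode character database (already the default for str patterns in Python 3)
        re.X verbose mode: ignore whitespace and comments after '#' in the pattern to improve readability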

Example:

import re

# match one or more digits
pattern = re.compile(r'\d+')
m = pattern.match('1abc12d3ef45')  # matches '1' at the start of the string

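The Pattern object:

match(string[, pos[, endpos]]):
    matches at the start of string, or only within the range from index pos up to (not including) index endpos
    the returned match object offers:
        group([group1, …]) returns the strings matched by one or more groups; to get the whole matched substring, call group() or group(0);
        start([group]) returns the start position of the group's substring within the whole string (the index of its first character); the argument defaults to 0;
        end([group]) returns the end position of the group's substring within the whole string (the index of its last character + 1); the argument defaults to 0;
        span([group]) returns (start(group), end(group)).
search(string[, pos[, endpos]])

sub(repl, string, count=0)

findall(string[, pos[, endpos]])

finditer(string[, pos[, endpos]])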
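
To make pos and endpos concrete, here is a short sketch that reuses the string from the compile example. The variable names are illustrative.

```python
import re

pattern = re.compile(r'\d+')
s = '1abc12d3ef45'

m0 = pattern.match(s)         # matches '1' at index 0
m1 = pattern.match(s, 1)      # index 1 is 'a', so this is None
m2 = pattern.match(s, 4, 10)  # matching starts at index 4 and finds '12'
m3 = pattern.search(s, 6)     # first match at or after index 6 is '3'
replaced = pattern.sub('#', s)

print(m0.group(), m1, m2.group(), m2.span(), m3.span(), replaced)
```

Note that with pos the match is still reported in coordinates of the full string, which is why m2.span() is (4, 6).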

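findall

Finds every substring that matches the regular expression and returns them as a list; if nothing matches, it returns an empty list.
Note: match and search find a single match, while findall finds all of them.
Syntax:

re.findall(pattern, string, flags=0)
pattern.findall(string[, pos[, endpos]])
    string: the string to search
    pos: optional, index at which the search starts, default 0 (compiled pattern method only)
    endpos: optional, index at which the search ends, default the length of the string (compiled pattern method only)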

Example:

import re

pattern = re.compile(r'\d+')
result1 = pattern.findall('abc 123 bcd 456')
result2 = pattern.findall('a123bb456c5c789', 0, 10)
print(result1)
print(result2)
Output:
['123', '456']
['123', '456']

For a compiled pattern:

findall(string, pos, endpos): the search range includes index pos and excludes index endpos
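
One detail the notes above do not mention is that the shape of findall's result depends on how many groups the pattern contains. A small sketch with made-up input:

```python
import re

text = 'width=100, height=200'   # hypothetical input

no_group = re.findall(r'\w+=\d+', text)        # no groups: whole matches
one_group = re.findall(r'\w+=(\d+)', text)     # one group: only that group
two_groups = re.findall(r'(\w+)=(\d+)', text)  # several groups: tuples

print(no_group)
print(one_group)
print(two_groups)
```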

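finditer

Similar to findall: it finds every substring that matches the regular expression, but returns them as an iterator that can be looped over.
Syntax:

re.finditer(pattern, string, flags=0)
    pattern: the regular expression to match
    string: the string to search
    flags: flags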

Example:

import re

matches = re.finditer(r'\d+', '12a23b45c')
for num in matches:
    print(num.group())
Output (the iterator yields match objects; group() reads each matched string):
12
23
45
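
Because finditer yields match objects rather than strings, positions are available as well. A minimal sketch on the same input:

```python
import re

# Collect each matched string together with its (start, end) span
positions = [(m.group(), m.span()) for m in re.finditer(r'\d+', '12a23b45c')]
print(positions)
```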

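split

split cuts the string at every substring that matches the pattern and returns the pieces as a list.

Syntax:

re.split(pattern, string[, maxsplit=0, flags=0])
    pattern: the regular expression to match
    string: the string to split
    maxsplit: maximum number of splits; maxsplit=1 splits once; the default 0 means no limit
    flags: flags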

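Example:

import re

text = 'abc,abc, abc..'
arr = re.split(r'\W+', text)
print(arr)
Output: the string is split wherever \W+ matches, that is, on runs of characters that are not letters, digits or underscores. The trailing '..' also matches, which leaves an empty string at the end. A string in which the pattern never matches is returned unsplit.
['abc', 'abc', 'abc', '']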
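
The maxsplit parameter and the no-match case can be sketched on the same text. The capturing-group variant is an extra that the notes do not cover: when the pattern contains a group, the separators are kept in the result.

```python
import re

text = 'abc,abc, abc..'

limited = re.split(r'\W+', text, maxsplit=1)  # split only once
with_seps = re.split(r'(\W+)', text)          # a capturing group keeps the separators
unchanged = re.split(r'\d+', text)            # no digits, so nothing to split on

print(limited)
print(with_seps)
print(unchanged)
```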

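Flags

re.I ignore case
re.L make \w, \W, \b, \B and case-insensitive matching depend on the current locale (bytes patterns only in Python 3)
re.M multi-line mode: ^ and $ also match at the start and end of each line
re.S make '.' match any character including newline (by default '.' does not match newline)
re.U make \w, \W, \b, \B, \d, \D, \s, \S follow the Unicode character database (already the default for str patterns in Python 3)
re.X verbose mode: ignore whitespace and comments after '#' in the pattern to improve readability

re.I and re.S are the flags most often used when matching web pages.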
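
A short sketch of what these flags change in practice. The HTML snippet, the two-line text and the phone-like number are all made up for illustration.

```python
import re

html = '<DIV>line one\nline two</div>'   # hypothetical page fragment

# Without re.S, '.' stops at the newline, so the pattern cannot reach </div>
plain = re.search(r'<div>(.*)</div>', html, re.I)

# re.I ignores case, re.S lets '.' cross the newline
both = re.search(r'<div>(.*)</div>', html, re.I | re.S)

# re.M makes ^ match at the start of every line
starts = re.findall(r'^\w+', 'alpha 1\nbeta 2', re.M)

# re.X allows whitespace and comments inside the pattern
verbose = re.compile(r'''
    (\d{3})   # first block of digits
    -
    (\d{4})   # second block of digits
''', re.X)
parts = verbose.match('555-1234').groups()

print(plain, both.group(1), starts, parts)
```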

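Regex objects

Pattern (RegexObject)

re.compile() returns a Pattern object (called RegexObject in older documentation).

Match (MatchObject)

group() returns the string matched by the RE.

start() returns the position where the match starts
end() returns the position where the match ends
span() returns a tuple with the (start, end) positions of the match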
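
A minimal sketch of both object types, assuming Python 3.8 or later, where the classes are exposed as re.Pattern and re.Match. The input string is made up.

```python
import re

pattern = re.compile(r'(\d+)-(\d+)')
m = pattern.search('tel: 959-559')   # hypothetical input

is_pattern = isinstance(pattern, re.Pattern)
is_match = isinstance(m, re.Match)

whole_span = m.span()    # span of the entire match
first_span = m.span(1)   # span of group 1
second = (m.start(2), m.end(2))

print(is_pattern, is_match, whole_span, first_span, second)
```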

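Regular expression syntax

Regular expressions usually contain backslashes, so it is best to write them as raw strings. In a raw string such as r'\t' the backslash is kept as written (r'\t' equals '\\t'), and the regex engine then interprets \t as the pattern element that matches a tab character.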
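
Why raw strings matter is easiest to see with \b, which means backspace in an ordinary Python string but word boundary in a regular expression. A small sketch:

```python
import re

# In an ordinary string '\b' is the backspace character, so this pattern never matches
plain = re.search('\bcat\b', 'a cat')

# In a raw string the backslash reaches the regex engine, which reads \b as a word boundary
raw = re.search(r'\bcat\b', 'a cat')

print(len('\t'), len(r'\t'))   # 1 2
print(r'\t' == '\\t')          # True
print(plain)                   # None
print(raw.group())             # cat
```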

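Common matching rules

Non-printing characters

Pattern	Description
\f	Matches a form feed. Equivalent to \x0c.
\n	Matches a newline. Equivalent to \x0a.
\r	Matches a carriage return. Equivalent to \x0d.
\s	Matches any whitespace character, including space, tab, form feed and so on. Equivalent to [ \f\n\r\t\v] for ASCII text
\S	Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v] or [^\s]
\t	Matches a tab. Equivalent to \x09
\v	Matches a vertical tab. Equivalent to \x0b
The control-character forms \cL, \cJ, \cM, \cI and \cK used by some other engines for these characters are not supported by Python's re.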

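Special characters

Pattern	Description
$	Matches the end of the input string
()	Marks the start and end of a subexpression; the expression inside the parentheses is matched and also forms a group
*	Matches the preceding subexpression zero or more times
+	Matches the preceding subexpression one or more times
.	Matches any single character except the newline \n
[	Marks the start of a bracket expression
?	Matches the preceding subexpression zero or one time, or marks a quantifier as non-greedy
\	Marks the next character as a special character, a literal character, a backreference or an octal escape
^	Matches the start of the input string; inside a bracket expression it negates the character set instead
{	Marks the start of a quantifier expression
|	Alternation: matches either of two alternatives, e.g. a|b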

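Quantifiers and character sets

Pattern	Description
*	any number of times (zero or more)
+	at least once
?	zero or one time
{n}	matches the preceding expression exactly n times; o{2} matches the two o's in food
{n,}	at least n times
{n,m}	matches the preceding expression between n and m times, greedily
[…]	a set of characters listed individually; [ab] matches a or b, not ab
[^…]	any character not in the set; [^ab] matches any single character other than a and b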
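
The rows of this table can be checked quickly with findall. The inputs below are made up for illustration.

```python
import re

exact = re.findall(r'o{2}', 'food')           # exactly two o's
at_least = re.findall(r'\d{2,}', '1 22 333')  # two or more digits
greedy = re.findall(r'\d{1,2}', '12345')      # takes two digits when it can
lazy = re.findall(r'\d{1,2}?', '12345')       # takes one digit at a time
in_set = re.findall(r'[ab]', 'cab')           # a or b
not_in_set = re.findall(r'[^ab]', 'cabd')     # anything except a and b

print(exact, at_least, greedy, lazy, in_set, not_in_set)
```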

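Metacharacters

Pattern	Description
\w	Matches letters, digits and underscore
\W	Matches any character that is not a letter, digit or underscore
\d	Matches any digit; equivalent to [0-9] for bytes patterns or with re.ASCII (for str patterns it also matches other Unicode decimal digits)
\D	Matches any non-digit character
\A	Matches the start of the string
\Z	Matches the end of the string. In Python's re it matches only at the very end; matching just before a trailing newline is the behaviour of $ (and of \Z in Perl-style engines)
\z	Perl-style anchor for the very end of the string; older Python 3 versions of re reject it, so use \Z in Python
\G	Matches at the position where the previous match ended; supported by Perl-style engines, not by Python's re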
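
Two of these rows behave differently in Python than the classic descriptions suggest. A minimal sketch, using the Devanagari digit two (U+0968) as an arbitrary example of a non-ASCII digit:

```python
import re

mixed = '1' + '\u0968' + '3'   # '\u0968' is DEVANAGARI DIGIT TWO

unicode_digits = re.findall(r'\d', mixed)          # str patterns: Unicode digits count
ascii_digits = re.findall(r'\d', mixed, re.ASCII)  # re.ASCII restricts \d to [0-9]

end_Z = re.search(r'abc\Z', 'abc\n')       # \Z needs the very end, so this is None
end_dollar = re.search(r'abc$', 'abc\n')   # $ also matches before a trailing newline

print(unicode_digits, ascii_digits, end_Z, end_dollar.group())
```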

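Unicode category matching

The English notes below follow the first reference listed here. Python's built-in re module does not support \p{...}; the third-party regex module does.

Invisible control characters and unused code points are abbreviated C, long form Other: \p{C} or \p{Other}.
https://www.regular-expressions.info/unicode.html#prop
http://www.unicode.org/reports/tr18/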

In addition to complications, Unicode also brings new possibilities. One is that each Unicode character belongs to a certain category. You can match a single character belonging to the “letter” category with \p{L}. You can match a single character not belonging to that category with \P{L}.

Again, “character” really means “Unicode code point”. \p{L} matches a single code point in the category “letter”. If your input string is à encoded as U+0061 U+0300, it matches a without the accent. If the input is à encoded as U+00E0, it matches à with the accent. The reason is that both the code points U+0061 (a) and U+00E0 (à) are in the category “letter”, while U+0300 is in the category “mark”.

You should now understand why \P{M}\p{M}*+ is the equivalent of \X. \P{M} matches a code point that is not a combining mark, while \p{M}*+ matches zero or more code points that are combining marks. To match a letter including any diacritics, use \p{L}\p{M}*+. This last regex will always match à, regardless of how it is encoded. The possessive quantifier makes sure that backtracking doesn’t cause \P{M}\p{M}*+ to match a non-mark without the combining marks that follow it, which \X would never do.

PCRE, PHP, and .NET are case sensitive when it checks the part between curly braces of a \p token. \p{Zs} will match any kind of space character, while \p{zs} will throw an error. All other regex engines described in this tutorial will match the space in both cases, ignoring the case of the category between the curly braces. Still, I recommend you make a habit of using the same uppercase and lowercase combination as I did in the list of properties below. This will make your regular expressions work with all Unicode regex engines.

In addition to the standard notation, \p{L}, Java, Perl, PCRE, the JGsoft engine, and XRegExp 3 allow you to use the shorthand \pL. The shorthand only works with single-letter Unicode properties. \pLl is not the equivalent of \p{Ll}. It is the equivalent of \p{L}l which matches Al or àl or any Unicode letter followed by a literal l.

Perl, XRegExp, and the JGsoft engine also support the longhand \p{Letter}. You can find a complete list of all Unicode properties below. You may omit the underscores or use hyphens or spaces instead.

\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
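
Since the built-in re module has no \p{...} syntax, one way to experiment with these categories using only the standard library is unicodedata.category. This is a sketch of a workaround, not an equivalent of \p{...} inside a pattern; the helper name has_category is hypothetical.

```python
import unicodedata

def has_category(ch, prefix):
    """Return True if ch's general category starts with prefix (e.g. 'L' or 'Mn')."""
    return unicodedata.category(ch).startswith(prefix)

# 'a' + combining grave accent, space, precomposed a-grave, space, digit
text = 'a\u0300 \u00e0 1'

categories = [unicodedata.category(ch) for ch in text]
letters = [ch for ch in text if has_category(ch, 'L')]  # roughly \p{L}
marks = [ch for ch in text if has_category(ch, 'M')]    # roughly \p{M}

print(categories)
print(letters, marks)
```

This mirrors the point made in the quoted text: U+0061 and U+00E0 are both letters, while U+0300 is a mark.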

···

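Greedy and non-greedy matching

.*: matches any number of characters, greedily
.*?: matches any number of characters, non-greedily

    Quantifiers in Python's re are greedy by default and try to match as many characters as possible; non-greedy quantifiers do the opposite and try to match as few as possible. Appending ? to "*", "?", "+" or "{m,n}" turns a greedy quantifier into a non-greedy one.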

Example:

import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group())
print(result.groups())
print(result.span())
Output:
<re.Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
Hello 1234567 World_This is a Regex Demo
('7',)
(0, 40)
The group captures only 7.

In greedy mode, .* matches as many characters as possible. Here .* is followed by \d+, which needs at least one digit, so .* consumes everything up to and including 123456 and leaves only 7 for \d+.

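The non-greedy form is .*?, with one extra ?

import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result.group())
print(result.groups())
print(result.span())
Output:
Hello 1234567 World_This is a Regex Demo
('1234567',)
(0, 40)
This time the group captures 1234567.

In short, greedy matching takes as many characters as possible and non-greedy matching takes as few as possible.
So prefer non-greedy matching in the middle of a pattern. At the very end of a pattern .*? can match nothing at all, so use .* there.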
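
The same effect shows up when extracting repeated fragments, which is where non-greedy matching is most often needed. A small sketch with a made-up snippet of markup:

```python
import re

html = '<b>one</b><b>two</b>'   # hypothetical input

greedy = re.findall(r'<b>(.*)</b>', html)   # runs to the last </b>
lazy = re.findall(r'<b>(.*?)</b>', html)    # stops at the nearest </b>

print(greedy)
print(lazy)
```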

Common patterns

Matching a URL:

[a-zA-Z]*://[^\s]*
[a-zA-Z]* matches any number of letters
[^\s]* matches any number of non-whitespace characters
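
A quick check of the URL pattern on a made-up sentence. The sketch uses + instead of * for the scheme so that an empty scheme is not accepted; that is a choice made here, not part of the original pattern. It also shows why the range must be A-Z: the range A-z additionally covers the punctuation between Z and a in ASCII.

```python
import re

url_re = re.compile(r'[a-zA-Z]+://[^\s]*')

text = 'see https://www.baidu.com/s?wd=re and ftp://host/file.txt now'  # hypothetical
urls = url_re.findall(text)

# The range A-z also matches '[', '^' and '_', which are not letters
wrong_range = re.findall(r'[a-zA-z]', '[_^')
right_range = re.findall(r'[a-zA-Z]', '[_^')

print(urls)
print(wrong_range, right_range)
```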

Matching an email address:

[\w!#$%&'*+/=?^_`{|}~-]+(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*@(?:[\w](?:[\w-]*[\w])?\.)+[\w](?:[\w-]*[\w])?

Matching strings such as 1,232,123:

(\d+\,?)+
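
The pattern above is quite permissive. The sketch below compares it with a stricter alternative that requires groups of exactly three digits after each comma; the stricter pattern is a suggestion, not from the original notes, and the escaped comma is written as a plain comma because the backslash is not needed.

```python
import re

loose = re.compile(r'(\d+,?)+')
strict = re.compile(r'\d{1,3}(?:,\d{3})*')   # assumed stricter variant

good = '1,232,123'
odd = '12,3,'   # badly grouped, with a trailing comma

loose_good = loose.fullmatch(good) is not None
loose_odd = loose.fullmatch(odd) is not None
strict_good = strict.fullmatch(good) is not None
strict_odd = strict.fullmatch(odd) is not None

print(loose_good, loose_odd, strict_good, strict_odd)
```

fullmatch is used so that the whole string has to conform, rather than just a prefix of it.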