Automate the Boring Stuff with Python: Pattern Matching and Regular Expressions

Article link: https://blog.youkuaiyun.com/ZBDX2113/article/details/124070440

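Basic operations:

1. Finding text patterns with regular expressions

A regular expression, or regex for short, is a way of describing a text pattern.

First, create a regular expression object: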

import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

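All of Python's regular expression functions live in the re module. Passing a pattern string to re.compile() returns a Regex pattern object.

Next, match a string against the Regex object.

The Regex object's search() method scans the string it is given and returns a Match object for the first match of the pattern, or None if the pattern is not found. Calling group() on the Match object returns the text that was actually matched.

>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
>>> mo = phoneNumRegex.search('My number is 415-555-4242.')
>>> print('Phone number found: ' + mo.group())
Phone number found: 415-555-4242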
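
Because search() returns None when nothing matches, calling group() on the result without checking it raises an AttributeError. The following is a minimal sketch of guarding against that; the helper name find_phone_number is hypothetical and not part of the original article.

```python
import re

phone_num_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone_number(text):
    """Return the first phone number in text, or None if there is none."""
    mo = phone_num_regex.search(text)
    if mo is None:
        # No match: there is no Match object to call group() on.
        return None
    return mo.group()

print(find_phone_number('My number is 415-555-4242.'))  # 415-555-4242
print(find_phone_number('No number here.'))             # None
```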

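2. Grouping with parentheses

Adding parentheses to the raw string of a regex creates groups: the first pair of parentheses is group 1, the second pair is group 2, and so on. Passing a group number to the group() method used earlier returns just that part of the matched text. Passing 0 or nothing returns the entire match.

>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> mo = phoneNumRegex.search('My number is 415-555-4242.')
>>> mo.group(1)
'415'
>>> mo.group(2)
'555-4242'
>>> mo.group(0)
'415-555-4242'
>>> mo.group()
'415-555-4242'

To retrieve all the groups at once, use the groups() method, which returns them as a tuple.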
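
As a short illustration of groups(), the sketch below reuses the two-group phone number pattern from above; the variable names area_code and main_number are chosen here for illustration.

```python
import re

phone_num_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_num_regex.search('My number is 415-555-4242.')

# groups() returns a tuple with one entry per group.
print(mo.groups())  # ('415', '555-4242')

# Tuple unpacking assigns each group to its own variable.
area_code, main_number = mo.groups()
print(area_code)    # 415
print(main_number)  # 555-4242
```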

Parentheses have a special meaning in a regex, so to match a literal parenthesis in the text, escape it with a backslash.

>>> phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
>>> mo = phoneNumRegex.search('My phone number is (415) 555-4242.')
>>> mo.group(1)
'(415)'
>>> mo.group(2)
'555-4242'

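3. Matching multiple groups with the pipe

The | character is called a pipe. Use it when you want to match one of several expressions. If more than one of them occurs in the searched string, the Match object holds whichever occurrence comes first.

>>> heroRegex = re.compile(r'Batman|Tina Fey')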
>>> mo1 = heroRegex.search('Batman and Tina Fey.')
>>> mo1.group()
'Batman'
>>> mo2 = heroRegex.search('Tina Fey and Batman.')
>>> mo2.group()
'Tina Fey'

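You can also use the pipe inside a group to match one of several alternatives that share a common prefix. group() returns the full match, and group(1) returns only the alternative that matched inside the parentheses.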

>>> batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
>>> mo = batRegex.search('Batmobile lost a wheel')
>>> mo.group()
'Batmobile'
>>> mo.group(1)
'mobile'

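4. Optional matching with the question mark

The ? character marks the group before it as optional: the regex matches whether or not that text is present.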

>>> batRegex = re.compile(r'Bat(wo)?man')
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'
>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'
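
The same idea applies to the phone number pattern from earlier. The sketch below assumes a pattern in which the area code, together with its trailing hyphen, is optional; the variable names are illustrative.

```python
import re

# The area code group, including its trailing hyphen, is optional.
phone_regex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

mo1 = phone_regex.search('My number is 415-555-4242')
print(mo1.group())  # 415-555-4242

mo2 = phone_regex.search('My number is 555-4242')
print(mo2.group())  # 555-4242
```

When the optional group does not take part in the match, as in the second search, mo2.group(1) is None.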

5. Matching zero or more with the star

The group before the star can occur zero times or any number of times.

>>> batRegex = re.compile(r'Bat(wo)*man')
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'
>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'
>>> mo3 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo3.group()
'Batwowowowoman'

6. Matching one or more with the plus

The group before the plus must occur at least once; it is not optional.

>>> batRegex = re.compile(r'Bat(wo)+man')
>>> mo1 = batRegex.search('The Adventures of Batwoman')
>>> mo1.group()
'Batwoman'
>>> mo2 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo2.group()
'Batwowowowoman'
>>> mo3 = batRegex.search('The Adventures of Batman')
>>> mo3 == None
True

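7. Matching specific repetitions with curly brackets

The number in curly brackets is how many times the preceding group must repeat. It can be a single number or a range, where the first number is the minimum and the second is the maximum.

(Ha){3} matches the same text as
(Ha)(Ha)(Ha)


(Ha){3,5} matches the same text as
((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))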


>>> haRegex = re.compile(r'(Ha){3}')
>>> mo1 = haRegex.search('HaHaHa')
>>> mo1.group()
'HaHaHa'
>>> mo2 = haRegex.search('Ha')
>>> mo2 == None
True
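
Either number in the curly brackets can also be left out, which leaves that end of the range unbounded. The sketch below illustrates both forms; the variable names are chosen here for illustration.

```python
import re

# {3,} leaves the maximum unbounded: three or more repetitions.
at_least_three = re.compile(r'(Ha){3,}')
print(at_least_three.search('HaHaHaHa').group())  # HaHaHaHa
print(at_least_three.search('HaHa') is None)       # True

# {,2} leaves the minimum at zero: up to two repetitions.
at_most_two = re.compile(r'(Ha){,2}')
print(at_most_two.search('HaHaHa').group())        # HaHa
```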

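8. Greedy and non-greedy matching

Python's regular expressions are greedy by default: when several match lengths are possible, they match the longest string they can. The non-greedy version of the curly brackets, written with a question mark after the closing bracket, matches the shortest string possible.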

>>> greedyHaRegex = re.compile(r'(Ha){3,5}')
>>> mo1 = greedyHaRegex.search('HaHaHaHaHa')
>>> mo1.group()
'HaHaHaHaHa'
>>> nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
>>> mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
>>> mo2.group()
'HaHaHa'

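9. The findall() method

Where search() returns only the first match, findall() returns a list containing every match in the searched string.

When called on a regex with no groups, such as \d\d\d-\d\d\d-\d\d\d\d, findall() returns a list of matched strings.

When called on a regex with groups, such as (\d\d\d)-(\d\d\d)-(\d\d\d\d), findall() returns a list of tuples of strings, with one string per group.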

>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
['415-555-9999', '212-555-0000']


>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # has groups
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
[('415', '555', '9999'), ('212', '555', '0000')]

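10. Character classes

Shorthand character class	Represents
\d	Any numeric digit from 0 to 9
\D	Any character that is not a numeric digit from 0 to 9
\w	Any letter, numeric digit, or underscore
\W	Any character that is not a letter, numeric digit, or underscore
\s	Any space, tab, or newline character
\S	Any character that is not a space, tab, or newline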

>>> xmasRegex = re.compile(r'\d+\s\w+')
>>> xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')
['12 drummers', '11 pipers', '10 lords', '9 ladies', '8 maids', '7 swans', '6 geese', '5 rings', '4 birds', '3 hens', '2 doves', '1 partridge']
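
The uppercase classes in the table match the opposite of their lowercase counterparts. The sketch below shows this on a made-up sample string; the text and variable names are illustrative only.

```python
import re

text = 'Room 101, Floor 3'

# \D+ matches runs of characters that are not digits.
non_digits = re.compile(r'\D+').findall(text)
print(non_digits)      # ['Room ', ', Floor ']

# \S+ matches runs of characters that are not whitespace.
non_spaces = re.compile(r'\S+').findall(text)
print(non_spaces)      # ['Room', '101,', 'Floor', '3']

# \W+ matches runs of characters that are not letters, digits, or underscores.
non_word_chars = re.compile(r'\W+').findall(text)
print(non_word_chars)  # [' ', ', ', ' ']
```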

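11. Making your own character classes

Define your own character class with square brackets. For example, the character class [aeiouAEIOU] matches any vowel, both lowercase and uppercase.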

>>> vowelRegex = re.compile(r'[aeiouAEIOU]')
>>> vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')
['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

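You can also use a hyphen to give a range of letters or digits. For example, the character class [a-zA-Z0-9] matches all lowercase letters, uppercase letters, and digits.

Placing a caret (^) right after the opening square bracket creates a negative character class, which matches every character that is not in the class.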
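
As a small sketch of a range-based class, the example below pulls runs of letters and digits out of a made-up string; the sample text and the name alnum_regex are illustrative.

```python
import re

# One or more characters from the ranges a-z, A-Z, and 0-9.
alnum_regex = re.compile(r'[a-zA-Z0-9]+')

words = alnum_regex.findall('User_42 logged in at 09:30!')
print(words)  # ['User', '42', 'logged', 'in', 'at', '09', '30']
```

The underscore, colon, and exclamation mark are not in the class, so they split the text into separate matches.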

>>> consonantRegex = re.compile(r'[^aeiouAEIOU]')
>>> consonantRegex.findall('RoboCop eats baby food. BABY FOOD.')
['R', 'b', 'C', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.', ' ', 'B', 'B', 'Y', ' ', 'F', 'D', '.']

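12. The caret and dollar sign characters

Put a caret (^) at the start of a regex to require that the match begin at the start of the searched text. Similarly, put a dollar sign ($) at the end of a regex to require that the string end with the pattern. Use ^ and $ together to require that the entire string match the pattern.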

>>> beginsWithHello = re.compile(r'^Hello')
>>> beginsWithHello.search('Hello world!')
<_sre.SRE_Match object; span=(0, 5), match='Hello'>
>>> beginsWithHello.search('He said hello.') == None
True


>>> endsWithNumber = re.compile(r'\d$')
>>> endsWithNumber.search('Your number is 42')
<_sre.SRE_Match object; span=(16, 17), match='2'>
>>> endsWithNumber.search('Your number is forty two.') == None
True


>>> wholeStringIsNum = re.compile(r'^\d+$')
>>> wholeStringIsNum.search('1234567890')
<_sre.SRE_Match object; span=(0, 10), match='1234567890'>
>>> wholeStringIsNum.search('12345xyz67890') == None
True
>>> wholeStringIsNum.search('12 34567890') == None
True

13. The wildcard character

The . (dot) character is called a wildcard. It matches any single character except a newline.

>>> atRegex = re.compile(r'.at')
>>> atRegex.findall('The cat in the hat sat on the flat mat.')
['cat', 'hat', 'sat', 'lat', 'mat']

The .* pattern matches any text: zero or more of any character except a newline.

>>> nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
>>> mo = nameRegex.search('First Name: Al Last Name: Sweigart')
>>> mo.group(1)
'Al'
>>> mo.group(2)
'Sweigart'

The .* pattern is greedy. As before, add a question mark (.*?) to make it non-greedy.

>>> nongreedyRegex = re.compile(r'<.*?>')
>>> mo = nongreedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man>'
>>> greedyRegex = re.compile(r'<.*>')
>>> mo = greedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man> for dinner.>'

You can also make the dot match newline characters by passing re.DOTALL as the second argument to re.compile().

>>> noNewlineRegex = re.compile('.*')
>>> noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
'Serve the public trust.'

>>> newlineRegex = re.compile('.*', re.DOTALL)
>>> newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
'Serve the public trust.\nProtect the innocent.\nUphold the law.'

14. Case-insensitive matching

To make a regex case-insensitive, pass re.IGNORECASE or re.I as the second argument to re.compile().

>>> robocop = re.compile(r'robocop', re.I)
>>> robocop.search('RoboCop is part man, part machine, all cop.').group()
'RoboCop'
>>> robocop.search('ROBOCOP protects the innocent.').group()
'ROBOCOP'
>>> robocop.search('Al, why does your programming book talk about robocop so much?').group()
'robocop'

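15. Substituting strings with the sub() method

The sub() method of a Regex object takes two arguments. The first is the string that replaces each match. The second is the string to search, that is, the text the regular expression is applied to. sub() returns the string with the substitutions applied.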

>>> namesRegex = re.compile(r'Agent \w+')
>>> namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
'CENSORED gave the secret documents to CENSORED.'

In the first argument to sub(), you can type \1, \2, \3, and so on, meaning "insert the text of group 1, 2, 3, and so on in the replacement". For example:

>>> agentNamesRegex = re.compile(r'Agent (\w)\w*')
>>> agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
'A**** told C**** that E**** knew B**** was a double agent.'

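16. Managing complex regexes

When a regular expression gets complicated, you can tell re.compile() to ignore whitespace and comments inside the pattern string by passing re.VERBOSE as the second argument. The pattern can then be spread over several lines with a comment on each part.

phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext.)\s*\d{2,5})?)')


# can be rewritten as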
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?          # area code
(\s|-|\.)?         # separator
\d{3}       # first 3 digits
(\s|-|\.)        # separator
\d{4}         # last 4 digits
(\s*(ext|x|ext.)\s*\d{2,5})?         # extension
)''', re.VERBOSE)
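
To see that the verbose form behaves like any other compiled regex, the sketch below runs it over a made-up sentence. Because the pattern contains groups, findall() returns tuples; the outermost group is the first entry of each tuple and holds the whole phone number. The sample text and the name numbers are illustrative.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)

text = 'Call 415-555-1011 or (415) 555-9999 ext 123.'

# Keep only the outermost group (the full number) from each tuple.
numbers = [groups[0] for groups in phone_regex.findall(text)]
print(numbers)  # ['415-555-1011', '(415) 555-9999 ext 123']
```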

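17. Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE

re.compile() takes only a single value as its second argument. To use several options at once, combine them with the bitwise OR operator (|).

someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)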
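
The sketch below shows all three options having an effect on a single pattern; the pattern, sample text, and variable names are made up for illustration.

```python
import re

greeting_regex = re.compile(r'''
    hello   # opening word, in any letter case (re.IGNORECASE)
    .*      # anything in between, including newlines (re.DOTALL)
    world   # closing word, in any letter case
    ''', re.IGNORECASE | re.DOTALL | re.VERBOSE)

mo = greeting_regex.search('HELLO there,\nbig wide World!')
print(repr(mo.group()))  # 'HELLO there,\nbig wide World'
```

Without re.DOTALL the dot would stop at the newline and the search would return None; without re.IGNORECASE neither 'HELLO' nor 'World' would match.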

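Exercise 1:

Write a function that uses regular expressions to make sure the password string it is passed is strong. A strong password is defined as one that is at least eight characters long, contains both uppercase and lowercase characters, and has at least one digit. You may need to test the string against multiple regex patterns to validate its strength.

import re

def is_strong_password(password):
    # Each pattern checks one requirement: length of at least 8,
    # a lowercase letter, an uppercase letter, and a digit.
    checks = [r'.{8,}', r'[a-z]', r'[A-Z]', r'\d']
    # The password is strong only if every pattern matches.
    return all(re.search(pattern, password) for pattern in checks)

password = input('Please input your password: ')
if is_strong_password(password):
    print('Password is strong')
else:
    print('Password is not strong enough')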

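Exercise 2:

Write a function that takes a string and does the same thing as the strip() string method. If no other arguments are passed other than the string to strip, then whitespace characters are removed from the beginning and end of the string. Otherwise, the characters specified in the second argument to the function are removed from the beginning and end of the string.

import re

def re_strip(text, chars=None):
    if not chars:
        # No characters given: strip whitespace from both ends.
        pattern = r'^\s+|\s+$'
    else:
        # Strip any run of the given characters from both ends.
        # re.escape() keeps characters such as ] or ^ literal.
        char_class = '[' + re.escape(chars) + ']'
        pattern = '^' + char_class + '+|' + char_class + '+$'
    return re.sub(pattern, '', text)

text = input('Enter a string:\n')
chars = input('Characters to remove (leave blank for whitespace):\n')
print('Result: ' + re_strip(text, chars))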