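Python宝典 (Python Bible), Chapter 20: Powerful Regular Expressions

This article takes a detailed look at using regular expressions in Python: the basic concepts, metacharacters, the common functions match, search, findall, sub, and split, and how raw strings cut down on the number of escape characters. It also covers regular-expression groups and shows regular expressions applied to file processing.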

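A regular expression is a pattern that matches a whole class of strings sharing some common feature.

Regular expressions are used mainly for text processing, especially complex search and replace.

In Python, regular expressions are handled chiefly through the re module, which provides Perl-style regular expressions.

Metacharacters are characters that carry a special meaning in a regular expression. Placing them in a pattern lets it match the many variations a string can take.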

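.: matches any single character except a newline
*: matches zero or more repetitions of the element before the *
+: matches one or more repetitions of the element before the +
|: matches either the expression before the | or the one after it
^: matches at the start of the string (and at the start of each line with re.M)
$: matches at the end of the string (and at the end of each line with re.M)
?: matches zero or one occurrence of the element before the ?
\: escapes the character that follows it, or introduces a special sequence
[]: matches any one of the characters inside the brackets
(): treats the enclosed content as a single unit (a group)
{}: matches the number of repetitions given inside the braces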
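
To make the list above concrete, here is a minimal sketch that exercises each metacharacter once. The sample text and the dictionary keys are illustrative choices, not taken from the book.

```python
import re

# Illustrative sample: two lines, so the re.M anchors have something to show.
text = "cat cot coat ct\ncut"

results = {
    "dot": re.findall(r"c.t", text),                  # any one character between c and t
    "star": re.findall(r"co*t", text),                # zero or more o
    "plus": re.findall(r"co+t", text),                # one or more o
    "optional": re.findall(r"co?t", text),            # zero or one o
    "alternation": re.findall(r"cat|cut", text),      # either alternative
    "char_class": re.findall(r"c[au]t", text),        # a or u in the middle
    "line_start": re.findall(r"^c\w+", text, re.M),   # first word of each line
    "line_end": re.findall(r"\w+$", text, re.M),      # last word of each line
    "escaped_dot": re.findall(r"\.", "a.b"),          # a literal dot
    "repeat_group": bool(re.fullmatch(r"(ab){3}", "ababab")),  # group repeated 3 times
}

for name, value in results.items():
    print(name, value)
```

Without re.M, ^ and $ in this sketch would only anchor at the start and end of the whole string.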

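Functions provided by the re module:

Searching with match, search, and findall. match only tries the pattern at the start of the string, search scans for the first match anywhere in it, and findall returns every non-overlapping match as a list. The match objects in the transcripts print as _sre.SRE_Match, the repr used by older Python 3 releases; since Python 3.7 the same objects print as re.Match: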

>>> import re
>>> s='Life can be good'
>>> print(re.match('can', s))
None
>>> print(re.search('can',s))
<_sre.SRE_Match object; span=(5, 8), match='can'>
>>> print(re.match('l.*',s))
None
>>> print(re.match('l.*',s,re.I))
<_sre.SRE_Match object; span=(0, 16), match='Life can be good'>
>>> re.findall('[a-z]{3}',s)
['ife', 'can', 'goo']
>>> re.findall('[a-z]{1,3}',s)
['ife', 'can', 'be', 'goo', 'd']
>>> re.findall('[a-zA-Z]{3}',s)
['Lif', 'can', 'goo']

Replacing with sub (subn also returns the number of replacements made):

>>> import re
>>> s='Life can be bad'
>>> re.sub('bad', 'good', s)
'Life can be good'
>>> re.sub('bad|be', 'good', s)
'Life can good good'
>>> re.sub('bad|be', 'good', s, count=1)
'Life can good bad'
>>> re.subn('bad|be', 'good', s, count=1)
('Life can good bad', 1)
>>> r=re.subn('bad|be', 'good',s)
>>> print(r[0])
Life can good good
>>> print(r[1])
2

Splitting strings with split:

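>>> import re
>>> s='Life can be bad'
>>> re.split(' ',s)
['Life', 'can', 'be', 'bad']
>>> r=re.split(' ', s, maxsplit=1)
>>> for i in r:
...     print(i)
...
Life
can be bad
>>> re.split('b',s)
['Life can ', 'e ', 'ad']

Special sequences that begin with "\":

\b: matches at the start or end of a word (a word boundary)
\B: matches anywhere that is not a word boundary
\d: matches any digit
\D: matches any non-digit character
\s: matches any whitespace character
\S: matches any non-whitespace character
\w: matches any letter, digit, or underscore
\W: matches any character that is not a letter, digit, or underscore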

>>> import re
>>> s='python can run on Windows'
>>> re.findall('\\bo.+?\\b',s)
['on']
>>> re.findall('\\bo.+\\b',s)
['on Windows']
>>> re.findall('\\Bo.+?',s)
['on', 'ow']
>>> re.findall('\\so.+?',s)
[' on']
>>> re.findall('\\b\\w.+?\\b',s)
['python', 'can', 'run', 'on', 'Windows']
>>> re.findall('\\d\\.\\d', 'python 2.5')
['2.5']
>>> re.findall('\\D+', 'python 2.5')
['python ', '.']
>>> re.findall('\\D*', 'python 2.5')
['python ', '', '.', '', '']
>>> re.findall('\\D?', 'python 2.5')
['p', 'y', 't', 'h', 'o', 'n', ' ', '', '.', '', '']
>>> re.split('\\s',s)
['python', 'can', 'run', 'on', 'Windows']
>>> re.split('\\s',s,maxsplit=1)
['python', 'can run on Windows']
>>> re.findall('\\d\\w+?','abc3de')
['3d']

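The compile function compiles a regular expression into a pattern object (called RegexObject in older documentation, re.Pattern in current Python). The string is then processed through the methods of that object.

Writing regular expressions as raw strings (r'...') greatly reduces the number of "\" characters needed.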
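
The difference between the two spellings can be sketched as follows. The variable names are illustrative; the sample sentence is the one used in the transcripts of this article.

```python
import re

s = "python can run on Windows"

# Ordinary string literal: every regex backslash has to be doubled,
# otherwise Python interprets the escape itself ("\b" is a backspace).
escaped = "\\bo\\w*\\b"

# Raw string literal: backslashes reach the re module unchanged.
raw = r"\bo\w*\b"

same_pattern = escaped == raw           # both spell the identical pattern
pattern = re.compile(raw)               # compile once, reuse the object
matches = pattern.findall(s)            # words that start with "o"

# Forgetting to escape: "\bo" is a backspace followed by "o", so nothing matches.
backspace_matches = re.findall("\bo", s)

print(same_pattern, matches, backspace_matches)
```

Compiling is mainly worthwhile when the same pattern is applied many times, since the pattern object can be reused.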

>>> import re
>>> r=re.compile('go*d')
>>> r.match('Life can be good')
>>> r.match('Life can be good',12)
<_sre.SRE_Match object; span=(12, 16), match='good'>
>>> r.search('Life can be good')
<_sre.SRE_Match object; span=(12, 16), match='good'>
>>> r=re.compile(r'b.\sg')
>>> r.search('Life can be good')
<_sre.SRE_Match object; span=(9, 13), match='be g'>
>>> r=re.compile(r'\w.\sg')
>>> r.search('Life can be good')
<_sre.SRE_Match object; span=(9, 13), match='be g'>
>>> r=re.compile(r'\b\w..?\s')
>>> r.findall('Life can be good')
['can ', 'be ']
>>> r=re.compile(r'\b\w..\s')
>>> r.findall('Life can be good')
['can ']

>>> import re
>>> s='''Life can be good;
... Life can be bad;
... Life is mostly cheerful;
... But sometimes sad.'''
>>> r=re.compile(r'b\w*',re.I)
>>> new=r.sub('*',s)
>>> print(new)
Life can * good;
Life can * *;
Life is mostly cheerful;
* sometimes sad.
>>> new=r.sub('*',s,2)
>>> print(new)
Life can * good;
Life can * bad;
Life is mostly cheerful;
But sometimes sad.
>>> r=re.compile(r'b\w*')
>>> new=r.subn('*',s)
>>> print(new[0])
Life can * good;
Life can * *;
Life is mostly cheerful;
But sometimes sad.
>>> print(new[1])
3
>>> new=r.subn('*',s,1)
>>> print(new[0])
Life can * good;
Life can be bad;
Life is mostly cheerful;
But sometimes sad.
>>> print(new[1])
1

>>> import re
>>> s='''Life can be good;
... Life can be bad;
... Life is mostly cheerful;
... But sometimes sad.'''
>>> r=re.compile(r'\s')
>>> news=r.split(s)
>>> print(news)
['Life', 'can', 'be', 'good;', 'Life', 'can', 'be', 'bad;', 'Life', 'is', 'mostly', 'cheerful;', 'But', 'sometimes', 'sad.']
>>> news=r.split(s,4)
>>> print(news)
['Life', 'can', 'be', 'good;', 'Life can be bad;\nLife is mostly cheerful;\nBut sometimes sad.']
>>> r=re.compile(r'b\w*',re.I)
>>> news=r.split(s)
>>> print(news)
['Life can ', ' good;\nLife can ', ' ', ';\nLife is mostly cheerful;\n', ' sometimes sad.']
>>> news=r.split(s,1)
>>> print(news)
['Life can ', ' good;\nLife can be bad;\nLife is mostly cheerful;\nBut sometimes sad.']
>>> r=re.compile(r'\w*e',re.I)
>>> news=r.split(s)
>>> print(news)
['', ' can ', ' good;\n', ' can ', ' bad;\n', ' is mostly ', 'rful;\nBut ', 's sad.']

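Regular expression groups:

Using groups splits a regular expression into several parts; once a match or search has completed, the text matched by each part can be retrieved by its group number.

(?P<name>...): gives the group a name, so its match can also be retrieved by that name.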

>>> import re
>>> s='Phone No. 010-87654321'
>>> r=re.compile(r'(\d+)-(\d+)')
>>> m=r.search(s)
>>> m
<_sre.SRE_Match object; span=(10, 22), match='010-87654321'>
>>> m.group(0)
'010-87654321'
>>> m.group(1)
'010'
>>> m.group(2)
'87654321'
>>> m.groups()
('010', '87654321')
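
The transcript above uses numbered groups only. As a minimal sketch of the (?P<name>...) form on the same phone number, with the group names area and number chosen here purely for illustration:

```python
import re

s = "Phone No. 010-87654321"

# Same pattern as above, but each group carries a (hypothetical) name.
r = re.compile(r"(?P<area>\d+)-(?P<number>\d+)")
m = r.search(s)

area = m.group("area")        # look a group up by name
number = m.group("number")
by_index = m.group(1)         # numbered access still works for named groups
as_dict = m.groupdict()       # all named groups as a dictionary

print(area, number, by_index, as_dict)
```

Named groups tend to pay off once a pattern has more than two or three groups, because inserting a new group no longer shifts the numbers the rest of the code relies on.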

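Groups have a number of extension syntaxes:

(?iLmsux): sets matching flags
(?:...): groups the subexpression without capturing what it matches
(?P=name): matches the same text as the earlier group called name
(?#...): a comment; its content is ignored
(?=...): lookahead, written after an expression; the expression matches only if ... follows it, and the text matched by ... is not part of the result
(?!...): negative lookahead, written after an expression; the expression matches only if ... does not follow it
(?<=...): lookbehind, written before an expression; the counterpart of (?=...), requiring that ... comes immediately before
(?<!...): negative lookbehind, written before an expression; the counterpart of (?!...), requiring that ... does not come immediately before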

>>> import re
>>> s='''Life can be good;
... Life can be bad;
... Life is mostly cheerful;
... But sometimes sad.'''
>>> r=re.compile(r'be(?=\sgood)')
>>> m=r.search(s)
>>> m
<_sre.SRE_Match object; span=(9, 11), match='be'>
>>> m.span()
(9, 11)
>>> r.findall(s)
['be']
>>> r=re.compile('be')
>>> r.findall(s)
['be', 'be']
>>> r=re.compile(r'be(?!\sgood)')
>>> m=r.search(s)
>>> m
<_sre.SRE_Match object; span=(27, 29), match='be'>
>>> m.span()
(27, 29)
>>> r=re.compile(r'(?:can\s)be(\sgood)')
>>> m=r.search(s)
>>> m
<_sre.SRE_Match object; span=(5, 16), match='can be good'>
>>> m.groups()
(' good',)
>>> m.group(1)
' good'
>>> r.findall(s)
[' good']
>>> r=re.compile(r'(?P<first>\w)(?P=first)')
>>> r.findall(s)
['o', 'e']
>>> r=re.compile(r'(?<=can\s)b\w*\b')
>>> r.findall(s)
['be', 'be']
>>> r=re.compile(r'(?<!can\s)b\w*\b')
>>> r.findall(s)
['bad']
>>> r=re.compile(r'(?i)(?<!can\s)b\w*\b')
>>> r.findall(s)
['bad', 'But']

start(), end(), and span() return the indices of the matched substring.

Processing files with regular expressions:
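
A short sketch of these three methods, reusing the phone number string from the grouping example; passing a group number to them is standard behaviour of match objects, and the variable names are illustrative.

```python
import re

s = "Phone No. 010-87654321"
m = re.search(r"(\d+)-(\d+)", s)

whole = (m.start(), m.end())      # indices of the entire match
whole_span = m.span()             # the same pair in one call
first_span = m.span(1)            # indices of group 1 only
second = s[m.start(2):m.end(2)]   # slicing with the indices recovers group 2

print(whole, whole_span, first_span, second)
```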

# -*- coding:utf-8 -*-
# file: GetFunction.py
#

import re
import sys

def DealWithFunc(s):
    # Match "name(args)" that follows "def " and is followed by ":".
    r=re.compile(r'(?<=def\s)\w+\(.*?\)(?=:)',re.X|re.U)
    return r.findall(s)

def DealWithVar(s):
    names=[]
    # Names directly to the left of an "=".
    r=re.compile(r'\b\w+(?=\s?=)',re.X|re.U)
    names.extend(r.findall(s))
    # Loop variables in "for <name> in".
    r=re.compile(r'(?<=for\s)\w+\s(?=in)',re.X|re.U)
    names.extend(r.findall(s))
    return names

if len(sys.argv)==1:
    #sour=input('Enter the path of the file to process: ')
    sour='pipei.py'
else:
    sour=sys.argv[1]
# Author's note: the name file could not be looked up here, reason unknown.
# The with block closes the file even if reading fails.
with open(sour, encoding='utf-8') as file:
    s=file.readlines()
print('*****************************************')
print('Functions in', sour+':')
print('*****************************************')
for i,line in enumerate(s,1):
    function=DealWithFunc(line)
    if len(function)==1:
        print('Line: ',i,'\t',function[0])
print('*****************************************')
print('Variables in', sour+':')
print('*****************************************')
for i,line in enumerate(s,1):
    var=DealWithVar(line)
    if len(var)==1:
        print('Line: ',i,'\t',var[0])
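
To try the two main patterns without a file on disk, the following sketch runs them over an in-memory string. The patterns are copied from the script with the flags left out; the names FUNC_PATTERN, ASSIGN_PATTERN, scan, and the sample source are hypothetical and not part of the original script.

```python
import re

# Patterns copied from the script above (flags omitted for brevity).
FUNC_PATTERN = re.compile(r'(?<=def\s)\w+\(.*?\)(?=:)')
ASSIGN_PATTERN = re.compile(r'\b\w+(?=\s?=)')

# Hypothetical sample input standing in for the contents of a .py file.
sample = '''def add(a, b):
    total = a + b
    return total
'''

def scan(source):
    """Return (line_number, kind, text) tuples for every hit in source."""
    found = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name in FUNC_PATTERN.findall(line):
            found.append((lineno, 'function', name))
        for name in ASSIGN_PATTERN.findall(line):
            found.append((lineno, 'variable', name))
    return found

hits = scan(sample)
for lineno, kind, name in hits:
    print('Line:', lineno, kind, name)
```

Because the patterns work line by line on plain text, they know nothing about Python syntax: an "=" inside a default argument or a comparison such as "==" would also be reported as a variable, so the output is best treated as a rough listing.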