Regular Expressions

Latest recommended article published on 2024-05-20 10:35:51

License: CC 4.0 BY-SA

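Regular expressions make it easy to filter strings and to check whether a string matches a pattern. They are very useful in string processing and in filtering the data a web crawler collects.

Below are the relevant functions and their signatures.

Notes on the re module functions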

re.compile(pattern[, flags])

# The functions below are used for matching

re.match(pattern, string[, flags])

re.search(pattern, string[, flags])

re.split(pattern, string[, maxsplit])

re.findall(pattern, string[, flags])

re.finditer(pattern, string[, flags])

re.sub(pattern, repl, string[, count])

re.subn(pattern, repl, string[, count])
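The list above includes re.sub and re.subn, which replace matches instead of just finding them. Here is a minimal sketch of how they behave. The sample text and variable names are assumptions chosen for illustration, not part of the original walkthrough.

```python
import re

text = 'one1two22three333'  # illustrative sample text (an assumption)

# Replace every run of digits with a dash.
replaced = re.sub(r'\d+', '-', text)

# count limits how many matches are replaced.
first_only = re.sub(r'\d+', '-', text, count=1)

# subn returns a tuple: (new string, number of replacements made).
result, n = re.subn(r'\d+', '-', text)

# repl may also be a function that receives each Match object.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)

print(replaced)    # one-two-three-
print(first_only)  # one-two22three333
print(result, n)   # one-two-three- 3
print(doubled)     # one2two44three666
```

Passing a function as repl is handy when the replacement depends on the matched text itself.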

pattern = re.compile(s)  # compile the string s into a Pattern object

Then call the methods below.
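As a minimal sketch of that compile-then-call workflow: a compiled Pattern object has its own match, search and findall methods, so the pattern is written once and reused. The variable names and the sample text here are illustrative assumptions.

```python
import re

# Compile once, then reuse the Pattern object's own methods.
number = re.compile(r'\d+')

text = 'one1two22three333'  # illustrative sample text (an assumption)

first = number.search(text)         # first match anywhere in the string
all_numbers = number.findall(text)  # every non-overlapping match
at_start = number.match(text)       # None: the text does not start with a digit

print(first.group())  # 1
print(all_numbers)    # ['1', '22', '333']
print(at_start)       # None
```

The module-level functions such as re.search(pattern, string) accept either a pattern string or a compiled Pattern as their first argument, so both styles can be mixed.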

re.match(pattern, string[, flags]): matches from the beginning of the string. Once the pattern has matched, it stops and does not scan any further.

import re
# Match the following: word + space + word + any characters
# The third group is a named group called "sign"
m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

print("m.string:", m.string)
print("m.re:", m.re)
print("m.pos:", m.pos)
print("m.endpos:", m.endpos)
print("m.lastindex:", m.lastindex)
print("m.lastgroup:", m.lastgroup)
print("m.group():", m.group())
print("m.group(1,2):", m.group(1, 2))
print("m.groups():", m.groups())
print("m.groupdict():", m.groupdict())
print("m.start(2):", m.start(2))
print("m.end(2):", m.end(2))
print("m.span(2):", m.span(2))
print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))

### output ###
# m.string: hello world!
# m.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)')
# m.pos: 0
# m.endpos: 12
# m.lastindex: 3
# m.lastgroup: sign
# m.group(): hello world!
# m.group(1,2): ('hello', 'world')
# m.groups(): ('hello', 'world', '!')
# m.groupdict(): {'sign': '!'}
# m.start(2): 6
# m.end(2): 11
# m.span(2): (6, 11)
# m.expand(r'\2 \1\3'): world hello!

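re.search(pattern, string[, flags]): similar to re.match.

The difference is that match only tries to match at position 0, while search scans the whole string and returns the first place where the pattern matches.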
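The difference is easiest to see side by side. This is a small sketch using the same 'hello world!' string as above; re.fullmatch is included only as a point of comparison and is not covered by the original text.

```python
import re

text = 'hello world!'

# match only looks at the start of the string, so 'world' is not found.
m_world = re.match(r'world', text)

# search scans forward and finds 'world' at index 6.
s_world = re.search(r'world', text)

# match succeeds when the pattern fits at position 0;
# it does not need to consume the whole string.
m_hello = re.match(r'hello', text)

# fullmatch requires the entire string to match the pattern.
f_hello = re.fullmatch(r'hello', text)

print(m_world)         # None
print(s_world.span())  # (6, 11)
print(m_hello.span())  # (0, 5)
print(f_hello)         # None
```

Both functions return None when nothing matches, so check the result before calling group() or span().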

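re.split(pattern, string[, maxsplit]): splits the string at every place where the pattern matches and returns the pieces as a list.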

import re

pattern = re.compile(r'\d+')
print(re.split(pattern, 'one1two2three3four4'))

### output ###
# ['one', 'two', 'three', 'four', '']
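Note the trailing empty string in that result: the input ends with a separator, so split reports an empty piece after it. The sketch below shows a few variations on the same input, limiting the number of splits, keeping the separators with a capturing group, and dropping empty pieces. The variable names are illustrative.

```python
import re

text = 'one1two2three3four4'

# maxsplit limits the number of splits; the rest of the string stays intact.
limited = re.split(r'\d+', text, maxsplit=2)

# A capturing group in the pattern keeps the separators in the result.
with_separators = re.split(r'(\d+)', 'one1two2')

# Filter out empty pieces such as the trailing ''.
non_empty = [piece for piece in re.split(r'\d+', text) if piece]

print(limited)          # ['one', 'two', 'three3four4']
print(with_separators)  # ['one', '1', 'two', '2', '']
print(non_empty)        # ['one', 'two', 'three', 'four']
```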

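re.findall(pattern, string[, flags]): returns all non-overlapping matches as a list. This one is used very often, especially when extracting information from web pages.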

import re
 
pattern = re.compile(r'\d+')
print(re.findall(pattern, 'one1two2three3four4'))

### output ###
# ['1', '2', '3', '4']
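Two related behaviours are worth a quick sketch: when the pattern contains capturing groups, findall returns the groups rather than the whole match, and re.finditer (listed earlier) yields Match objects so that positions are available too. The sample text and names below are assumptions for illustration.

```python
import re

text = 'one1two22three333'  # illustrative sample text (an assumption)

# With two capturing groups, findall returns a list of tuples.
pairs = re.findall(r'([a-z]+)(\d+)', text)

# finditer yields Match objects one at a time, which expose
# both the matched text and its position in the string.
spans = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]

print(pairs)  # [('one', '1'), ('two', '22'), ('three', '333')]
print(spans)  # [('1', (3, 4)), ('22', (7, 9)), ('333', (14, 17))]
```

finditer is the better fit for large inputs because it does not build the whole result list up front.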

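Finally, here is a concrete example of how this is used when extracting information from a web page.

After crawling the page content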