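Regular expressions make it easy to filter strings and to check whether a string matches a pattern. They are very useful in string processing and in filtering the information collected by a crawler.
Below are the functions used for matching.
Notes on regular expressions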
re.compile(pattern[, flags])
# The following functions are used for matching
re.match(pattern, string[, flags])
re.search(pattern, string[, flags])
re.split(pattern, string[, maxsplit])
re.findall(pattern, string[, flags])
re.finditer(pattern, string[, flags])
re.sub(pattern, repl, string[, count])
re.subn(pattern, repl, string[, count])
pattern = re.compile(string)  # compiles the string into a Pattern object
The functions below are then called with that pattern.
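To make the compile step concrete, here is a minimal sketch in Python 3 syntax. The pattern, the sample text and the variable names are illustrative assumptions, not taken from the original note. It shows that a compiled pattern can be used either through its own methods or by passing it to the module-level functions.

```python
import re

# Compile once and reuse; re.IGNORECASE is one of the optional flags.
word_pattern = re.compile(r'[a-z]+', re.IGNORECASE)

# Methods on the compiled object mirror the module-level functions.
first = word_pattern.match('Hello world')
all_words = word_pattern.findall('Hello world')

# Equivalent module-level call: the compiled pattern is the first argument.
same_words = re.findall(word_pattern, 'Hello world')

print(first.group())   # Hello
print(all_words)       # ['Hello', 'world']
print(same_words)      # ['Hello', 'world']
```

Compiling is mostly a convenience when the same pattern is applied many times, for example once per crawled page.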
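re.match(pattern, string[, flags]): tries to match the pattern starting at the beginning of the string. Once the pattern has matched, matching stops and the rest of the string is not scanned.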
import re

# Match: word + space + word + any remaining characters (named group "sign")
m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

print("m.string:", m.string)
print("m.re:", m.re)
print("m.pos:", m.pos)
print("m.endpos:", m.endpos)
print("m.lastindex:", m.lastindex)
print("m.lastgroup:", m.lastgroup)
print("m.group():", m.group())
print("m.group(1,2):", m.group(1, 2))
print("m.groups():", m.groups())
print("m.groupdict():", m.groupdict())
print("m.start(2):", m.start(2))
print("m.end(2):", m.end(2))
print("m.span(2):", m.span(2))
print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))

### output (Python 3) ###
# m.string: hello world!
# m.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)')
# m.pos: 0
# m.endpos: 12
# m.lastindex: 3
# m.lastgroup: sign
# m.group(): hello world!
# m.group(1,2): ('hello', 'world')
# m.groups(): ('hello', 'world', '!')
# m.groupdict(): {'sign': '!'}
# m.start(2): 6
# m.end(2): 11
# m.span(2): (6, 11)
# m.expand(r'\2 \1\3'): world hello!
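re.search(pattern, string[, flags]): similar to re.match.
The difference is that match only tries to match at position 0, while search scans the whole string and returns the first match it finds.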
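The difference between the two can be seen in a short sketch (Python 3 syntax; the sample text is an assumption chosen for illustration):

```python
import re

text = 'say hello'

# match only succeeds if the pattern matches at position 0.
at_start = re.match(r'hello', text)

# search scans forward and returns the first match anywhere in the string.
anywhere = re.search(r'hello', text)

print(at_start)          # None
print(anywhere.span())   # (4, 9)
```

Because both functions return None when nothing matches, the result should be checked before calling methods such as group() on it.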
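re.split(pattern, string[, maxsplit]): splits the string at every place the pattern matches and returns the pieces as a list.
import re
pattern = re.compile(r'\d+')
print(re.split(pattern, 'one1two2three3four4'))
### output ###
# ['one', 'two', 'three', 'four', '']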
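The optional maxsplit argument and the trailing empty string in the output above are easy to overlook. The following is a small sketch in Python 3 syntax; the variable names are illustrative.

```python
import re

pattern = re.compile(r'\d+')
text = 'one1two2three3four4'

# maxsplit limits the number of splits; the remainder is kept as one piece.
limited = re.split(pattern, text, maxsplit=2)
print(limited)    # ['one', 'two', 'three3four4']

# The string ends with a separator, so an empty piece appears at the end;
# filtering out empty strings is one simple way to drop it.
non_empty = [piece for piece in re.split(pattern, text) if piece]
print(non_empty)  # ['one', 'two', 'three', 'four']
```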
re.findall(pattern, string[, flags]): this one is used very frequently, especially when extracting information from web pages.
import re
pattern = re.compile(r'\d+')
print(re.findall(pattern, 'one1two2three3four4'))
### output ###
# ['1', '2', '3', '4']
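The functions from the list at the top that are not demonstrated in this note are re.finditer, re.sub and re.subn. As a minimal sketch (Python 3 syntax, reusing the same sample string; the replacement text '-' is an arbitrary choice), they behave roughly as follows:

```python
import re

pattern = re.compile(r'\d+')
text = 'one1two2three3four4'

# finditer yields match objects one by one, so positions are available too.
found = [(m.group(), m.start()) for m in re.finditer(pattern, text)]
print(found)      # [('1', 3), ('2', 7), ('3', 13), ('4', 18)]

# sub replaces every match with repl; count limits the number of replacements.
replaced = re.sub(pattern, '-', text)
replaced_twice = re.sub(pattern, '-', text, count=2)
print(replaced)        # one-two-three-four-
print(replaced_twice)  # one-two-three3four4

# subn returns the new string together with the number of replacements made.
result, n = re.subn(pattern, '-', text)
print(result, n)       # one-two-three-four- 4
```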
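Finally, a concrete example of how this is used when extracting information from a web page,
after the page content has been crawled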
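As a rough illustration of this kind of extraction, the sketch below pulls links out of a page with re.findall (Python 3 syntax). The HTML snippet, the pattern and all names are hypothetical and stand in for whatever a real crawler would download. They are not the author's own example.

```python
import re

# Hypothetical page source; in practice this would be the body of a response.
html = '''
<ul>
  <li><a href="/post/1">First post</a></li>
  <li><a href="/post/2">Second post</a></li>
</ul>
'''

# Two capture groups: the href value and the link text (both non-greedy).
link_pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')

# With more than one group, findall returns a list of tuples.
links = re.findall(link_pattern, html)
for url, title in links:
    print(url, title)
```

A pattern like this only works while the markup keeps exactly this shape, so it suits quick extraction from a known page rather than general HTML parsing.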