Simple use of regex matching in Python web scraping

Licensed under CC 4.0 BY-SA

Tags: #regular-expressions #re #python

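This post shows two uses of regular expressions in Python. The first picks the target site's URLs out of a list that contains three forms of URL. The second uses re.match to analyze strings, for example to detect a heading with no content or a string that starts with a digit. re.match returns a Match object on success and None on failure, so the result can be used directly as a true/false test.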

Part 1: Using a regex to select pages whose URLs follow a pattern

f=['http://zhongyaofangji.com/yaofang/acaitang.html', 'http://zhongyaofangji.com/yaofang/alajijiu.html', 
'http://zhongyaofangji.com/yaofang/book-bojifang.html',
'http://zhongyaofangji.com/#A']

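# Start by analyzing the links. This URL list contains three forms of URL: the target pages (http://zhongyaofangji.com/yaofang/ + item name + .html) and two kinds of noise (http://zhongyaofangji.com/yaofang/ + a name starting with book- + .html, and http://zhongyaofangji.com/#A). A regular expression can pick out the target URLs.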

import re
f=['http://zhongyaofangji.com/yaofang/acaitang.html', 'http://zhongyaofangji.com/yaofang/alajijiu.html', 
'http://zhongyaofangji.com/yaofang/book-bojifang.html',
'http://zhongyaofangji.com/#A']
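# Escape the dots so "." matches a literal dot rather than any character,
# and use "+" so the page name must contain at least one lowercase letter.
for key in f:
    if re.match(r"^http://zhongyaofangji\.com/yaofang/[a-z]+\.html$", key):
        print(key)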


#http://zhongyaofangji.com/yaofang/acaitang.html
#http://zhongyaofangji.com/yaofang/alajijiu.html
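
The same filter can be wrapped in a small reusable function. The following is only a sketch: the names TARGET_URL and select_target_urls are made up for illustration and do not come from the original post. It compiles the pattern once and uses fullmatch, which requires the whole string to match, so the explicit ^ and $ anchors are no longer needed.

```python
import re

# Hypothetical names, chosen for this example only.
# The dots are escaped so they match a literal "." and nothing else.
TARGET_URL = re.compile(r"http://zhongyaofangji\.com/yaofang/[a-z]+\.html")


def select_target_urls(urls):
    """Return only the URLs that match the target pattern in full."""
    return [url for url in urls if TARGET_URL.fullmatch(url)]


urls = [
    "http://zhongyaofangji.com/yaofang/acaitang.html",
    "http://zhongyaofangji.com/yaofang/alajijiu.html",
    "http://zhongyaofangji.com/yaofang/book-bojifang.html",
    "http://zhongyaofangji.com/#A",
]

targets = select_target_urls(urls)
for url in targets:
    print(url)
```

The book- URL is rejected because "-" is not in [a-z], and the #A URL is rejected because it has no /yaofang/ path at all.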

Part 2: Analyzing strings with regular expressions

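re.match(r'pattern', content) # returns a Match object if the start of content matches, otherwise None, so it can be used as a true/false test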

1. re.match(r'^【.*】$', content) # matches a heading with no content, i.e. a line that consists only of 【...】

2. re.match(r'^\d', content) # matches a string that starts with a digit
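
As a rough illustration of the two patterns above, the sketch below wraps each one in a helper that returns a real bool. The function names and the sample strings are invented for this example and are not taken from the original post.

```python
import re


def is_empty_heading(line):
    """True if the line consists only of a 【...】 heading."""
    return re.match(r'^【.*】$', line) is not None


def starts_with_digit(line):
    """True if the first character of the line is a digit."""
    return re.match(r'^\d', line) is not None


# Made-up sample lines, for illustration only.
samples = [
    '【Usage】',
    '【Usage】take twice daily',
    '3 g licorice',
    'licorice 3 g',
]

for line in samples:
    print(repr(line), is_empty_heading(line), starts_with_digit(line))
```

Because re.match only tries the pattern at the start of the string, 'licorice 3 g' does not count as starting with a digit even though it contains one. Note also that for str patterns in Python 3, \d matches any Unicode decimal digit, including full-width digits; use [0-9] if only ASCII digits should count.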