Regular Expressions

Latest recommended article published on 2023-12-03 19:38:16

rongDang

Category: python爬虫 (Python web scraping)

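Main contents

Common regular expression symbols
The re module and its functions
Example 1: scraping the full text of the novel 斗破苍穹 (Battle Through the Heavens)
Example 2: scraping posts from 糗事百科 (Qiushibaike)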

Common symbols in regular expressions

1. Ordinary characters

2. Predefined character sets

3. Quantifiers

4. Boundary matching
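
The four categories listed above can be illustrated with a short sketch. The sample strings and variable names are made up for illustration, and only a few symbols from each category are shown.

```python
import re

text = "Order 66 shipped on 2023-12-03, order 7 pending."

# Ordinary characters: letters match themselves, "." matches any single
# character except a newline.
any_char = re.findall(r"a.c", "abc a-c ac")        # ['abc', 'a-c']

# Predefined character sets: \d is a digit, \w a word character, \s whitespace.
digits = re.findall(r"\d", "a1 b22")               # ['1', '2', '2']

# Quantifiers: + means one or more, {4} means exactly four repetitions.
numbers = re.findall(r"\d+", text)                 # ['66', '2023', '12', '03', '7']
years = re.findall(r"\d{4}", text)                 # ['2023']

# Boundary matching: ^ anchors the start, $ the end, \b a word boundary.
starts_with_order = re.search(r"^Order", text) is not None    # True
whole_words = re.findall(r"\bcat\b", "cat concat cats cat")   # ['cat', 'cat']
```

Raw strings (the `r"..."` prefix) keep backslashes such as `\d` and `\b` from being interpreted by Python before the pattern reaches the re module.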

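The search() function in the re module

The search() function scans a string and extracts the first piece of text that matches the pattern. It returns a match object, or None if nothing matches. The basic syntax is:

re.search(pattern, string, flags=0)

1. pattern is the regular expression to match.
2. string is the string to search.
3. flags are optional flags that control how matching is done, for example whether case is ignored (re.I) or whether matching works across multiple lines.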

Example:

import re
a = "123abc456"
pattern = r"([0-9]*)([a-z]*)([0-9]*)"
print(re.search(pattern, a).group(0))   # 123abc456, the whole match
print(re.search(pattern, a).group(1))   # 123
print(re.search(pattern, a).group(2))   # abc
print(re.search(pattern, a).group(3))   # 456

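The group() function returns the matched result by group.

In the example above, group() and group(0) return the same thing: the text matched by the whole regular expression. group(1) returns the text matched by the first pair of parentheses, group(2) the second, and so on.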
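
Because search() returns None when nothing matches, calling group() directly on its result can raise an AttributeError. The sketch below shows one way to guard against that. It uses the `+` quantifier rather than `*`, so that a string without digits really fails to match; the sample strings are made up for illustration.

```python
import re

pattern = r"([0-9]+)([a-z]+)([0-9]+)"

match = re.search(pattern, "123abc456")
if match is not None:
    whole = match.group()     # same as match.group(0): '123abc456'
    parts = match.groups()    # all captured groups: ('123', 'abc', '456')

# No digits in this string, so search() returns None instead of a match object.
no_match = re.search(pattern, "no digits here")
result = no_match.group(0) if no_match else "no match"
```

Checking the return value before calling group() is the usual pattern whenever the input is not guaranteed to match.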

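The sub() function in the re module

The re module provides sub() for replacing the parts of a string that match a pattern. Its basic syntax is:

re.sub(pattern, repl, string, count=0, flags=0)

1. pattern is the regular expression to match.
2. repl is the replacement string.
3. string is the original string to search and replace in.
4. count is the maximum number of replacements; the default 0 replaces every match.
5. flags are the same flags accepted by search().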
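
A short sketch of sub() in use, including the count parameter. The phone number is a made-up sample.

```python
import re

phone = "2004-959-559 # a made-up phone number"

# Remove the trailing comment: everything from "#" to the end of the string.
number = re.sub(r"\s*#.*$", "", phone)                   # '2004-959-559'

# Remove every non-digit character.
digits_only = re.sub(r"\D", "", number)                  # '2004959559'

# count=1 limits the replacement to the first match only.
first_dash_removed = re.sub(r"-", "", number, count=1)   # '2004959-559'
```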

The findall() function

findall() finds every piece of text that matches the pattern and returns the results as a list. For example:

import re
s = "asdasd12asdvsd9vvd"
f = re.findall(r"\d+", s)
print(f)      # ['12', '9']
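
What findall() puts in the list depends on the capturing groups in the pattern, which matters for the scraping examples later on. A small sketch, using a made-up HTML fragment:

```python
import re

html = '<a href="/page/1">First</a> <a href="/page/2">Second</a>'

# One capturing group: the list holds only the captured text.
links = re.findall(r'href="(.*?)"', html)
# ['/page/1', '/page/2']

# Two capturing groups: the list holds one tuple per match.
pairs = re.findall(r'href="(.*?)">(.*?)</a>', html)
# [('/page/1', 'First'), ('/page/2', 'Second')]

# No match at all gives an empty list, never None.
nothing = re.findall(r"\d{5}", html)
# []
```

The lazy `(.*?)` stops at the first place where the rest of the pattern can match, which is why each link is captured separately instead of one match swallowing the whole string.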

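re module flags

The re module includes optional flag modifiers that control how matching works, such as re.I (ignore case), re.M (multi-line matching) and re.S (let "." match newlines as well).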
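
The effect of the three most commonly used flags can be seen in a short sketch. The sample text is made up for illustration.

```python
import re

text = "Hello\nhello world"

# re.I: ignore case.
case_insensitive = re.findall(r"hello", text, re.I)    # ['Hello', 'hello']

# re.M: ^ and $ match at the start and end of every line,
# not only at the start and end of the whole string.
line_starts = re.findall(r"^\w+", text, re.M)          # ['Hello', 'hello']

# re.S: "." normally stops at a newline; with re.S it matches it too.
without_s = re.findall(r"Hello.hello", text)           # []
with_s = re.findall(r"Hello.hello", text, re.S)        # ['Hello\nhello']

# Flags can be combined with the | operator.
combined = re.findall(r"^hello", text, re.I | re.M)    # ['Hello', 'hello']
```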

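Comprehensive example 1: scraping the full text of the novel 斗破苍穹

Page address: http://www.doupoxs.com/doupocangqiong/ Looking at the URLs of the first few chapters, listed below, only the first two break the pattern. From the second chapter on, only the number at the end of the URL changes.

Chapter 1: http://www.doupoxs.com/doupocangqiong/2.html

Chapter 2: http://www.doupoxs.com/doupocangqiong/5.html

Chapter 3: http://www.doupoxs.com/doupocangqiong/6.html

Chapter 4: http://www.doupoxs.com/doupocangqiong/7.html

The code is as follows: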

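# -*- coding: utf-8 -*-
import re
import requests
import time
# Scrape the full text of the novel 斗破苍穹 and save it to a txt file

# Request headers
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
# Open the txt file; "a+" means append, and utf-8 keeps the Chinese text intact
f = open(r"E:\Python_Data\python_reptile\xiaoshuo.txt", "a+", encoding="utf-8")

# Function that fetches one chapter
def get_info(url):
    res = requests.get(url, headers=headers)
    # Check that the status code is 200
    if res.status_code == 200:
        # Let requests guess the encoding in case the server does not declare one
        res.encoding = res.apparent_encoding
        # re.S lets "." match every character including newlines, so paragraphs
        # that span several lines are captured; findall returns a list of all
        # paragraphs in the chapter
        contents = re.findall(r"<p>(.*?)</p>", res.text, re.S)
        for content in contents:
            f.write(content + "\n")
    else:
        # A status code other than 200 means no data was returned
        # (for example, page numbers 3 and 4 are not chapters), so skip it
        pass

# Program entry point
if __name__ == "__main__":
    # Build the URLs
    urls = ['http://www.doupoxs.com/doupocangqiong/{}.html'.format(i) for i in range(2, 1665)]
    for url in urls:
        get_info(url)
        # Sleep for 1 second between requests; optional here, since this
        # small site has no anti-scraping measures
        time.sleep(1)
    # Close the txt file
    f.close()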

The crawl result is as follows: [Figure: crawl result]
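
The paragraph-matching step in the script above can be tried offline, without requesting the site. The sketch below uses a made-up HTML fragment standing in for one chapter page (the real page markup may differ) and shows why the re.S flag matters.

```python
import re

# Hypothetical sample standing in for one chapter page.
sample_html = """
<div class="content">
<p>First paragraph,
split over two lines.</p>
<p>Second paragraph.</p>
</div>
"""

# With re.S, "." also matches the newline inside the first paragraph.
with_s = re.findall(r"<p>(.*?)</p>", sample_html, re.S)
# ['First paragraph,\nsplit over two lines.', 'Second paragraph.']

# Without re.S, the first paragraph is missed because it contains a newline.
without_s = re.findall(r"<p>(.*?)</p>", sample_html)
# ['Second paragraph.']
```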

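Comprehensive example 2: scraping text posts from 糗事百科

The text section has 13 pages in total. Looking at the URLs, only the number at the end changes, so the URLs to crawl can be built from that pattern. The code is as follows: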

# -*- coding: utf-8 -*-
# Scrape text posts from 糗事百科: id, level, gender, content and vote count
import re
import requests
# Request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}

# Work out the user's gender from the CSS class name
def judment_sex(class_name):
    if class_name == "womenIcon":
        return "female"
    else:
        return "male"

# Fetch the details of every post on one page
def get_info(url):
    res = requests.get(url, headers=headers)
    # Get the id; re.S lets "." match any character, including newlines
    ids = re.findall(r"<h2>(.*?)</h2>", res.text, re.S)
    # Get the level. The page quotes attributes with double quotes, so a pattern
    # written with single quotes inside, such as
    # "<div class='articleGender \D+Icon'>(.*?)</div>", matches nothing
    levels = re.findall(r'<div class="articleGender \D+Icon">(.*?)</div>', res.text, re.S)
    # Get the gender (as a class name such as "womenIcon")
    sexs = re.findall(r'<div class="articleGender (.*?)">', res.text, re.S)
    # Get the content
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    # Get the vote count
    laughs = re.findall(r'<span class="stats-vote"><i class="number">(\d+)</i>', res.text, re.S)
    for id, level, sex, content, laugh in zip(ids, levels, sexs, contents, laughs):
        print("id:", id.strip())
        print("level:", level)
        print("gender:", judment_sex(sex))
        print("content:", content.strip())
        print("votes:", laugh, "\n")

# Program entry point
if __name__ == "__main__":
    # Build the URLs; range(2, 13) covers pages 2 through 12,
    # so widen it to range(1, 14) to cover all 13 pages
    urls = ["http://www.qiushibaike.com/text/page/{}/".format(i) for i in range(2, 13)]
    for url in urls:
        get_info(url)

Part of the output is shown below: [Figure: partial output]
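
One thing to watch for when each field is collected with its own findall() call and the lists are then paired up: if a post lacks one field, the lists have different lengths and the pairing silently drops or misaligns posts. The sketch below demonstrates this on made-up markup that only loosely follows the patterns above (the wrapper `<div class="article">` is an assumption, not taken from the real page), and shows one possible alternative that matches field by field inside each post.

```python
import re

# Hypothetical markup; the second post has no level/gender block.
sample_html = '''
<div class="article">
<h2> alice </h2>
<div class="articleGender womenIcon">21</div>
<div class="content"><span> first joke </span></div>
</div>
<div class="article">
<h2> anonymous </h2>
<div class="content"><span> second joke </span></div>
</div>
'''

level_pattern = r'<div class="articleGender \D+Icon">(.*?)</div>'

names = re.findall(r"<h2>(.*?)</h2>", sample_html, re.S)
levels = re.findall(level_pattern, sample_html, re.S)

# zip stops at the shortest list, so the second post is dropped.
paired = list(zip(names, levels))

# Alternative: split the page into posts first, then search inside each post.
records = []
for block in sample_html.split('<div class="article">')[1:]:
    name = re.search(r"<h2>(.*?)</h2>", block, re.S)
    level = re.search(level_pattern, block, re.S)
    records.append((
        name.group(1).strip() if name else None,
        level.group(1) if level else None,
    ))
```

Here `paired` contains a single entry while `records` keeps both posts, with None marking the missing level.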