4 - Regular Expressions in Practice: Scraping Qiushibaike

License: CC 4.0 BY-SA

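This article uses Python 3 and regular expressions to scrape the text jokes from the Qiushibaike website, as a hands-on exercise to reinforce regular expression techniques.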

[Python3 Web Scraping] 4 - Regular Expressions in Practice: Scraping Qiushibaike

Hands-on practice with regular expressions

import requests
from fake_useragent import UserAgent
import re

url = 'https://www.qiushibaike.com/text/page/{}/'
headers = {
    'User-Agent': UserAgent().chrome
}


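def get_data(page):
    # Download one listing page and append every joke on it to duanzi.txt.
    print("Crawling page {}".format(page))
    response = requests.get(url.format(page), headers=headers)
    html = response.text
    # Non-greedy group plus re.S, so a joke spanning several lines is
    # captured whole and the match stops at the first closing </span>.
    jokes = re.findall(r'<div class="content">\s*<span>\s*(.+?)\s*</span>', html, re.S)
    with open('duanzi.txt', 'a+', encoding='utf-8') as f:
        for joke in jokes:
            # Remove whitespace characters (spaces, tabs, newlines) from the joke text.
            joke = re.sub(r'\s', '', joke)
            f.write(joke + "\n\n")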


for page in range(1, 14):
    get_data(page)
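
To see what the pattern captures without sending any requests, it can be run against a small inline HTML string. The following is a minimal sketch: `SAMPLE_HTML` is a made-up fragment that only imitates the `<div class="content"><span>…</span>` structure the pattern assumes, and `extract_jokes` is a hypothetical helper that is not part of the script above. It also converts `<br/>` tags to newlines, which is one possible way to clean the captured text.

```python
import re

# Hypothetical sample that imitates the markup the pattern expects; not real page data.
SAMPLE_HTML = """
<div class="content">
<span>


First line of a joke.<br/>Second line.

</span>
</div>
<div class="content">
<span>
Another short joke.
</span>
</div>
"""

# re.S lets "." match newlines; "+?" stops at the first closing </span>.
CONTENT_RE = re.compile(r'<div class="content">\s*<span>\s*(.+?)\s*</span>', re.S)


def extract_jokes(html):
    """Return the cleaned text of every content block found in html."""
    jokes = []
    for raw in CONTENT_RE.findall(html):
        text = raw.replace("<br/>", "\n")    # turn HTML line breaks into newlines
        text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces and tabs
        jokes.append(text.strip())
    return jokes


print(extract_jokes(SAMPLE_HTML))
```

Testing the pattern on a fixed string like this makes it easier to adjust the regular expression before pointing it at live pages, whose markup may differ from the assumed structure.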