Python Web Scraping in Practice: Six Classic Cases with Analysis and Working Code

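Introduction

Python is widely used for web scraping and data processing because of its flexibility and its rich library ecosystem. This article walks through six common Python scraping cases that together cover fetching data, processing it, and visualizing the results. The code is abbreviated in places so that each case can focus on its core logic; the aim is for readers to pick up the basic methods and techniques of web scraping with Python.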

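Case 1: Scraping the Douban Movie Top 250 and Saving It to an Excel File

Goal: scrape the movie information from the Douban Top 250 list and save it to an Excel file.

Steps:

Send HTTP requests with urllib to fetch the HTML pages.
Parse the HTML with BeautifulSoup and locate each movie entry.
Use regular expressions to extract the required fields (movie link, image link, title, rating, and so on).
Save the extracted information to an Excel file with the xlwt library.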
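
The list is spread over several pages, so the first step has to be repeated once per page. A minimal sketch of generating the page URLs follows. It assumes the list shows 25 movies per page and is paged through the start query parameter used in the example URL of this case; build_page_urls is a hypothetical helper, not part of the original code.

```python
def build_page_urls(base="https://movie.douban.com/top250?start=", total=250, page_size=25):
    """Return one URL per result page (page_size is an assumption about the site)."""
    return [f"{base}{offset}" for offset in range(0, total, page_size)]


if __name__ == "__main__":
    for url in build_page_urls():
        print(url)
```

Each generated URL can then be fetched and parsed in turn, ideally with a short pause between requests.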

Core code:

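import re  # used by the field-matching code that is omitted below
import urllib.error
import urllib.request

import xlwt
from bs4 import BeautifulSoup


def get_url(url):
    """Fetch a page and return its HTML as text (an empty string on failure)."""
    # Send a browser-like User-Agent; some sites reject urllib's default one
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as e:
        print(f"Request failed: {e}")
        return ""


def get_data(baseurl):
    datalist = []
    # The loop over multiple pages is omitted; only single-page parsing is shown
    html = get_url(baseurl)
    soup = BeautifulSoup(html, "html.parser")
    for item in soup.find_all("div", class_="item"):
        data = []  # holds all fields of one movie
        # Detailed parsing and regular-expression matching omitted
        # ...
        datalist.append(data)
    return datalist


def save_data(datalist, savepath):
    workbook = xlwt.Workbook(encoding="utf-8", style_compression=0)
    worksheet = workbook.add_sheet("Douban Top 250", cell_overwrite_ok=True)
    # Write one movie per row and one field per column
    for row, data in enumerate(datalist):
        for col, value in enumerate(data):
            worksheet.write(row, col, value)
    workbook.save(savepath)


# Main entry point (the full paging loop and error handling are omitted)
if __name__ == "__main__":
    baseurl = "https://movie.douban.com/top250?start=0"  # example URL
    datalist = get_data(baseurl)
    savepath = "./douban_top250.xls"
    save_data(datalist, savepath)
    print("Scraping finished.")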
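
The parsing that is elided inside get_data could be filled in along the following lines. This is only a sketch: the sample HTML fragment and the patterns are assumptions made for illustration and are not the real markup of the Douban page, which has to be inspected before writing the actual patterns. parse_item is a hypothetical helper.

```python
import re

# Hypothetical markup for one movie entry; the real page may differ.
SAMPLE_ITEM = '''
<div class="item">
  <a href="https://example.com/subject/1/">
    <img src="https://example.com/poster1.jpg" alt="Example Movie">
  </a>
  <span class="title">Example Movie</span>
  <span class="rating_num">9.5</span>
</div>
'''

# Precompiled patterns, one per field (assumed structure).
FIND_LINK = re.compile(r'<a href="(.*?)">')
FIND_IMG = re.compile(r'<img.*?src="(.*?)"', re.S)
FIND_TITLE = re.compile(r'<span class="title">(.*?)</span>')
FIND_RATING = re.compile(r'<span class="rating_num">(.*?)</span>')


def parse_item(item_html):
    """Extract [link, image, title, rating] from one entry; missing fields become ''."""
    data = []
    for pattern in (FIND_LINK, FIND_IMG, FIND_TITLE, FIND_RATING):
        match = pattern.search(item_html)
        data.append(match.group(1) if match else "")
    return data


if __name__ == "__main__":
    print(parse_item(SAMPLE_ITEM))
```

Inside the loop of get_data, str(item) could be passed to such a helper, because a BeautifulSoup tag converts to its HTML source when turned into a string.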

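Case 2: Scraping the Baidu Hot Search Top 50 and Visualizing It

Goal: scrape the Baidu hot search Top 50, save the data to an Excel file, and visualize it with matplotlib.

Steps:

Send an HTTP request with requests to fetch the HTML page.
Parse the HTML with BeautifulSoup and extract the hot search keywords.
Save the keywords to an Excel file with openpyxl.
Draw a bar chart of the keywords with matplotlib.

Core code: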

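import requests
from bs4 import BeautifulSoup
import openpyxl
import matplotlib.pyplot as plt

def get_hot_searches():
    url = 'https://top.baidu.com/board?tab=realtime'
    # A browser-like User-Agent and a timeout make the request more robust
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    # The class name depends on the current page layout and may change
    hot_searches = [item.text.strip() for item in soup.find_all('div', {'class': 'c-single-text-ellipsis'})]
    return hot_searches

def save_to_excel(hot_searches, savepath):
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    sheet.title = 'Baidu Hot Searches'
    sheet.cell(row=1, column=1, value='Keyword')  # header row
    for i, search in enumerate(hot_searches, start=2):
        sheet.cell(row=i, column=1, value=search)
    workbook.save(savepath)

def visualize_hot_searches(hot_searches):
    # The keywords are Chinese, so matplotlib needs a font with CJK glyphs;
    # SimHei is only an example and must be installed on the system
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.figure(figsize=(15, 10))
    positions = range(len(hot_searches))
    # Bar length encodes the rank: the top entry gets the longest bar
    scores = list(reversed(range(1, len(hot_searches) + 1)))
    plt.barh(positions, scores, tick_label=hot_searches, height=0.8)
    plt.gca().invert_yaxis()  # show the top-ranked keyword first
    plt.title('Baidu Hot Search Ranking')
    plt.xlabel('Rank score (higher means hotter)')
    plt.ylabel('Keyword')
    plt.show()

# Main entry point (full error handling is omitted)
if __name__ == '__main__':
    hot_searches = get_hot_searches()
    save_to_excel(hot_searches, 'baidu_hot_searches.xlsx')
    visualize_hot_searches(hot_searches)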
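
The goal mentions a Top 50, but the list returned by get_hot_searches is used as is. A small post-processing step, sketched below under the assumption that the scraped list may contain blank or repeated entries, trims it to at most 50 unique keywords while keeping the original order. clean_hot_searches is a hypothetical helper name.

```python
def clean_hot_searches(raw_items, limit=50):
    """Strip whitespace, drop empty and duplicate entries, keep at most `limit` items."""
    seen = set()
    cleaned = []
    for item in raw_items:
        if len(cleaned) >= limit:
            break
        keyword = item.strip()
        if not keyword or keyword in seen:
            continue
        seen.add(keyword)
        cleaned.append(keyword)
    return cleaned


if __name__ == "__main__":
    sample = [" topic A ", "topic B", "", "topic A", "topic C"]
    print(clean_hot_searches(sample, limit=2))
```

Calling it between get_hot_searches and save_to_excel keeps the spreadsheet and the chart consistent with each other.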

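Case 3: Scraping Images from Douyu Live and Saving Them to a Local Directory

Goal: scrape the images of a given live room on the Douyu streaming platform and save them to a local directory.

Steps:

Send an HTTP GET request for the live room data.
Parse the JSON response (assuming the response is JSON) and extract the list of image URLs.
Iterate over the image URLs and download each image with requests.
Save the downloaded images to the chosen local directory.

Core code: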

import os

import requests


def download_images(room_id, save_dir):
    # Hypothetical endpoint: assumes an API that returns the image list of a room
    api_url = f"https://api.douyu.com/room/{room_id}/images"
    response = requests.get(api_url, timeout=10)
    if response.status_code != 200:
        print("Failed to retrieve image URLs")
        return
    # Assumes the JSON response contains an "image_urls" field
    image_urls = response.json()['image_urls']
    os.makedirs(save_dir, exist_ok=True)
    for i, url in enumerate(image_urls):
        image_response = requests.get(url, timeout=10)
        if image_response.status_code != 200:
            print(f"Skipping {url}: HTTP {image_response.status_code}")
            continue
        with open(os.path.join(save_dir, f"image_{i}.jpg"), 'wb') as f:
            f.write(image_response.content)


# Main entry point
if __name__ == '__main__':
    room_id = '123456'  # example live room ID
    save_dir = './douyu_images'
    download_images(room_id, save_dir)
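
The code above names every file image_{i}.jpg regardless of the actual image format. A possible refinement, shown as a sketch with the hypothetical helper filename_from_url, derives the extension from the URL path and falls back to .jpg when the URL has none.

```python
import os
from urllib.parse import urlparse


def filename_from_url(url, index, default_ext=".jpg"):
    """Build a local file name such as image_3.png from an image URL."""
    path = urlparse(url).path  # the path excludes the query string and fragment
    ext = os.path.splitext(path)[1].lower()
    return f"image_{index}{ext or default_ext}"


if __name__ == "__main__":
    print(filename_from_url("https://example.com/a/b/cover.PNG?size=large", 0))
    print(filename_from_url("https://example.com/stream/12345", 1))
```

The extension in a URL is only a hint; checking the Content-Type response header would be more reliable when it is available.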

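Case 4: Scraping the Sina Weibo Hot Search List and Sending an Email Notification

Goal: scrape the Sina Weibo hot search list and send it out as an email notification.

Steps:

Send an HTTP GET request for the hot search page.
Parse the HTML or JSON response (depending on the Sina Weibo API or page structure) and extract the hot search keywords.
Configure the mail server connection over SMTP and compose the message.
Send the email containing the hot search keywords.

Note: Sina Weibo employs anti-scraping measures, so in practice you may need to send more elaborate HTTP request headers or use its official API, if one is available.

Core code (assuming the mail is sent over SMTP):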

import smtplib
from email.mime.text import MIMEText
from email.header import Header

# Assume the hot search list hot_searches has already been obtained in some way
hot_searches = ["Hot search 1", "Hot search 2", "Hot search 3"]

def send_email(sender, password, receiver, subject, content):
    message = MIMEText(content, 'plain', 'utf-8')
    # The addresses are plain ASCII and are set directly; only the subject needs encoding
    message['From'] = sender
    message['To'] = receiver
    message['Subject'] = Header(subject, 'utf-8')

    try:
        # Connect over SSL; the with block closes the connection afterwards
        with smtplib.SMTP_SSL('smtp.example.com', 465) as smtp_obj:
            smtp_obj.login(sender, password)
            smtp_obj.sendmail(sender, [receiver], message.as_string())
        print("Email sent successfully")
    except (smtplib.SMTPException, OSError) as e:
        print(f"Error: unable to send email ({e})")

# Compose the email content
subject = 'Sina Weibo Hot Search Notification'
content = 'Current Sina Weibo hot searches:\n' + '\n'.join(hot_searches)

# Main entry point (the email account and password are placeholders)
if __name__ == '__main__':
    sender = 'your_email@example.com'
    password = 'your_password'
    receiver = 'receiver_email@example.com'
    send_email(sender, password, receiver, subject, content)
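
The mail body built above is a plain list of keywords. A slightly more readable variant numbers each entry so that the ranking is visible in the mail. build_content is a hypothetical helper, and the default of ten entries is an arbitrary choice.

```python
def build_content(hot_searches, top_n=10):
    """Return the mail body with one numbered hot search per line."""
    lines = [
        f"{rank}. {keyword}"
        for rank, keyword in enumerate(hot_searches[:top_n], start=1)
    ]
    return "Current Sina Weibo hot searches:\n" + "\n".join(lines)


if __name__ == "__main__":
    print(build_content(["Hot search 1", "Hot search 2", "Hot search 3"]))
```

The returned string can be passed to send_email as the content argument.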

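Case 5: Scraping Trending Zhihu Questions and Analyzing the Text

Goal: scrape trending questions from Zhihu, segment the Chinese text with jieba, count word frequencies, and visualize the result.

Steps (the scraping part is omitted; only text processing and visualization are shown):

Assume the text of the trending questions has already been collected by a scraper.
Segment the Chinese text with jieba.
Count the word frequencies.
Visualize the word frequencies with matplotlib or pandas.

Core code (text processing and visualization):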

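import jieba
from collections import Counter
import matplotlib.pyplot as plt

# Assume texts is a list of strings, each holding the text of one trending question
# (the entries below are Chinese placeholder strings)
texts = ["问题文本1", "问题文本2", "问题文本3"]

# Chinese word segmentation
words = []
for text in texts:
    words.extend(jieba.cut(text))

# Count word frequencies
word_counts = Counter(words)

# Visualize only the 20 most frequent words
top_words = word_counts.most_common(20)
labels = [word for word, _ in top_words]
counts = [count for _, count in top_words]

# plt.bar needs the labels and the heights as two separate sequences
plt.rcParams['font.sans-serif'] = ['SimHei']  # example CJK font; must be installed
plt.figure(figsize=(10, 8))
plt.bar(labels, counts, color='skyblue')
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.title('Word Frequencies in Trending Zhihu Questions')
plt.xticks(rotation=45)  # rotate the x-axis labels to avoid overlap
plt.tight_layout()
plt.show()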
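
Raw segmentation output usually contains punctuation and single-character function words that can dominate the counts. The sketch below filters an already segmented token list before counting, so it runs without jieba. The stop-word set is a tiny illustrative sample rather than a real stop-word list, and count_meaningful_words is a hypothetical helper.

```python
from collections import Counter

# Tiny illustrative stop-word sample; a real project would load a full list.
STOP_WORDS = {"的", "了", "是", "吗", "什么"}


def count_meaningful_words(tokens, stop_words=STOP_WORDS, min_length=2):
    """Count tokens that are long enough, are not stop words, and consist only of letters or digits."""
    kept = []
    for token in tokens:
        word = token.strip()
        if len(word) < min_length or word in stop_words or not word.isalnum():
            continue
        kept.append(word)
    return Counter(kept)


if __name__ == "__main__":
    tokens = ["如何", "学习", "Python", "？", "的", "学习", "什么", " "]
    print(count_meaningful_words(tokens).most_common(3))
```

The resulting Counter can replace word_counts in the plotting code, since most_common works the same way on it.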

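Case 6: Scraping Taobao Product Information and Monitoring Prices

Goal: scrape the information on a given Taobao product page, repeat the scrape periodically, compare the prices, and send a notification when the price changes.

Steps:

Send an HTTP GET request for the product page.
Parse the HTML or JSON response and extract the product price and any other required information.
Compare the current price with the previously recorded price.
If the price has changed, send a notification (for example by email or text message).

Note: Taobao has sophisticated anti-scraping measures, so in practice you may need to deal with login, cookies, and proxy IPs, or use its official API, if one is available.

Core code (the scraping and the notification delivery are omitted):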

# Assume previous_price is the price recorded by the previous crawl
previous_price = 100.0

# Assume current_price is the price from the current crawl (example value only)
current_price = 120.0

def send_price_change_notification(price_change):
    # Actual notification logic, such as sending an email or a text message
    pass

# Check for a price change and send a notification
if current_price != previous_price:
    price_change_info = f"Price changed from {previous_price} to {current_price}"
    send_price_change_notification(price_change_info)
else:
    print("Price unchanged")
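
For periodic monitoring, the previous price has to survive between runs. The sketch below stores it in a small JSON file and reports a change only when the difference exceeds a tolerance, which avoids false alarms caused by floating-point noise. The file name, the tolerance value, and the helper check_price are all assumptions made for illustration; fetching the price and delivering the notification remain outside the sketch.

```python
import json
import os
import tempfile


def check_price(current_price, state_path="last_price.json", tolerance=0.005):
    """Compare current_price with the stored price; return a message if it changed, else None."""
    previous_price = None
    if os.path.exists(state_path):
        with open(state_path, "r", encoding="utf-8") as f:
            previous_price = json.load(f).get("price")

    if previous_price is not None and abs(current_price - previous_price) <= tolerance:
        return None  # no meaningful change; keep the stored price

    # First run or a real change: store the new price for the next run
    with open(state_path, "w", encoding="utf-8") as f:
        json.dump({"price": current_price}, f)

    if previous_price is None:
        return None  # first run, nothing to compare against
    return f"Price changed from {previous_price} to {current_price}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "last_price.json")
        print(check_price(100.0, path))  # no stored price yet
        print(check_price(120.0, path))  # differs from the stored 100.0
```

A scheduler such as cron, or a simple loop with time.sleep, could call this function at a fixed interval and pass any returned message to send_price_change_notification.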