Daily Practice: Handling Page Redirects in a Python Crawler, Using Weibo as an Example

Article link: https://blog.youkuaiyun.com/qq_59078658/article/details/145970735

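1. Project Background and Core Requirements

By reverse-engineering the Weibo hot-search API, this project scrapes trending-topic data in real time. It focuses on three problems:

Missing parameters in the links that jump to each topic page
Cleaning and normalizing the page data
Collecting and storing data along several dimensions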

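2. How the Redirect-Following Crawler Works

2.1 How the Jump Link Is Generated

Raw hot-search term → "雷军刚知道柯洁定了SU7Ultra"
Processing steps:
1. Wrap the term in topic markers → #雷军刚知道柯洁定了SU7Ultra#
2. URL-encode it → %23雷军刚知道柯洁定了SU7Ultra%23 (quote percent-encodes the Chinese characters as well; they are shown unencoded here for readability)
3. Append the search parameter → &t=31
Resulting link:
https://s.weibo.com/weibo?q=%23雷军...%23&t=31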
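
The steps above can be sketched with the standard library alone. This is a minimal illustration rather than the article's exact code: the helper name `build_topic_url` is made up for this example, and `safe=""` is an added assumption so that a `/` inside a keyword is encoded too (the full code calls `quote` with its defaults).

```python
from urllib.parse import quote, unquote

SEARCH_BASE = "https://s.weibo.com/weibo"  # search endpoint used in this article


def build_topic_url(keyword: str, t: int = 31) -> str:
    """Wrap the keyword in # markers, percent-encode it, and append t."""
    topic = f"#{keyword}#"
    encoded = quote(topic, safe="")  # '#' becomes %23, Chinese becomes %XX bytes
    return f"{SEARCH_BASE}?q={encoded}&t={t}"


if __name__ == "__main__":
    url = build_topic_url("雷军刚知道柯洁定了SU7Ultra")
    print(url)
    # Decoding the q value gives back the original topic string
    print(unquote(url.split("q=")[1].split("&")[0]))
```

Keeping the link construction in one small function makes it easy to check the encoding without sending any request.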

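2.2 Key Techniques

quote encoding: percent-encodes Chinese characters and special symbols
Parameter completion: t=31 selects the comprehensive sort mode
Anti-scraping countermeasures: random delays plus a complete set of request headers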

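3. Core Python Packages

Package	Purpose	Version requirement
requests	HTTP request library	≥2.24.0
BeautifulSoup	HTML parsing library	≥4.9.3
pandas	Data storage and export	≥1.1.3
urllib.parse	URL encoding	built in
random/time	Delay control against anti-scraping	built in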

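Install command (the full code later in this article also relies on lxml as the BeautifulSoup parser and on openpyxl so that pandas can write .xlsx files):

pip install requests beautifulsoup4 pandas lxml openpyxl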

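4. Walkthrough of the Core Code

4.1 Main Flow

graph TD
    A[Fetch hot-list API data] --> B[Iterate over hot-search entries]
    B --> C{Build jump link}
    C -->|success| D[Request detail page]
    D --> E[Parse read count / discussion count]
    D --> F[Extract lead text]
    E --> G[Clean data]
    F --> G
    G --> H[Save to Excel]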

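4.2 Key Modules

(1) Request header configuration

headers = {
    'cookie': 'replace with a valid cookie',  # the core credential for authentication
    'user-agent': 'Mozilla/5.0...',  # mimics a browser environment
    'referer': 'https://weibo.com/'  # required to get past anti-scraping checks
}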
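
Hard-coding a cookie makes it easy to leak when the script is shared. One possible alternative, sketched below, reads it from an environment variable. The variable names `WEIBO_COOKIE` and `WEIBO_UA`, the helper `build_headers`, and the fallback user-agent string are all assumptions made for this example; none of them appear in the original code.

```python
import os

# Hypothetical fallback; replace it with the user-agent of your own browser
DEFAULT_UA = "Mozilla/5.0"


def build_headers(env=None):
    """Build the request headers, taking the cookie from the environment."""
    env = os.environ if env is None else env
    cookie = env.get("WEIBO_COOKIE", "")  # assumed variable name
    if not cookie:
        raise ValueError("WEIBO_COOKIE is empty; copy a valid cookie from the browser first")
    return {
        "cookie": cookie,
        "user-agent": env.get("WEIBO_UA", DEFAULT_UA),  # assumed variable name
        "referer": "https://weibo.com/",
    }
```

Failing early when the cookie is missing gives a clearer error than a later parsing failure on a login page.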

(2) Data cleaning

# Clean the read count (drop the 3-character "阅读量" prefix)
detail_data['阅读量'] = spans[0].text.strip()[3:]

# Clean the lead text (take the first <p> tag directly)
detail_data['导语'] = intro_p[0].text.strip()
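
Slicing with `[3:]` silently removes three real characters whenever the expected prefix is absent. A slightly more defensive variant, shown here as a sketch, removes the label only when the text actually starts with it. The helper name `strip_label` and the sample strings are made up for illustration; the real page text may differ.

```python
def strip_label(text: str, label: str) -> str:
    """Remove a leading label (and an optional colon after it) if present."""
    cleaned = text.strip()
    if cleaned.startswith(label):
        cleaned = cleaned[len(label):]
        cleaned = cleaned.lstrip(":： ")  # ASCII colon, full-width colon, space
    return cleaned


if __name__ == "__main__":
    print(strip_label(" 阅读量1.2亿 ", "阅读量"))  # hypothetical sample text
    print(strip_label("1.2亿", "阅读量"))          # no label: text is left unchanged
```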

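(3) Exception handling

from json import JSONDecodeError

try:
    ...  # core business logic goes here
except JSONDecodeError:
    ...  # the API response was not valid JSON
except IndexError:
    ...  # an expected page element was missing
except Exception as e:
    ...  # catch-all for any other error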

5. Full Code

import json
import time
import random
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import quote

# Shared request headers (fill these in yourself)
headers = {
    'cookie': '',  # copy a valid cookie from the browser developer tools
    'user-agent': '',  # copy the user-agent from the browser developer tools
    'referer': 'https://weibo.com/'
}


def get_hot_detail(keyword):
    """Fetch the detail data for a single hot-search term."""
    # Keys double as Excel column names:
    # 阅读量 = read count, 讨论量 = discussion count, 导语 = lead text
    detail_data = {
        '阅读量': 'N/A',
        '讨论量': 'N/A',
        '导语': 'N/A'
    }

    try:
        # Build the percent-encoded search URL
        encoded_word = quote(f"#{keyword}#")  # wrap the keyword in # markers
        detail_url = f"https://s.weibo.com/weibo?q={encoded_word}&t=31"
        print('Jump link:', detail_url)

        # Random delay to reduce the risk of being blocked
        time.sleep(random.uniform(1, 3))

        # Request the detail page (the timeout keeps a stalled connection from hanging the run)
        response = requests.get(detail_url, headers=headers, timeout=10)
        if response.status_code != 200:
            return detail_data

        soup = BeautifulSoup(response.text, 'lxml')

        # Extract the read count and the discussion count
        total_div = soup.find('div', attrs={'class': 'total'})

        if total_div:
            spans = total_div.find_all('span')
            # Index into the spans, guarding against a changed page structure
            try:
                detail_data['阅读量'] = spans[0].text.strip()[3:]  # first span: read count
                detail_data['讨论量'] = spans[1].text.strip()[3:]  # second span: discussion count
            except IndexError:
                print("Unexpected page structure: read/discussion counts not found")

        # Extract the lead text; the page markup is <div class="card card-topic-lead s-pg16">
        intro_div = soup.find('div', attrs={'class': 'card card-topic-lead s-pg16'})
        if intro_div:
            intro_p = intro_div.find_all('p')
            if intro_p:  # this check already rules out an IndexError
                detail_data['导语'] = intro_p[0].text.strip()[3:]  # drop the 3-character prefix

    except Exception as e:
        print(f"Failed to fetch detail page data: {e}")

    return detail_data


# Main crawler logic
try:
    # Fetch the hot-search list
    response = requests.get(
        url="https://weibo.com/ajax/side/hotSearch",
        headers=headers,
        timeout=10
    )
    response.raise_for_status()

    data = json.loads(response.text)
    hot_list = []

    # Process each hot-search entry
    for idx, item in enumerate(data['data']['realtime'], 1):
        # Basic fields; keys double as Excel column names:
        # 排名 = rank, 标题 = title, 热度值 = heat value, 标签 = label
        hot_item = {
            '排名': idx,
            '标题': item.get('word', '无'),
            '热度值': item.get('num', 0),
            '标签': item.get('label_name', '无')
        }

        # Fetch the detail data
        detail = get_hot_detail(item['word'])
        hot_item.update(detail)

        hot_list.append(hot_item)
        print(f"Processed hot search #{idx}: {item['word']}")

    # Save the data
    df = pd.DataFrame(hot_list)
    df.to_excel('微博热榜-完整数据.xlsx', index=False)
    print("Data saved to 微博热榜-完整数据.xlsx")

except requests.exceptions.RequestException as e:
    print(f"Network request error: {e}")
except json.JSONDecodeError:
    print("JSON parsing failed, response content:", response.text[:200])
except KeyError as e:
    print(f"Required field missing: {e}, raw data: {data}")
except Exception as e:
    print(f"Unhandled exception: {e}")
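
The main loop assumes the response holds a list at `data['data']['realtime']` whose entries carry `word`, `num` and `label_name` fields. The sketch below isolates that parsing step so it can be tried offline. The function name `parse_hot_list` and the sample payload are invented for this example, and skipping entries that have no `word` is a design choice of this sketch rather than the behavior of the original loop.

```python
import json


def parse_hot_list(raw_text: str) -> list:
    """Turn the hot-search JSON text into a list of row dictionaries."""
    payload = json.loads(raw_text)
    entries = payload.get("data", {}).get("realtime", [])
    rows = []
    for item in entries:
        word = item.get("word")
        if not word:
            continue  # nothing to search for, so skip this entry
        rows.append({
            "排名": len(rows) + 1,                 # rank among the kept entries
            "标题": word,                          # title
            "热度值": item.get("num", 0),           # heat value
            "标签": item.get("label_name", "无"),   # label
        })
    return rows


if __name__ == "__main__":
    # Made-up payload that mirrors the fields read by the main loop
    sample = json.dumps({"data": {"realtime": [
        {"word": "topic A", "num": 123, "label_name": "热"},
        {"num": 5},
        {"word": "topic B"},
    ]}})
    for row in parse_hot_list(sample):
        print(row)
```

Separating parsing from fetching also means one malformed entry no longer aborts the whole run through the outer `KeyError` handler.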

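6. Crawler Etiquette and Caveats

6.1 Legal and Compliance Requirements

Strictly follow the Weibo platform's robots.txt
Keep the request rate at or below 15 requests per minute per IP
Do not bypass anti-scraping mechanisms to obtain non-public data

6.2 Data Usage Rules

Store the data for no more than 24 hours
Remove sensitive user information (such as user IDs)
When displaying the data, credit the source: data from Weibo's public API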
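
A limit of 15 requests per minute works out to at least 60 / 15 = 4 seconds between requests, which is longer than the 1 to 3 second random delay used in the full code. A minimal way to enforce such a gap is sketched below. The class name `MinIntervalThrottle` is made up, and the clock and sleep functions are injectable only so the logic can be checked without actually waiting.

```python
import time


class MinIntervalThrottle:
    """Block so that consecutive calls are at least `interval` seconds apart."""

    def __init__(self, max_per_minute=15, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Sleep if needed, then record the time of this call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = MinIntervalThrottle(max_per_minute=15)
    for i in range(3):
        throttle.wait()  # call this right before each requests.get(...)
        print("request", i, "allowed at", round(time.monotonic(), 1))
```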

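6.3 Suggestions for Better Robustness

# Add proxy pool support
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080'
}

# Add a retry mechanism
from requests.adapters import HTTPAdapter
session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=3))
session.mount('https://', HTTPAdapter(max_retries=3))  # the URLs in this article are all https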
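
With an integer `max_retries`, `HTTPAdapter` retries only failed DNS lookups, socket connections and connection timeouts. If retries are also wanted around a whole step, a generic wrapper with exponential backoff is one option. This is a different technique from the adapter setting above, and the helper name `retry_call` and its parameters are hypothetical.

```python
import time


def retry_call(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**n and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * (2 ** attempt))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        # Stand-in for a request that fails twice before succeeding
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    print(retry_call(flaky, base_delay=0.1))
```

In the crawler this could wrap a single request, for example `retry_call(lambda: session.get(url, headers=headers, timeout=10))`, where `url` stands for whichever page is being fetched.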