bilibili-api: Troubleshooting Second-Level Comment Pagination

bilibili-api: calls to commonly used Bilibili APIs, covering videos, bangumi, users, channels, audio, and more. Original repository: https://github.com/MoyuScript/bilibili-api Project mirror: https://gitcode.com/gh_mirrors/bi/bilibili-api

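Pain point: why does second-level comment scraping always fall just short?

Have you run into this when fetching video comments with bilibili-api: first-level comments come back without trouble, but second-level comments (replies to a comment) go wrong in various ways? Pagination is inaccurate, data goes missing, or the call simply raises an error. Behind this lie the complex logic of Bilibili's API design and subtle differences between its pagination mechanisms.

This article analyzes the core problems of paginated second-level comment scraping in bilibili-api and offers a complete set of solutions.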

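How second-level comment pagination works

Traditional pagination vs. cursor pagination

Bilibili's comment system uses two different pagination mechanisms: traditional page-number pagination, driven by a `pn` page number, and cursor pagination, driven by a `pagination_str` cursor.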

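Comparison of the core API endpoints

Interface type	API endpoint	Pagination parameter	Maximum pages	Authentication
First-level comments (old)	`/x/v2/reply`	`pn` (page number)	Not logged in: 1 page; logged in: multiple pages	Optional
First-level comments (new)	`/x/v2/reply/wbi/main`	`pagination_str` (cursor)	Unlimited	Required
Second-level comments	`/x/v2/reply/reply`	`pn` (page number) + `ps` (page size)	Fixed pagination	Optional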
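
To make the difference between the two mechanisms concrete, here is a minimal sketch that walks the same data both ways. It makes no network calls: `FAKE_REPLIES`, `fetch_by_page`, `fetch_by_cursor`, and the `next_cursor` field are hypothetical stand-ins invented for illustration, not Bilibili's real response format.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory data standing in for a comment thread.
FAKE_REPLIES = [{"rpid": i} for i in range(1, 46)]


def fetch_by_page(pn: int, ps: int = 20) -> Dict:
    """Page-number style: the caller chooses which page to request."""
    start = (pn - 1) * ps
    return {"replies": FAKE_REPLIES[start:start + ps]}


def fetch_by_cursor(cursor: Optional[int], ps: int = 20) -> Dict:
    """Cursor style: the server says where the next request should continue."""
    start = cursor or 0
    end = start + ps
    next_cursor = end if end < len(FAKE_REPLIES) else None
    return {"replies": FAKE_REPLIES[start:end], "next_cursor": next_cursor}


def crawl_pages() -> List[Dict]:
    """Increment the page number until an empty page comes back."""
    out, pn = [], 1
    while True:
        page = fetch_by_page(pn)
        if not page["replies"]:
            break
        out.extend(page["replies"])
        pn += 1
    return out


def crawl_cursor() -> List[Dict]:
    """Follow the returned cursor until the server stops providing one."""
    out, cursor = [], None
    while True:
        page = fetch_by_cursor(cursor)
        out.extend(page["replies"])
        cursor = page["next_cursor"]
        if cursor is None:
            break
    return out
```

With page numbers the client decides where each request starts, so it also has to decide when to stop. With a cursor the server carries that state, which is why the two loops end on different conditions.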

Core problems with second-level comment pagination

Problem 1: pagination parameter limits

async def get_sub_comments(self, page_index: int = 1, page_size: int = 10) -> dict:
    """
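    Get sub-comments, i.e. the comments under a comment.
    
    Args:
        page_index (int, optional): Page index, starting from 1. Defaults to 1.
        page_size (int, optional): Number of comments per page. Values greater than 20 have no effect. Defaults to 10.
    """

Key limitation: setting page_size above 20 has no effect; the Bilibili API imposes a hard cap of 20 second-level comments per page.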
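
The cap affects how many requests a thread will cost. The sketch below estimates that number. The helper names are hypothetical, and it assumes an oversized `page_size` behaves like 20, whereas the docstring above only says larger values have no effect.

```python
import math

MAX_PAGE_SIZE = 20  # per-page cap described above for second-level comments


def effective_page_size(requested: int) -> int:
    """Clamp a requested page size into the range 1..MAX_PAGE_SIZE."""
    return max(1, min(requested, MAX_PAGE_SIZE))


def pages_needed(total_replies: int, requested_page_size: int = 20) -> int:
    """Estimate how many requests are needed to cover total_replies items."""
    size = effective_page_size(requested_page_size)
    return math.ceil(total_replies / size)
```

For example, `pages_needed(45, 50)` gives 3 rather than 1, because the request for 50 items per page is treated as 20. Use the result only for planning: the loop itself should still stop on an empty page.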

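Problem 2: inaccurate total count

In the data returned by the second-level comment API, the total-count field (count) is often inaccurate:

# A typical problem scenario
comment_obj = comment.Comment(oid, type_, rpid)
response = await comment_obj.get_sub_comments(page_index=1)
total_count = response['page']['count']  # this value may be inaccurate!
real_comments = response['replies']      # the list of comments actually returned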
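
Since the count cannot be trusted, one option is to ignore it when deciding when to stop, and end the loop once a page brings nothing new. The sketch below does that against an injected `fetch_page` coroutine. `collect_until_exhausted` and `demo_fetch` are hypothetical names, and `demo_fetch` only imitates the response shape shown above, with a deliberately wrong count.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

FetchPage = Callable[[int], Awaitable[Dict]]


async def collect_until_exhausted(fetch_page: FetchPage,
                                  max_pages: int = 50) -> List[Dict]:
    """Collect replies page by page, ignoring the reported total count."""
    seen = set()
    collected: List[Dict] = []
    for page_index in range(1, max_pages + 1):
        response = await fetch_page(page_index)
        replies = response.get("replies") or []  # 'replies' may be missing or None
        fresh = [r for r in replies if r["rpid"] not in seen]
        if not fresh:
            break  # an empty or fully repeated page means we are done
        seen.update(r["rpid"] for r in fresh)
        collected.extend(fresh)
    return collected


async def demo_fetch(page_index: int) -> Dict:
    """Fake endpoint: reports 100 replies but only 25 actually exist."""
    data = [{"rpid": i} for i in range(1, 26)]
    start = (page_index - 1) * 20
    return {"page": {"count": 100}, "replies": data[start:start + 20] or None}
```

Running `asyncio.run(collect_until_exhausted(demo_fetch))` returns the 25 replies that exist, even though the fake endpoint reports 100.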

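Problem 3: deep-pagination performance

When there are a huge number of second-level comments (for example under the top comments of a popular video), traditional pagination can slow down or even time out at deep page numbers.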
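
One way to keep a slow deep page from stalling the whole crawl is to give each request its own timeout and retry with a growing delay. This is a minimal sketch using only the standard library. `fetch_with_timeout`, `flaky_fetch`, and the default values are assumptions for illustration, not part of bilibili-api.

```python
import asyncio
from typing import Awaitable, Callable, Dict

FetchPage = Callable[[int], Awaitable[Dict]]


async def fetch_with_timeout(fetch_page: FetchPage, page_index: int,
                             timeout: float = 10.0, retries: int = 3,
                             base_delay: float = 1.0) -> Dict:
    """Fetch one page, retrying with exponential backoff on timeouts."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error: Exception = RuntimeError("unreachable")
    for attempt in range(retries):
        try:
            return await asyncio.wait_for(fetch_page(page_index), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < retries - 1:
                # Wait base_delay, then 2x, then 4x ... before the next attempt.
                await asyncio.sleep(base_delay * 2 ** attempt)
    raise last_error


call_counter = {"n": 0}


async def flaky_fetch(page_index: int) -> Dict:
    """Fake endpoint whose first call is slower than the caller's timeout."""
    call_counter["n"] += 1
    if call_counter["n"] == 1:
        await asyncio.sleep(0.2)
    return {"replies": [{"rpid": page_index}]}
```

With `timeout=0.05`, the first call to `flaky_fetch` is cancelled and the second attempt succeeds. In a real crawl, `fetch_page` would wrap the library call for a single page.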

Complete solutions

Solution 1: smart paginated crawler

from bilibili_api import comment, sync
from typing import List, Dict
import asyncio

class SubCommentCrawler:
    def __init__(self, credential=None):
        self.credential = credential
        self.max_retries = 3
        self.delay_between_requests = 1.0
    
    async def get_all_sub_comments(self, oid: int, type_: comment.CommentResourceType, 
                                 root_rpid: int, max_pages: int = 50) -> List[Dict]:
        """
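        Fetch all second-level comments under the given root comment.
        
        Args:
            oid: resource ID
            type_: resource type
            root_rpid: root comment ID
            max_pages: maximum number of pages to fetch (guards against an infinite loop)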
        """
        all_comments = []
        current_page = 1
        retry_count = 0
        
        comment_obj = comment.Comment(oid, type_, root_rpid, self.credential)
        
        while current_page <= max_pages and retry_count < self.max_retries:
            try:
                # Use the maximum page size of 20
                response = await comment_obj.get_sub_comments(
                    page_index=current_page, 
                    page_size=20
                )
                
                if not response.get('replies'):
                    break
                
                all_comments.extend(response['replies'])
                current_page += 1
                
                # Pause between requests to avoid hitting the API too frequently
                await asyncio.sleep(self.delay_between_requests)
                retry_count = 0
                
            except Exception as e:
                print(f"Failed to fetch page {current_page}: {e}")
                retry_count += 1
                await asyncio.sleep(2)  # wait longer after a failure
        
        return all_comments

Solution 2: batch processing with a retry mechanism

async def batch_crawl_sub_comments(root_comments_list, crawler, 
                                 concurrent_tasks: int = 5):
    """
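    Crawl the second-level comments of multiple root comments in a batch.
    
    Args:
        root_comments_list: list of root comments, each item an (oid, type_, rpid) tuple
        crawler: a SubCommentCrawler instance
        concurrent_tasks: number of concurrent tasks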
    """
    semaphore = asyncio.Semaphore(concurrent_tasks)
    
    async def process_single_root(oid, type_, rpid):
        async with semaphore:
            return await crawler.get_all_sub_comments(oid, type_, rpid)
    
    tasks = []
    for oid, type_, rpid in root_comments_list:
        task = process_single_root(oid, type_, rpid)
        tasks.append(task)
    
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results
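
Because `asyncio.gather(..., return_exceptions=True)` returns exception objects in the same list as successful results, the caller has to separate the two before using the data. A minimal sketch of that step follows. `split_results` is a hypothetical helper, and the keys could be the root comment rpids in the same order as the tasks.

```python
from typing import Any, Dict, Hashable, List, Sequence, Tuple


def split_results(keys: Sequence[Hashable],
                  results: Sequence[Any]) -> Tuple[Dict, Dict]:
    """Separate gather() output into successes and failures, keyed by task."""
    succeeded: Dict[Hashable, List] = {}
    failed: Dict[Hashable, BaseException] = {}
    for key, result in zip(keys, results):
        if isinstance(result, BaseException):
            failed[key] = result      # the task raised; keep the exception
        else:
            succeeded[key] = result   # the task returned its list of replies
    return succeeded, failed
```

The `failed` mapping can then feed a second pass that retries only the root comments whose crawl raised.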

Solution 3: data validation and deduplication

def validate_and_deduplicate_comments(comments_list):
    """
    Validate the integrity of comment data and remove duplicates.
    """
    seen_rpids = set()
    valid_comments = []
    
    for comment_data in comments_list:
        if not isinstance(comment_data, dict):
            continue
            
        rpid = comment_data.get('rpid')
        if not rpid or rpid in seen_rpids:
            continue
            
        # Check the required fields
        required_fields = ['rpid', 'member', 'content', 'ctime']
        if all(field in comment_data for field in required_fields):
            seen_rpids.add(rpid)
            valid_comments.append(comment_data)
    
    return valid_comments

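Hands-on example: analyzing comments on a popular video

Scenario

Suppose we want to analyze all replies to the first 100 top-level comments of a popular video (AV number: 170001).

Implementation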

from bilibili_api import comment, sync, video
from collections import defaultdict

async def analyze_video_comments(aid: int):
    # Fetch the video info
    v = video.Video(aid=aid)
    video_info = await v.get_info()
    print(f"Video title: {video_info['title']}")
    
    # Fetch top-level comments
    root_comments = []
    page = 1
    while len(root_comments) < 100:
        response = await comment.get_comments(aid, comment.CommentResourceType.VIDEO, page)
        if not response.get('replies'):
            break
        root_comments.extend(response['replies'])
        page += 1
    
    print(f"Fetched {len(root_comments)} top-level comments")
    
    # Create the crawler instance (SubCommentCrawler is defined above)
    crawler = SubCommentCrawler()
    
    # Build the list of root comments
    root_list = [(aid, comment.CommentResourceType.VIDEO, cmt['rpid']) 
                for cmt in root_comments[:100]]
    
    # Crawl second-level comments in batches (batch_crawl_sub_comments is defined above)
    all_sub_comments = await batch_crawl_sub_comments(root_list, crawler, 3)
    
    # Aggregate statistics
    comment_stats = defaultdict(int)
    total_sub_comments = 0
    
    for i, sub_comments in enumerate(all_sub_comments):
        if isinstance(sub_comments, list):
            comment_stats[root_comments[i]['rpid']] = len(sub_comments)
            total_sub_comments += len(sub_comments)
    
    print(f"Fetched {total_sub_comments} second-level comments in total")
    print("Reply count per root comment:", dict(comment_stats))
    
    return root_comments, all_sub_comments

# Run the analysis
root_cmts, sub_cmts = sync(analyze_video_comments(170001))

Performance optimization tips

1. Connection pool management

# Use an aiohttp connection pool to optimize network requests
import aiohttp

async def create_session():
    return aiohttp.ClientSession(
        connector=aiohttp.TCPConnector(limit=10, limit_per_host=5),
        timeout=aiohttp.ClientTimeout(total=30)
    )

2. Caching strategy

# A simple request cache
from datetime import datetime, timedelta

class CommentCache:
    def __init__(self, ttl_minutes=10):
        self.cache = {}
        self.ttl = timedelta(minutes=ttl_minutes)
    
    def get(self, key):
        item = self.cache.get(key)
        if item and datetime.now() - item['timestamp'] < self.ttl:
            return item['data']
        return None
    
    def set(self, key, data):
        self.cache[key] = {
            'data': data,
            'timestamp': datetime.now()
        }

3. Rate limiting

# Request rate control
import asyncio
import time

class RateLimiter:
    def __init__(self, requests_per_second=2):
        self.interval = 1.0 / requests_per_second
        self.last_request = 0
    
    async def acquire(self):
        now = time.time()
        elapsed = now - self.last_request
        if elapsed < self.interval:
            await asyncio.sleep(self.interval - elapsed)
        self.last_request = time.time()
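
When several tasks share the limiter above, two of them can read `last_request` before either updates it, and both proceed at once. The sketch below is one possible variant that serializes `acquire` with a lock and uses a monotonic clock. `LockedRateLimiter` and `demo` are hypothetical names, not part of bilibili-api.

```python
import asyncio
import time
from typing import List, Optional


class LockedRateLimiter:
    """Rate limiter that stays correct when shared by concurrent tasks."""

    def __init__(self, requests_per_second: float = 2.0):
        self.interval = 1.0 / requests_per_second
        self._next_allowed = 0.0
        self._lock: Optional[asyncio.Lock] = None  # created inside the event loop

    async def acquire(self) -> None:
        if self._lock is None:
            self._lock = asyncio.Lock()
        async with self._lock:
            now = time.monotonic()
            wait = self._next_allowed - now
            if wait > 0:
                await asyncio.sleep(wait)
            # Reserve the next slot one interval after this one.
            self._next_allowed = max(now, self._next_allowed) + self.interval


async def demo(n: int = 3, rps: float = 20.0) -> List[float]:
    """Start n workers at once and record when each one gets through."""
    limiter = LockedRateLimiter(rps)
    stamps: List[float] = []

    async def worker() -> None:
        await limiter.acquire()
        stamps.append(time.monotonic())

    await asyncio.gather(*(worker() for _ in range(n)))
    return stamps
```

In `demo`, the three workers start together but pass the limiter roughly one interval apart. Calling `await limiter.acquire()` before each page request would apply the same spacing to a real crawl.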

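Troubleshooting guide

Q1: Why is second-level comment scraping incomplete?

A: Check that page_size is set appropriately (maximum 20), and make sure you iterate over every page.

Q2: What if frequent requests get rate-limited?

A: Add a delay between requests (1-2 seconds), use a rate limiter, and consider rotating across multiple accounts.

Q3: How should network timeouts be handled?

A: Implement a retry mechanism, set reasonable timeouts, and manage connections with a connection pool.

Q4: How is data consistency ensured?

A: Implement validation logic, deduplicate the results, and record timestamps.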

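Summary

Paginated second-level comment scraping with bilibili-api does pose some technical challenges, but by understanding the API's limits, using a sound pagination strategy, and adding appropriate error handling and performance optimizations, it is entirely possible to build a stable and reliable comment crawler.

Key takeaways:

Understand how Bilibili's API pagination mechanisms differ
Handle the page_size limit correctly (maximum 20)
Implement robust error handling and retries
Use suitable concurrency control to avoid being rate-limited
Add data validation to ensure integrity

With the solutions in this article, you should be able to resolve most second-level comment pagination problems and build an efficient, stable system for analyzing Bilibili comment data.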

Authoring statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.