A Complete Guide to Bilibili Comment Scraping and Sentiment Analysis - CSDN Blog

Bilibili Comment Scraping in Practice: From Data Collection to Sentiment Analysis

Data Collection: Scraping Basics and the Bilibili API

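Bilibili comment data can be obtained through the official API or by simulating browser requests. The official API requires a cookie and a csrf_token. An example endpoint:
https://api.bilibili.com/x/v2/reply/main?oid={video aid}&type=1
The response is JSON and includes the comment text, user information, like counts, and other fields.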

To simulate a browser request, use the requests library and set headers that make the request look like it comes from a browser:

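import requests

# Placeholder: set this to the aid of the target video
video_aid = 0
api_url = f"https://api.bilibili.com/x/v2/reply/main?oid={video_aid}&type=1"

headers = {
    "User-Agent": "Mozilla/5.0",
    "Cookie": "your cookie string",
}
# A timeout keeps the script from hanging on a stalled connection
response = requests.get(api_url, headers=headers, timeout=10)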
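As a minimal sketch, the request above could be wrapped in a function that also checks the response before it is parsed. The `next` query parameter, the `code` field in the response body (assumed here to be 0 on success), and the `http_get` hook are assumptions for illustration; none of them is confirmed by the text above.

```python
API_URL = "https://api.bilibili.com/x/v2/reply/main"  # endpoint quoted in the text


def build_params(aid, next_cursor=0):
    """Query parameters for one page of comments on a video.

    `next` is an assumed name for the paging cursor parameter.
    """
    return {"oid": aid, "type": 1, "next": next_cursor}


def fetch_page(aid, next_cursor=0, cookie="", http_get=None, timeout=10):
    """Request one page of comments and return the payload's "data" object.

    `http_get` is a hypothetical hook so the function can be exercised
    without network access; it defaults to requests.get.
    """
    if http_get is None:
        import requests  # imported lazily so the hook can replace it
        http_get = requests.get

    headers = {"User-Agent": "Mozilla/5.0", "Cookie": cookie}
    resp = http_get(
        API_URL,
        params=build_params(aid, next_cursor),
        headers=headers,
        timeout=timeout,
    )
    resp.raise_for_status()  # surface HTTP-level errors (4xx/5xx)

    payload = resp.json()
    # Assumption: the body carries a "code" field that is 0 on success.
    if payload.get("code") != 0:
        raise RuntimeError(f"API returned an error payload: {payload.get('message')}")
    return payload.get("data") or {}
```

Passing query values through `params` instead of formatting them into the URL lets the HTTP library handle encoding, and failing early on an error payload avoids a confusing KeyError later in the parsing step.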

Parsing and Storing the Data

Use the json module to parse the API response and extract the key fields:

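import json

data = json.loads(response.text)
# "replies" may be missing or null (for example, no comments), so fall back to an empty list
replies = data["data"].get("replies") or []
comments = [reply["content"]["message"] for reply in replies]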

For storage, pandas is a convenient way to save the comments as structured data:

import pandas as pd

df = pd.DataFrame(comments, columns=["comment"])
df.to_csv("bilibili_comments.csv", index=False, encoding="utf-8-sig")
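Since the response also carries user information and like counts, it may be useful to keep those fields next to the comment text. The sketch below shows one way to flatten each reply into a record. The field names `member`, `uname`, `like`, and `ctime` are assumptions about the response layout and should be checked against a real response; only `content.message` appears in the code above.

```python
def extract_records(payload):
    """Flatten the replies in an API payload into a list of plain dicts.

    Field names other than content.message are assumed, not confirmed.
    """
    replies = (payload.get("data") or {}).get("replies") or []
    records = []
    for reply in replies:
        member = reply.get("member") or {}
        records.append(
            {
                "comment": reply["content"]["message"],
                "user": member.get("uname"),   # assumed: display name
                "likes": reply.get("like", 0),  # assumed: like count
                "ctime": reply.get("ctime"),    # assumed: Unix timestamp
            }
        )
    return records
```

The resulting list can be passed straight to `pd.DataFrame(records)`, which produces one column per key.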

Building the Sentiment Analysis Model

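An off-the-shelf model such as SnowNLP or BERT gives a quick sentiment baseline. SnowNLP's sentiments property returns a score between 0 and 1, where values closer to 1 indicate more positive text: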

from snownlp import SnowNLP

def analyze_sentiment(text):
    return SnowNLP(text).sentiments

df["sentiment"] = df["comment"].apply(analyze_sentiment)
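A continuous score is often easier to summarize once it is bucketed into labels. The sketch below maps a score in the 0 to 1 range to one of three classes. The function name and the 0.4 and 0.6 cut-offs are arbitrary assumptions chosen for illustration and should be tuned on labeled samples.

```python
def label_sentiment(score, low=0.4, high=0.6):
    """Map a sentiment score in [0, 1] to a coarse label.

    The thresholds are illustrative defaults, not values from the source.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0, 1], got {score}")
    if score >= high:
        return "positive"
    if score <= low:
        return "negative"
    return "neutral"
```

Applied to the DataFrame above, this would look like `df["label"] = df["sentiment"].apply(label_sentiment)`, after which `df["label"].value_counts()` gives the class distribution.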

Use matplotlib to visualize the distribution of sentiment scores:

import matplotlib.pyplot as plt

plt.hist(df["sentiment"], bins=20)
plt.xlabel("Sentiment Score")
plt.ylabel("Frequency")
plt.show()

Anti-Scraping Countermeasures and Optimization

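Random delays: call time.sleep(random.uniform(1, 3)) between requests to avoid sending them too frequently
Proxy IP pool: use a proxy provider or self-hosted proxies to cope with IP blocking
Pagination: request pages in a loop, following the next field, to page through comments automatically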
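The random delay and pagination points above can be combined into one loop, sketched below. It assumes each page's data object contains a `cursor` object with `next` and `is_end` fields; that layout is an assumption to verify against a real response. `fetch` is a hypothetical callable that takes a cursor value and returns the data object for that page.

```python
import random
import time


def iter_comment_pages(fetch, max_pages=50, delay_range=(1.0, 3.0), sleep=time.sleep):
    """Yield the list of replies for each page until the comments run out.

    fetch: hypothetical callable, cursor -> dict for that page
    max_pages: hard cap so a malformed cursor cannot loop forever
    delay_range: bounds in seconds for the random pause between requests
    sleep: injectable for testing; defaults to time.sleep
    """
    cursor = 0
    for _ in range(max_pages):
        data = fetch(cursor)
        replies = data.get("replies") or []
        if not replies:
            break  # an empty page is treated as the end of the comments
        yield replies

        cursor_info = data.get("cursor") or {}  # assumed response layout
        if cursor_info.get("is_end") or "next" not in cursor_info:
            break
        cursor = cursor_info["next"]

        # Pause a random interval before the next request
        sleep(random.uniform(*delay_range))
```

Writing the loop as a generator keeps memory use flat and lets the caller stop early, for example after collecting a fixed number of comments.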

Advanced Application: Topic Modeling

Extract keywords using jieba for word segmentation and scikit-learn's TF-IDF:

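import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

# Chinese text has no spaces between words, so segment it with jieba
# instead of relying on the vectorizer's default token pattern
tfidf = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None, max_features=100)
X = tfidf.fit_transform(df["comment"])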
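To make the ranking behind TF-IDF concrete, here is a small pure-Python sketch that works on comments that have already been segmented into tokens. It uses the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, which is the default formula in scikit-learn, but skips the L2 normalization that TfidfVectorizer applies, so the numbers will not match the matrix above exactly. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(tokenized_docs):
    """Return one {token: tf-idf weight} dict per document."""
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents each token appears
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized_docs:
        term_freq = Counter(tokens)
        scores.append(
            {
                token: count * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1.0)
                for token, count in term_freq.items()
            }
        )
    return scores


def top_keywords(tokenized_docs, k=5):
    """Rank tokens by their total tf-idf weight across all documents."""
    totals = Counter()
    for doc_scores in tfidf_scores(tokenized_docs):
        totals.update(doc_scores)  # adds the weights token by token
    return [token for token, _ in totals.most_common(k)]
```

A token that appears in every comment gets the minimum idf of 1, so its weight grows only with how often it occurs, while rarer tokens receive a larger multiplier per occurrence.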

Legal and Ethical Considerations

Comply with Bilibili's robots.txt
Use the data for research only; commercial use is prohibited
Anonymize user information
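One possible way to handle the last point is to replace user identifiers with keyed hashes before the data is stored. The sketch below uses HMAC-SHA256 from the standard library. The function name, the 16-character truncation, and the idea of a project-level secret key are assumptions for illustration. Strictly speaking this is pseudonymization: anyone holding the key can recompute the mapping, so the key should be stored separately from the data or discarded.

```python
import hashlib
import hmac


def pseudonymize(user_id, secret_key):
    """Return a stable, non-reversible token for a user identifier.

    user_id: the original identifier (int or str)
    secret_key: bytes known only to the project; without it the tokens
                cannot be rebuilt by hashing a list of known user IDs
    """
    digest = hmac.new(
        secret_key,
        str(user_id).encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()
    return digest[:16]  # shortened for readability; keep it longer for very large datasets
```

Because the same input always maps to the same token, comments from one user can still be grouped for analysis without keeping the original ID or username in the dataset.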