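Scraping Douban Movie Reviews
The requirement here is to scrape Douban movie reviews, and specifically to pull down all of the reviews for a film rather than a sample. The approach is simple: go straight to the film's review pages on Douban and crawl them. I looked at several open-source projects, but each of them only fetched part of the reviews, so I wrote my own. The code is as follows: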
import random
import time

def crawl_comments(pages=5):
    """Crawl several pages of short comments; pages=5 yields roughly 100 items."""
    all_comments = []
    start1 = 0  # offset of the first page
    for i in range(pages):
        start = start1 + i * 20  # 20 items per page
        print(f"Fetching page {i+1}, start={start} ...")
        html = fetch_page(start=start)
        page_comments = parse_comments(html)
        print(f"Parsed {len(page_comments)} comments on this page")
        all_comments.extend(page_comments)
        # Random delay so that requests are not sent too quickly
        sleep_time = random.uniform(1, 5)
        print(f"Sleeping for {sleep_time:.2f} seconds...")
        time.sleep(sleep_time)
    return all_comments
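The paging in crawl_comments is plain offset arithmetic: page i starts at i * 20. As a minimal sketch, the offsets can be generated on their own and checked without touching the network. Note that page_offsets is a hypothetical helper that is not part of the original script, and the page size of 20 is simply taken from the loop above.

```python
def page_offsets(pages=5, page_size=20, first=0):
    """Yield the `start` offset for each page, mirroring the loop in crawl_comments.

    Hypothetical helper for illustration; not part of the original script.
    """
    for i in range(pages):
        yield first + i * page_size


if __name__ == "__main__":
    # Five pages of 20 items each
    print(list(page_offsets(5)))  # [0, 20, 40, 60, 80]
```

Keeping the arithmetic in one small function makes it easy to change the page size or resume from a later offset without editing the crawl loop.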
Next, fetch_page:
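import requests

# BASE_URL and headers are module-level values defined elsewhere in the script;
# they are not shown in this post.
def fetch_page(start=0, limit=20, sort="new_score"):
    """Request one page of HTML."""
    params = {
        "start": start,
        "limit": limit,
        "status": "P",
        "sort": sort,
    }
    resp = requests.get(BASE_URL, headers=headers, params=params, timeout=10)
    resp.raise_for_status()  # raise on 4xx/5xx responses
    return resp.text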
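To see roughly what fetch_page asks the server for, the same parameters can be turned into a query string with the standard library. This is only a sketch: build_url is a hypothetical helper, and the example.com address is a placeholder, because the real BASE_URL is not given in the post.

```python
from urllib.parse import urlencode

# Placeholder only: the real BASE_URL is not shown in the post.
EXAMPLE_URL = "https://example.com/reviews"


def build_url(start=0, limit=20, sort="new_score", base=EXAMPLE_URL):
    """Return `base` plus the same query parameters that fetch_page sends."""
    params = {"start": start, "limit": limit, "status": "P", "sort": sort}
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    # Second page of results
    print(build_url(start=20))
    # https://example.com/reviews?start=20&limit=20&status=P&sort=new_score
```

Printing the URL this way is a quick check that start really advances by 20 per page before any real request is made.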
Next, parse_comments:
import re
from bs4 import BeautifulSoup

def parse_comments(html):
    """Parse every short comment on one page of HTML and return a list[dict]."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.select("div.review-item")
    print("Number of review-item blocks:", len(items))
    results = []
    for item in items:
        block_text = item.get_text(" ", strip=True)
        # Parse the useful / useless / reply counts.
        # "回应" is the literal "replies" label in the page text, so it stays in Chinese.
        m = re.search(r"(\d+)\s+(\d+)\s+(\d+)\s*回应", block_text)
        if m:
            useful = int(m.group(1))
            useless = int(m.group(2))
            replies = int(m.group(3))
        else:
            useful = useless = replies = 0
        # User name
        user_tag = item.select_one(".name")
        user_name = user_tag.get_text(strip=True) if user_tag else None
        # Time
        time_span = item.select_one("span.main-meta")
        comment_time = time_span.get_text(strip=True) if time_span else None
        # Summary (clean_summary is a helper that is not shown in this post)
        summary_tag = item.select_one("div.short-content")
        summary = summary_tag.get_text(" ", strip=True) if summary_tag else None
        summary = clean_summary(summary)
        # Title & link
        title_tag = item.select_one("h2 a")
        title = title_tag.get_text(strip=True) if title_tag else None
        url = title_tag.get("href") if title_tag else None
        results.append({
            "title": title,
            # "url": url,
            "user": user_name,
            "time": comment_time,
            "useful": useful,
            "useless": useless,
            "replies": replies,
            "summary": summary,
        })
    return results
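parse_comments calls clean_summary, which the post does not show. The version below is only a guess at what such a helper might do, not the original code: it passes None through (the summary can be None when no div.short-content is found), strips a trailing "(展开)" expand link, and collapses whitespace. That a "(展开)" marker ends up in the summary text is an assumption about the page markup.

```python
import re


def clean_summary(text):
    """Hypothetical stand-in for the clean_summary helper used by parse_comments."""
    if text is None:
        return None
    # Assumption: the summary may end with an "(展开)" ("expand") link;
    # get_text(" ") can leave spaces inside the parentheses, so allow them.
    text = re.sub(r"\(\s*展开\s*\)\s*$", "", text)
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(clean_summary("a  b\n c ( 展开 )"))  # a b c
```

Returning None for missing summaries keeps the shape of the result dicts the same as in parse_comments.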
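Summary
That is essentially all of the code. Parts of it were generated by AI and then adjusted locally. None of this is especially hard. I learned Python through AI as well, then went through a few tutorials, and this is roughly where I ended up. I had not used it for a long time, so a lot of it felt unfamiliar, which is a good reminder to keep studying regularly. As for what sits underneath: the standard interpreter, CPython, is written in C rather than C++, and it is worth a look if you are interested. OK, that's all.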