Scraping Rotten Tomatoes with BeautifulSoup

Latest recommended article published on 2025-11-21 10:15:15


License: CC 4.0 BY-SA

Included in the column 快乐学python


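This post shows how to scrape the reviews and scores of a given movie from Rotten Tomatoes using Python's requests and BeautifulSoup libraries. It first sets the request headers, then fetches the page with requests. The HTML is parsed to locate each review and its score, and the results are stored in a dictionary keyed by review text. When a movie's reviews span several pages, a loop visits each page in turn and collects its reviews and scores, and all of the collected data is printed at the end.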
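The pagination step described above can be sketched on its own, without any network access. This is a minimal illustration, not the script's exact code: the helper names `parse_page_count` and `page_urls` are hypothetical, and the pagination text is assumed to look like `Page 1 of 5`, with the page count as the last number.

```python
import re

BASE_URL = 'https://www.rottentomatoes.com/m/%s/reviews'


def parse_page_count(page_info_text):
    """Return the total page count from text such as 'Page 1 of 5' (assumed format)."""
    if not page_info_text:
        return 1  # no pagination element: a single page of reviews
    numbers = re.findall(r'\d+', page_info_text)
    return int(numbers[-1]) if numbers else 1


def page_urls(movie_id, page_count):
    """Build the review URL for every page, first page included."""
    first = BASE_URL % movie_id
    rest = ['%s?type=&sort=&page=%d' % (first, n) for n in range(2, page_count + 1)]
    return [first] + rest


print(parse_page_count('Page 1 of 5'))
print(page_urls('a_week_away', 2))
```

Separating URL construction from fetching makes it possible to check the page loop before sending any requests.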


from bs4 import BeautifulSoup
import requests

# Browser-like request headers sent with every request
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36',
    'Referer': 'https://www.google.com/'
}
if __name__ == '__main__':
    data = {}  # holds all scraped data: review text -> score

    id = 'a_week_away'

    name = 'https://www.rottentomatoes.com/m/%s/reviews' % (id)

    doc = requests.get(url=name, headers=headers)

    soup = BeautifulSoup(doc.content, 'lxml')

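    # Work out how many pages need to be visited.
    # Movies with only a few reviews have no pagination element, so default to one page.
    page_info = soup.select('.pageInfo')
    if len(page_info) == 0:
        pageCount = 1
    else:
        # The last whitespace-separated token of the pagination text is the page count
        pageCount = int(page_info[0].string.split()[-1])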

    # Process the first page

    for item in soup.select('.review_table_row'):
        # Review text; get_text() also works when the element contains nested tags
        review = item.select('.the_review')[0].get_text().strip()
        # Split the prettified markup into lines; the score, when present,
        # sits on the second-to-last line
        scores = item.select('.review-link')[0].prettify().strip().split('\n')

        if scores[-2].find('|') == -1:
            # No score for this review: store a single space as a placeholder
            data[review] = ' '
        else:
            data[review] = scores[-2].strip()

    # Process the remaining pages

    for pageId in range(2, pageCount + 1):
        print(pageId)  # progress indicator
        newUrl = 'https://www.rottentomatoes.com/m/%s/reviews?type=&sort=&page=%d' % (id, pageId)
        tdoc = requests.get(url=newUrl, headers=headers)
        tsoup = BeautifulSoup(tdoc.content, 'lxml')
        for item in tsoup.select('.review_table_row'):
            # Review text; get_text() also works when the element contains nested tags
            review = item.select('.the_review')[0].get_text().strip()
            # Split the prettified markup into lines; the score, when present,
            # sits on the second-to-last line
            scores = item.select('.review-link')[0].prettify().strip().split('\n')

            if scores[-2].find('|') == -1:
                # No score for this review: store a single space as a placeholder
                data[review] = ' '
            else:
                data[review] = scores[-2].strip()

    print(data)
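The first-page loop and the remaining-pages loop above repeat the same parsing logic. One way to avoid the duplication is a single helper that takes an HTML string, which also lets the parsing be tried offline. The following is a rough sketch under stated assumptions: `parse_reviews` and `SAMPLE_HTML` are hypothetical, the sample markup only mimics the class names used in the script and is not copied from the real site, and the built-in `html.parser` is swapped in for `lxml` to keep the example dependency-light. Unlike the script, it stores only the text after the `|` rather than the whole prettified line.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used in the script above.
SAMPLE_HTML = """
<div class="review_table_row">
  <div class="the_review">Example review with a score.</div>
  <div class="review-link"><a href="#">Full Review</a> | Original Score: 3/5</div>
</div>
<div class="review_table_row">
  <div class="the_review">Example review without a score.</div>
  <div class="review-link"><a href="#">Full Review</a></div>
</div>
"""


def parse_reviews(html):
    """Map each review text to its score text, or ' ' when no score is present."""
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for row in soup.select('.review_table_row'):
        review = row.select_one('.the_review').get_text(strip=True)
        link_text = row.select_one('.review-link').get_text(' ', strip=True)
        # Same convention as the script: a '|' marks the presence of a score.
        if '|' in link_text:
            results[review] = link_text.split('|', 1)[1].strip()
        else:
            results[review] = ' '
    return results


reviews = parse_reviews(SAMPLE_HTML)
print(reviews)
```

With such a helper, each page would be handled by a call like `data.update(parse_reviews(response.text))`, where `response` is whatever `requests.get` returned for that page.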