Scraping Douban's New Movies Chart with Scrapy: Hands-On Data Cleaning and Trend Analysis

Article link: https://blog.youkuaiyun.com/shanwei_spider/article/details/148821378

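Douban Movie is one of China's leading film rating and recommendation platforms, with a large body of film data that includes reviews, ratings, and new-release charts. Its new movies chart is a popular feature that lists films currently in theaters or about to be released. Scraping this data, then cleaning and analyzing it, helps us follow trends in the film market and see which kinds of films are popular and what they have in common.

This article walks through scraping Douban's new movies chart with the Scrapy framework, then cleaning and analyzing the data to explore what is currently trending.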

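1. Project Overview

Goals

Scrape Douban's new movies chart with the Scrapy framework.
Clean and process the scraped data, extracting each film's title, release date, rating, director, and lead actors.
Analyze which genres are trending to find the most popular kinds of films on the market.

Tools

Scrapy: fetches and parses the pages.
pandas: cleans and processes the data.
matplotlib, seaborn: visualize the data.
Jupyter Notebook: hosts the analysis and presents the results.

Target data

Film title, rating, release date, director, and lead actors.
Film genre (action, romance, science fiction, and so on) and its distribution.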

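2. Environment Setup and Scrapy Project Initialization

2.1 Installing Scrapy

First, install Scrapy with pip:

pip install scrapy

2.2 Initializing the Scrapy Project

In a terminal, change to the directory where you want to keep the project and run the following command to create a Scrapy project:

scrapy startproject douban_movies

This creates a project folder named douban_movies with the following structure: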

douban_movies/
    scrapy.cfg
    douban_movies/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

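2.3 Creating the Spider

Next, create a new spider file in the project's spiders/ package (douban_movies/douban_movies/spiders/ relative to the directory where you ran startproject) to scrape the new movies chart.

cd douban_movies/douban_movies/spiders
touch new_movies.py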

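3. Writing the Spider: Scraping Douban's New Movies Chart

3.1 Analyzing the New Movies Chart Page

The URL of Douban's new movies chart page is:

https://movie.douban.com/new_movies

Viewing the page source in a browser shows that each film's basic information (title, rating, release date, director, lead actors, and so on) is laid out as an HTML table or list.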
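Before running the spider, it can help to try the XPath expressions against a saved fragment of the page. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors; it supports only a subset of XPath and needs well-formed markup. The HTML fragment, the example.com links, and the extract_movies helper are hypothetical: the class names are modeled on the selectors used in this article, not copied from Douban's live page. Inside a Scrapy project, the scrapy shell command offers a similar interactive check against the real response.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; the class names mirror the selectors
# used in this article and are not taken from the live Douban page.
SAMPLE_HTML = """
<div>
  <div class="movie-item film-channel">
    <div class="movie-item-title"><a href="https://example.com/subject/1/">Sample Film</a></div>
    <span class="rating_num">7.8</span>
  </div>
  <div class="movie-item film-channel">
    <div class="movie-item-title"><a href="https://example.com/subject/2/">Unrated Film</a></div>
  </div>
</div>
"""


def extract_movies(markup):
    """Return one dict per film block, with None for any missing node."""
    root = ET.fromstring(markup)
    movies = []
    for node in root.findall(".//div[@class='movie-item film-channel']"):
        link = node.find(".//div[@class='movie-item-title']/a")
        rating = node.find(".//span[@class='rating_num']")
        movies.append({
            "title": link.text if link is not None else None,
            "url": link.get("href") if link is not None else None,
            "rating": rating.text if rating is not None else None,
        })
    return movies


movies = extract_movies(SAMPLE_HTML)
for movie in movies:
    print(movie)
```

The second block has no rating node on purpose: it shows that a missing element comes back as None, which is the case the spider has to tolerate.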

3.2 Writing the Spider Code

The spider below scrapes the title, rating, director, lead actors, and release date of each film on the chart page.

`new_movies.py`

import scrapy
from douban_movies.items import DoubanMoviesItem

class NewMoviesSpider(scrapy.Spider):
    name = "new_movies"
    allowed_domains = ["douban.com"]
    start_urls = [
        "https://movie.douban.com/new_movies"
    ]

    def parse(self, response):
        # Extract one item per film block on the page.
        # Check these class names against the live page markup before relying on them.
        for movie in response.xpath('//div[@class="movie-item film-channel"]'):
            item = DoubanMoviesItem()

            item['title'] = movie.xpath('.//div[@class="movie-item-title"]/a/text()').get()
            item['url'] = movie.xpath('.//div[@class="movie-item-title"]/a/@href').get()
            item['rating'] = movie.xpath('.//span[@class="rating_num"]/text()').get()
            # default='' avoids an AttributeError on .strip() when the node is missing
            item['release_date'] = movie.xpath('.//div[@class="movie-item-pub"]/text()').get(default='').strip()
            item['director'] = movie.xpath('.//div[@class="movie-item-director"]/text()').get()
            item['actors'] = movie.xpath('.//div[@class="movie-item-actors"]/text()').get()

            yield item

        # Follow the "next page" link if there is one
        next_page = response.xpath('//span[@class="next"]/a/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

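3.3 Code Walkthrough

start_urls: the spider's starting URL, here the new movies chart page.
parse method: processes the downloaded page and extracts each film's title, rating, release date, director, and lead actors.
next_page: if the page has a "next page" link, the spider follows it and parses that page with the same method.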
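The "next page" link is often a relative href, and response.follow resolves it against the URL of the current page before scheduling the request. The sketch below reproduces that resolution step with the standard library's urljoin so the behavior can be checked without running a crawl. The resolve_next_page helper and the ?start=20 query string are hypothetical and only illustrate the idea.

```python
from urllib.parse import urljoin


def resolve_next_page(page_url, href):
    """Resolve a 'next page' href against the URL of the current page."""
    if not href:
        return None  # no link found, so there is nothing more to request
    return urljoin(page_url, href)


base = "https://movie.douban.com/new_movies"

# A query-only href keeps the path of the current page.
print(resolve_next_page(base, "?start=20"))
# A root-relative href replaces the path but keeps scheme and host.
print(resolve_next_page(base, "/new_movies?start=20"))
# A missing link yields None, which matches the `if next_page:` guard in the spider.
print(resolve_next_page(base, None))
```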

3.4 Defining the Item

In douban_movies/items.py, define the structure of the scraped data (the Item):

import scrapy

class DoubanMoviesItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    rating = scrapy.Field()
    release_date = scrapy.Field()
    director = scrapy.Field()
    actors = scrapy.Field()

3.5 Configuring Pipelines (Optional)

To save the scraped data to a database or a file, put the storage logic in pipelines.py. For example, to write a CSV file:

import csv

class DoubanMoviesPipeline:
    def open_spider(self, spider):
        self.file = open('douban_new_movies.csv', mode='w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['Title', 'URL', 'Rating', 'Release Date', 'Director', 'Actors'])

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow([item['title'], item['url'], item['rating'], item['release_date'], item['director'], item['actors']])
        return item
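One possible variation, sketched here rather than taken from the pipeline above, is to build each row with csv.DictWriter so that a missing field becomes an empty cell instead of raising a KeyError. The write_rows helper and the sample items are hypothetical, and the example writes to an in-memory buffer so it runs without touching the file system.

```python
import csv
import io

# Same field names as the Item definition in this article.
FIELDS = ["title", "url", "rating", "release_date", "director", "actors"]


def write_rows(items, stream):
    """Write dict-like items as CSV rows, leaving missing fields empty."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for item in items:
        # item.get() returns None for an absent key; store it as an empty cell.
        row = {key: ("" if item.get(key) is None else item.get(key)) for key in FIELDS}
        writer.writerow(row)


# Made-up items for illustration only.
buffer = io.StringIO()
write_rows(
    [
        {"title": "Sample Film", "rating": "7.8"},
        {"title": "Unrated Film", "rating": None},
    ],
    buffer,
)
print(buffer.getvalue())
```

Using the item field names as the header row also keeps the CSV column names identical to the keys used when the file is later loaded for analysis.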

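3.6 Configuring settings

Enable the pipeline in settings.py. The number sets the order in which pipelines run (lower values run first; values conventionally fall between 0 and 1000):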

ITEM_PIPELINES = {
    'douban_movies.pipelines.DoubanMoviesPipeline': 1,
}
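Sites may reject or throttle requests that arrive too quickly or carry Scrapy's default user agent, so a crawl like this one usually benefits from a few politeness settings. The fragment below is a sketch of additions to settings.py. The setting names are standard Scrapy settings, but the values and the user agent string are illustrative assumptions, not values from this article.

```python
# Possible additions to settings.py; all values are illustrative assumptions.

# Hypothetical user agent string; replace it with one that identifies your crawler.
USER_AGENT = "Mozilla/5.0 (compatible; douban-movies-tutorial)"

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 2

# Let Scrapy adapt the delay to the observed server response times.
AUTOTHROTTLE_ENABLED = True

# Send at most one request at a time to each domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```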

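4. Running the Spider

From the project root (the directory containing scrapy.cfg), run the spider with:

scrapy crawl new_movies

The spider scrapes the new movies chart and, through the pipeline above, saves the results to douban_new_movies.csv.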

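5. Data Cleaning and Analysis

5.1 Loading and Cleaning the Data

Load the scraped data with pandas and clean it:

import pandas as pd

# Read the CSV file
df = pd.read_csv('douban_new_movies.csv')

# Normalize the headers written by the pipeline ('Release Date' -> 'release_date')
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Inspect the raw data
print(df.head())

# Clean the data: keep the text after the full-width colon, if there is one
df['release_date'] = df['release_date'].astype('string').str.split('：').str[-1].str.strip()
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')

# Fill missing ratings with the mean rating
df['rating'] = df['rating'].fillna(df['rating'].mean())

# Inspect the cleaned data
print(df.head())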
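Release date text scraped from a page often carries extra characters around the date itself, such as a label or a region in parentheses. As a minimal sketch, assuming the text contains a date in YYYY-MM-DD form somewhere in the string (an assumption about the scraped format, not something this article confirms), a small helper can pull out just the date. The parse_release_date name and the sample strings are hypothetical.

```python
import re
from datetime import date

# Matches the first YYYY-MM-DD style date anywhere in the text.
DATE_PATTERN = re.compile(r"(\d{4})-(\d{1,2})-(\d{1,2})")


def parse_release_date(text):
    """Return the first YYYY-MM-DD date found in text, or None."""
    if not isinstance(text, str):
        return None  # covers None and NaN values from missing cells
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None  # digits matched but do not form a real calendar date


# Made-up inputs for illustration only.
print(parse_release_date("上映日期：2025-06-20(中国大陆)"))
print(parse_release_date("date unknown"))
```

With pandas, the helper could be applied column-wide with df['release_date'].apply(parse_release_date) before converting the column with pd.to_datetime.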

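5.2 Analysis and Visualization

5.2.1 Rating Distribution

A histogram of the ratings shows how scores are spread and which range most new films fall into.

import matplotlib.pyplot as plt
import seaborn as sns

# Set the plot style
sns.set_theme(style="whitegrid")

# Plot the rating distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['rating'], kde=True, bins=10, color='skyblue')

plt.title('Rating Distribution of New Films on Douban', fontsize=16)
plt.xlabel('Rating', fontsize=12)
plt.ylabel('Number of films', fontsize=12)

plt.tight_layout()
plt.show()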

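5.2.2 Release Date Analysis

The release dates support further analysis, such as counting how many films open in each month. Breaking the counts down by genre would require also scraping a genre field, which the spider above does not collect.

# Count films per release month; rows with unparseable dates are dropped
df['release_month'] = pd.to_datetime(df['release_date'], errors='coerce').dt.month
release_month_count = df['release_month'].dropna().astype(int).value_counts().sort_index()

# Plot the distribution by month
plt.figure(figsize=(10, 6))
release_month_count.plot(kind='bar', color='salmon')

plt.title('Release Months of New Films on Douban', fontsize=16)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Number of films', fontsize=12)

plt.tight_layout()
plt.show()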
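The goals listed at the start include a genre breakdown, but the spider in this article does not collect genres. As a rough sketch of that step, assuming a hypothetical genres field that holds slash-separated genre names, the counting could look like the code below. The count_genres helper is illustrative, and the sample values are made up, not scraped data.

```python
from collections import Counter


def count_genres(genre_strings, sep="/"):
    """Count how often each genre appears across separator-delimited strings."""
    counts = Counter()
    for raw in genre_strings:
        if not raw:
            continue  # skip missing or empty values
        counts.update(part.strip() for part in raw.split(sep) if part.strip())
    return counts


# Made-up values for illustration only; a real run would read a scraped column.
sample = ["Drama / Romance", "Action / Sci-Fi", "Drama", None]
genre_counts = count_genres(sample)

for genre, count in genre_counts.most_common():
    print(genre, count)
```

The resulting Counter can be passed to pandas.Series and plotted as a bar chart in the same way as the month counts.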