Kaggle TMDB 5000 Movie Data Analysis and Movie Recommendation Model

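This post analyzes the TMDB 5000 movie dataset from Kaggle, covering data cleaning, exploratory analysis, and the construction of a movie recommendation model. It examines the distribution of movie genres, the relationship between profit and rating, and the influence of directors and actors, then vectorizes the genre, cast, director, and keywords features to build the recommendation model. The findings: drama, comedy, thriller, and action films are the most popular genres; science fiction and animated films are the most profitable; and films by directors such as Oliver Stone and Steven Spielberg attract the most attention.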


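The data comes from the TMDB 5000 movie dataset on Kaggle. This analysis mainly covers visualization of the movie data and a simple movie recommendation model, including:
1. Distribution of movie genres and how it changes over time
2. The relationship between profit, rating, and popularity
3. Which directors' films sell well or are well rated
4. The most prolific cast and crew
5. Movie keyword analysis
6. Similarity-based movie recommendation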

Data analysis

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import json
import warnings
warnings.filterwarnings('ignore')  # suppress warnings
movie = pd.read_csv('tmdb_5000_movies.csv')
credit = pd.read_csv('tmdb_5000_credits.csv')
movie.head(1)
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 237000000 [{“id”: 28, “name”: “Action”}, {“id”: 12, “nam… http://www.avatarmovie.com/ 19995 [{“id”: 1463, “name”: “culture clash”}, {“id”:… en Avatar In the 22nd century, a paraplegic Marine is di… 150.437577 [{“name”: “Ingenious Film Partners”, “id”: 289… [{“iso_3166_1”: “US”, “name”: “United States o… 2009-12-10 2787965087 162.0 [{“iso_639_1”: “en”, “name”: “English”}, {“iso… Released Enter the World of Pandora. Avatar 7.2 11800
movie.tail(3)
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
4800 0 [{“id”: 35, “name”: “Comedy”}, {“id”: 18, “nam… http://www.hallmarkchannel.com/signedsealeddel… 231617 [{“id”: 248, “name”: “date”}, {“id”: 699, “nam… en Signed, Sealed, Delivered “Signed, Sealed, Delivered” introduces a dedic… 1.444476 [{“name”: “Front Street Pictures”, “id”: 3958}… [{“iso_3166_1”: “US”, “name”: “United States o… 2013-10-13 0 120.0 [{“iso_639_1”: “en”, “name”: “English”}] Released NaN Signed, Sealed, Delivered 7.0 6
4801 0 [] http://shanghaicalling.com/ 126186 [] en Shanghai Calling When ambitious New York attorney Sam is sent t… 0.857008 [] [{“iso_3166_1”: “US”, “name”: “United States o… 2012-05-03 0 98.0 [{“iso_639_1”: “en”, “name”: “English”}] Released A New Yorker in Shanghai Shanghai Calling 5.7 7
4802 0 [{“id”: 99, “name”: “Documentary”}] NaN 25975 [{“id”: 1523, “name”: “obsession”}, {“id”: 224… en My Date with Drew Ever since the second grade when he first saw … 1.929883 [{“name”: “rusty bear entertainment”, “id”: 87… [{“iso_3166_1”: “US”, “name”: “United States o… 2005-08-05 0 90.0 [{“iso_639_1”: “en”, “name”: “English”}] Released NaN My Date with Drew 6.3 16
movie.info()  # 4,803 samples; some features have missing values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(4), object(13)
memory usage: 750.5+ KB

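There are 4,803 samples, and some features have missing values. homepage and tagline are missing in many rows, but neither affects the basic analysis; release_date and runtime can be filled in. On closer inspection, some samples have [] as the value of genres, keywords, or production_companies, which needs attention.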
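The checks described above can be sketched on a small stand-in table. The toy rows are hypothetical, and filling runtime with the median is an assumption made for illustration; the post does not say which fill value is used.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movie table (hypothetical rows, not the real data).
toy = pd.DataFrame({
    'genres': ['[{"id": 28, "name": "Action"}]', '[]',
               '[{"id": 99, "name": "Documentary"}]'],
    'runtime': [162.0, np.nan, 90.0],
    'release_date': ['2009-12-10', None, '2005-08-05'],
})

# Count missing values per column (what info() shows indirectly).
missing_counts = toy.isnull().sum()

# An empty JSON list is not NaN, so it has to be counted separately.
empty_genres = (toy['genres'] == '[]').sum()

# Assumption: fill the missing runtime with the column median.
toy['runtime'] = toy['runtime'].fillna(toy['runtime'].median())

print(missing_counts)
print('rows with empty genres:', empty_genres)
```

The same two counts can be run on movie directly once the CSV is loaded; rows with [] pass the non-null check in info(), which is why they are easy to overlook.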

credit.info()

Data cleaning

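Many features are stored as JSON, i.e. key-value pairs similar to a dictionary. To simplify later processing, we convert them to a str or list that Python can handle easily, which makes it easier to extract the useful information.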
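As a compact illustration of this conversion, the sketch below parses a JSON string and keeps only the name values. The helper extract_names is a hypothetical name introduced here, not part of the original notebook, and it returns a list rather than the str that the loops in this post store.

```python
import json
import pandas as pd

def extract_names(json_text):
    """Return the 'name' value of every object in a JSON-encoded list."""
    return [item['name'] for item in json.loads(json_text)]

# Toy column shaped like movie['genres'] (second row mimics the [] case).
toy = pd.Series([
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[]',
])

names = toy.apply(extract_names)
print(names.tolist())
```

Because the helper works on one cell at a time, the same function could be applied to keywords, production_companies, production_countries, and spoken_languages, which all share the id/name layout.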

# movie genres: parse the JSON so that films can be grouped by genre
movie['genres']=movie['genres'].apply(json.loads)
# Series.apply calls the given function on every value in the column
movie['genres'].head(1)
0    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
Name: genres, dtype: object
list(zip(movie.index,movie['genres']))[:2]
[(0,
  [{'id': 28, 'name': 'Action'},
   {'id': 12, 'name': 'Adventure'},
   {'id': 14, 'name': 'Fantasy'},
   {'id': 878, 'name': 'Science Fiction'}]),
 (1,
  [{'id': 12, 'name': 'Adventure'},
   {'id': 14, 'name': 'Fantasy'},
   {'id': 28, 'name': 'Action'}])]
for index,i in zip(movie.index,movie['genres']):
    list1=[]
    for j in range(len(i)):
        list1.append((i[j]['name']))  # collect the value of 'name', e.g. Action
    movie.loc[index,'genres']=str(list1)
movie.head(1)
# The genres column is no longer JSON: the value of each 'name' key (the genre) has been extracted and assigned back to genres
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 237000000 [‘Action’, ‘Adventure’, ‘Fantasy’, ‘Science Fi… http://www.avatarmovie.com/ 19995 [{“id”: 1463, “name”: “culture clash”}, {“id”:… en Avatar In the 22nd century, a paraplegic Marine is di… 150.437577 [{“name”: “Ingenious Film Partners”, “id”: 289… [{“iso_3166_1”: “US”, “name”: “United States o… 2009-12-10 2787965087 162.0 [{“iso_639_1”: “en”, “name”: “English”}, {“iso… Released Enter the World of Pandora. Avatar 7.2 11800
# Apply the same method to the keywords column
movie['keywords'] = movie['keywords'].apply(json.loads)
for index,i in zip(movie.index,movie['keywords']):
    list2=[]
    for j in range(len(i)):
        list2.append(i[j]['name'])
    movie.loc[index,'keywords'] = str(list2)
# Likewise for production_companies, production_countries, and spoken_languages
movie['production_companies'] = movie['production_companies'].apply(json.loads)
for index,i in zip(movie.index,movie['production_companies']):
    list3=[]
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index,'production_companies']=str(list3)
movie['production_countries'] = movie['production_countries'].apply(json.loads)
for index,i in zip(movie.index,movie['production_countries']):
    list3=[]
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index,'production_countries']=str(list3)
movie['spoken_languages'] = movie['spoken_languages'].apply(json.loads)
for index,i in zip(movie.index,movie['spoken_languages']):
    list3=[]
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index,'spoken_languages']=str(list3)
movie.head(1)
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 237000000 [‘Action’, ‘Adventure’, ‘Fantasy’, ‘Science Fi… http://www.avatarmovie.com/ 19995 [‘culture clash’, ‘future’, ‘space war’, ‘spac… en Avatar In the 22nd century, a paraplegic Marine is di… 150.437577 [‘Ingenious Film Partners’, ‘Twentieth Century… [‘United States of America’, ‘United Kingdom’] 2009-12-10 2787965087 162.0 [‘English’, ‘Español’] Released Enter the World of Pandora. Avatar 7.2 11800
credit.head(1)
movie_id title cast crew
0 19995 Avatar [{“cast_id”: 242, “character”: “Jake Sully”, “… [{“credit_id”: “52fe48009251416c750aca23”, “de…
credit['cast'] = credit['cast'].apply(json.loads)
for index,i in zip(credit.index,credit['cast']):
    list3=[]
    for j in range(len(i)):
        list3.append(i[j]['name'])
    credit.loc[index,'cast']=str(list3)
credit['crew'] = credit['crew'].apply(json.loads)
# Extract the director from crew and store it as a director column for later analysis
def director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
credit['crew']=credit['crew'].apply(director)
credit.rename(columns={'crew':'director'},inplace=True)
credit.head(1)
movie_id title cast director
0 19995 Avatar [‘Sam Worthington’, ‘Zoe Saldana’, ‘Sigourney … James Cameron
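Two properties of the director function above are easy to miss: it returns only the first crew member whose job is 'Director', and it implicitly returns None when there is none. The sketch below makes both explicit on toy crew lists; first_director and the crew entries other than James Cameron are hypothetical.

```python
def first_director(crew):
    """Return the name of the first crew member with job 'Director', else None."""
    for member in crew:
        if member.get('job') == 'Director':
            return member['name']
    return None  # no director listed for this film

# Toy crew lists in the same shape as the parsed crew column.
crew_with_director = [
    {'job': 'Editor', 'name': 'Editor A'},
    {'job': 'Director', 'name': 'James Cameron'},
    {'job': 'Director', 'name': 'Co-Director B'},
]
crew_without_director = [{'job': 'Editor', 'name': 'Editor A'}]

print(first_director(crew_with_director))
print(first_director(crew_without_director))
```

Rows where the result is None become missing values in the director column, so any later per-director grouping silently leaves them out.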

The id column in movie matches movie_id in credit, so the two tables can be merged to bring all the information into a single table.
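A minimal sketch of such a merge on toy tables follows. The second row is hypothetical, and the validate and indicator arguments plus the column cleanup are additions for illustration rather than steps taken in the original notebook.

```python
import pandas as pd

# Toy stand-ins for movie and credit (the second row is hypothetical).
movie_toy = pd.DataFrame({'id': [19995, 1], 'title': ['Avatar', 'Movie B']})
credit_toy = pd.DataFrame({
    'movie_id': [19995, 1],
    'title': ['Avatar', 'Movie B'],
    'director': ['James Cameron', 'Director B'],
})

merged = pd.merge(
    movie_toy, credit_toy,
    left_on='id', right_on='movie_id', how='left',
    validate='one_to_one',  # raise if either key column contains duplicates
    indicator=True,         # add a _merge column recording where each row matched
)

# Rows of movie_toy that found no partner in credit_toy.
unmatched = (merged['_merge'] != 'both').sum()

# Both tables carry a title column, so pandas suffixes them title_x and title_y;
# keep one copy and drop the redundant key.
merged = (merged.drop(columns=['movie_id', 'title_y', '_merge'])
                .rename(columns={'title_x': 'title'}))
print(merged)
print('unmatched rows:', unmatched)
```

With how='left' every row of the left table is kept, so a nonzero unmatched count would mean some films have no cast or director information after the merge.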

fulldf = pd.merge(movie,credit,left_on='id',right_on='movie_id',how='left')
fulldf.head(1)
budget genres homepage id keywords original_language original_title overview popularity production_companies spoken_languages status tagline title_x vote_average vote_count movie_id title_y cast director
0 237000000 [‘Ac