Python for Data Analysis: Analyzing the MovieLens 1M Dataset

若云流风

License: CC 4.0 BY-SA

Categories: Machine Learning, Python for Data Analysis. Tags: python, numpy, pandas, data analysis

Article link: https://blog.youkuaiyun.com/ruoyunliufeng/article/details/78232141

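This post analyzes the MovieLens 1M movie-rating dataset to show how to combine data from separate sources and process it with pandas. It covers reading and setting up the data, merging the tables, computing average movie ratings, and comparing ratings by gender for actively rated movies.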

Data setup

    In [26]:
  

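import pandas as pd
import os

# The MovieLens .dat files are read as latin1 rather than UTF-8
encoding = 'latin1'

# Paths are relative to the working directory (expanduser only expands a leading '~')
upath = os.path.expanduser('ch02/movielens/users.dat')
rpath = os.path.expanduser('ch02/movielens/ratings.dat')
mpath = os.path.expanduser('ch02/movielens/movies.dat')

# The files have no header row, so column names are supplied explicitly
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
mnames = ['movie_id', 'title', 'genres']

# '::' is a multi-character separator, which only the Python parsing engine supports;
# without engine='python' pandas falls back to it and emits a ParserWarning
users = pd.read_csv(upath, sep='::', header=None, names=unames,
                    encoding=encoding, engine='python')
ratings = pd.read_csv(rpath, sep='::', header=None, names=rnames,
                      encoding=encoding, engine='python')
movies = pd.read_csv(mpath, sep='::', header=None, names=mnames,
                     encoding=encoding, engine='python')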
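The reading pattern can also be tried without the dataset files. The following is a minimal sketch, not part of the original notebook: an in-memory buffer (`io.StringIO`) stands in for `users.dat`, the two sample rows are illustrative, and `dtype={'zip': str}` is an added assumption that keeps leading zeros in ZIP codes even when every value in the sample looks numeric.

```python
import io
import pandas as pd

# Two illustrative rows in the '::'-separated layout of users.dat
sample = "1::F::1::10::48067\n4::M::45::7::02460\n"

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# engine='python' handles the multi-character separator;
# dtype={'zip': str} prevents '02460' from being parsed as the integer 2460
users_sample = pd.read_csv(io.StringIO(sample), sep='::', header=None,
                           names=unames, engine='python', dtype={'zip': str})
print(users_sample)
```

In the full file some ZIP codes are not purely numeric, which is why the column shows up as text in the output of this post even without the explicit `dtype`.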

    In [6]:
  

users[:5]

      Out[6]:
    

	user_id	gender	age	occupation	zip
0	1	F	1	10	48067
1	2	M	56	16	70072
2	3	M	25	15	55117
3	4	M	45	7	02460
4	5	M	25	20	55455

    In [7]:
  

ratings[:5]

      Out[7]:
    

	user_id	movie_id	rating	timestamp
0	1	1193	5	978300760
1	1	661	3	978302109
2	1	914	3	978301968
3	1	3408	4	978300275
4	1	2355	5	978824291

    In [8]:
  

movies[:5]

      Out[8]:
    

	movie_id	title	genres
0	1	Toy Story (1995)	Animation|Children's|Comedy
1	2	Jumanji (1995)	Adventure|Children's|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama
4	5	Father of the Bride Part II (1995)	Comedy

    In [9]:
  

ratings

      Out[9]:
    

	user_id	movie_id	rating	timestamp
0	1	1193	5	978300760
1	1	661	3	978302109
2	1	914	3	978301968
3	1	3408	4	978300275
4	1	2355	5	978824291
5	1	1197	3	978302268
6	1	1287	5	978302039
7	1	2804	5	978300719
8	1	594	4	978302268
9	1	919	4	978301368
10	1	595	5	978824268
11	1	938	4	978301752
12	1	2398	4	978302281
13	1	2918	4	978302124
14	1	1035	5	978301753
15	1	2791	4	978302188
16	1	2687	3	978824268
17	1	2018	4	978301777
18	1	3105	5	978301713
19	1	2797	4	978302039
20	1	2321	3	978302205
21	1	720	3	978300760
22	1	1270	5	978300055
23	1	527	5	978824195
24	1	2340	3	978300103
25	1	48	5	978824351
26	1	1097	4	978301953
27	1	1721	4	978300055
28	1	1545	4	978824139
29	1	745	3	978824268
...	...	...	...	...
1000179	6040	2762	4	956704584
1000180	6040	1036	3	956715455
1000181	6040	508	4	956704972
1000182	6040	1041	4	957717678
1000183	6040	3735	4	960971654
1000184	6040	2791	4	956715569
1000185	6040	2794	1	956716438
1000186	6040	527	5	956704219
1000187	6040	2003	1	956716294
1000188	6040	535	4	964828734
1000189	6040	2010	5	957716795
1000190	6040	2011	4	956716113
1000191	6040	3751	4	964828782
1000192	6040	2019	5	956703977
1000193	6040	541	4	956715288
1000194	6040	1077	5	964828799
1000195	6040	1079	2	956715648
1000196	6040	549	4	956704746
1000197	6040	2020	3	956715288
1000198	6040	2021	3	956716374
1000199	6040	2022	5	956716207
1000200	6040	2028	5	956704519
1000201	6040	1080	4	957717322
1000202	6040	1089	4	956704996
1000203	6040	1090	3	956715518
1000204	6040	1091	1	956716541
1000205	6040	1094	5	956704887
1000206	6040	562	5	956704746
1000207	6040	1096	4	956715648
1000208	6040	1097	4	956715569

1000209 rows × 4 columns

Merging the data

    In [10]:
  

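# pd.merge joins on the column names the two tables share:
# user_id for ratings/users, then movie_id for the result and movies
data = pd.merge(pd.merge(ratings, users), movies)
data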

      Out[10]:
    

	user_id	movie_id	rating	timestamp	gender	age	occupation	zip	title	genres
0	1	1193	5	978300760	F	1	10	48067	One Flew Over the Cuckoo's Nest (1975)	Drama
1	2	1193	5	978298413	M	56	16	70072	One Flew Over the Cuckoo's Nest (1975)	Drama
2	12	1193	4	978220179	M	25	12	32793	One Flew Over the Cuckoo's Nest (1975)	Drama
3	15	1193	4	978199279	M	25	7	22903	One Flew Over the Cuckoo's Nest (1975)	Drama
4	17	1193	5	978158471	M	50	1	95350	One Flew Over the Cuckoo's Nest (1975)	Drama
5	18	1193	4	978156168	F	18	3	95825	One Flew Over the Cuckoo's Nest (1975)	Drama
6	19	1193	5	982730936	M	1	10	48073	One Flew Over the Cuckoo's Nest (1975)	Drama
7	24	1193	5	978136709	F	25	7	10023	One Flew Over the Cuckoo's Nest (1975)	Drama
8	28	1193	3	978125194	F	25	1	14607	One Flew Over the Cuckoo's Nest (1975)	Drama
9	33	1193	5	978557765	M	45	3	55421	One Flew Over the Cuckoo's Nest (1975)	Drama
10	39	1193	5	978043535	M	18	4	61820	One Flew Over the Cuckoo's Nest (1975)	Drama
11	42	1193	3	978038981	M	25	8	24502	One Flew Over the Cuckoo's Nest (1975)	Drama
12	44	1193	4	978018995	M	45	17	98052	One Flew Over the Cuckoo's Nest (1975)	Drama
13	47	1193	4	977978345	M	18	4	94305	One Flew Over the Cuckoo's Nest (1975)	Drama
14	48	1193	4	977975061	M	25	4	92107	One Flew Over the Cuckoo's Nest (1975)	Drama
15	49	1193	4	978813972	M	18	12	77084	One Flew Over the Cuckoo's Nest (1975)	Drama
16	53	1193	5	977946400	M	25	0	96931	One Flew Over the Cuckoo's Nest (1975)	Drama
17	54	1193	5	977944039	M	50	1	56723	One Flew Over the Cuckoo's Nest (1975)	Drama
18	58	1193	5	977933866	M	25	2	30303	One Flew Over the Cuckoo's Nest (1975)	Drama
19	59	1193	4	977934292	F	50	1	55413	One Flew Over the Cuckoo's Nest (1975)	Drama
20	62	1193	4	977968584	F	35	3	98105	One Flew Over the Cuckoo's Nest (1975)	Drama
21	80	1193	4	977786172	M	56	1	49327	One Flew Over the Cuckoo's Nest (1975)	Drama
22	81	1193	5	977785864	F	25	0	60640	One Flew Over the Cuckoo's Nest (1975)	Drama
23	88	1193	5	977694161	F	45	1	02476	One Flew Over the Cuckoo's Nest (1975)	Drama
24	89	1193	5	977683596	F	56	9	85749	One Flew Over the Cuckoo's Nest (1975)	Drama
25	95	1193	5	977626632	M	45	0	98201	One Flew Over the Cuckoo's Nest (1975)	Drama
26	96	1193	3	977621789	F	25	16	78028	One Flew Over the Cuckoo's Nest (1975)	Drama
27	99	1193	2	982791053	F	1	10	19390	One Flew Over the Cuckoo's Nest (1975)	Drama
28	102	1193	5	1040737607	M	35	19	20871	One Flew Over the Cuckoo's Nest (1975)	Drama
29	104	1193	2	977546620	M	25	12	00926	One Flew Over the Cuckoo's Nest (1975)	Drama
...	...	...	...	...	...	...	...	...	...	...
1000179	4933	3084	3	962757020	M	25	15	94040	Home Page (1999)	Documentary
1000180	4802	2218	2	1014866656	M	56	1	40601	Juno and Paycock (1930)	Drama
1000181	4812	2308	2	962932391	M	18	14	25301	Detroit 9000 (1973)	Action|Crime
1000182	4874	624	4	962781918	F	25	4	70808	Condition Red (1995)	Action|Drama|Thriller
1000183	5059	1434	4	962484364	M	45	16	22652	Stranger, The (1994)	Action
1000184	5947	1434	4	957190428	F	45	16	97215	Stranger, The (1994)	Action
1000185	5077	1868	3	962417299	M	25	2	20037	Truce, The (1996)	Drama|War
1000186	5944	1868	1	957197520	F	18	10	27606	Truce, The (1996)	Drama|War
1000187	5105	404	3	962337582	M	50	7	18977	Brother Minister: The Assassination of Malcolm...	Documentary
1000188	5185	404	4	963402617	F	35	4	44485	Brother Minister: The Assassination of Malcolm...	Documentary
1000189	5532	404	5	959619841	M	25	17	27408	Brother Minister: The Assassination of Malcolm...	Documentary
1000190	5543	404	3	960127592	M	25	17	97401	Brother Minister: The Assassination of Malcolm...	Documentary
1000191	5220	2543	3	961546137	M	25	7	91436	Six Ways to Sunday (1997)	Comedy
1000192	5754	2543	4	958272316	F	18	1	60640	Six Ways to Sunday (1997)	Comedy
1000193	5227	591	3	961475931	M	18	10	64050	Tough and Deadly (1995)	Action|Drama|Thriller
1000194	5795	591	1	958145253	M	25	1	92688	Tough and Deadly (1995)	Action|Drama|Thriller
1000195	5313	3656	5	960920392	M	56	0	55406	Lured (1947)	Crime
1000196	5328	2438	4	960838075	F	25	4	91740	Outside Ozona (1998)	Drama|Thriller
1000197	5334	3323	3	960796159	F	56	13	46140	Chain of Fools (2000)	Comedy|Crime
1000198	5334	127	1	960795494	F	56	13	46140	Silence of the Palace, The (Saimt el Qusur) (1...	Drama
1000199	5334	3382	5	960796159	F	56	13	46140	Song of Freedom (1936)	Drama
1000200	5420	1843	3	960156505	F	1	19	14850	Slappy and the Stinkers (1998)	Children's|Comedy
1000201	5433	286	3	960240881	F	35	17	45014	Nemesis 2: Nebula (1995)	Action|Sci-Fi|Thriller
1000202	5494	3530	4	959816296	F	35	17	94306	Smoking/No Smoking (1993)	Comedy
1000203	5556	2198	3	959445515	M	45	6	92103	Modulations (1998)	Documentary
1000204	5949	2198	5	958846401	M	18	17	47901	Modulations (1998)	Documentary
1000205	5675	2703	3	976029116	M	35	14	30030	Broken Vessels (1998)	Drama
1000206	5780	2845	1	958153068	M	18	17	92886	White Boys (1999)	Drama
1000207	5851	3607	5	957756608	F	18	20	55410	One Little Indian (1973)	Comedy|Drama|Western
1000208	5938	2909	4	957273353	M	25	1	35401	Five Wives, Three Secretaries and Me (1998)	Documentary

1000209 rows × 10 columns
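The nested `pd.merge` call relies on pandas inferring the join keys from overlapping column names. A more explicit version can make the intent easier to check. The sketch below uses tiny made-up tables (the `_s` names and their rows are illustrative, not real records) and names the keys with `on=`. The `validate='many_to_one'` argument is an optional safety check added here as an assumption; it raises an error if a key is duplicated on the right-hand side.

```python
import pandas as pd

# Tiny illustrative stand-ins for the ratings, users and movies tables
ratings_s = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [10, 20, 10],
                          'rating': [5, 3, 4]})
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_s = pd.DataFrame({'movie_id': [10, 20], 'title': ['Movie A', 'Movie B']})

# Step 1: attach user attributes to every rating (many ratings per user)
step1 = pd.merge(ratings_s, users_s, on='user_id', how='inner',
                 validate='many_to_one')

# Step 2: attach movie attributes (many ratings per movie)
merged = pd.merge(step1, movies_s, on='movie_id', how='inner',
                  validate='many_to_one')
print(merged)
```

Each rating row is kept exactly once, which matches the full result having the same 1000209 rows as `ratings`.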

    In [11]:
  

data.iloc[0]

      Out[11]:
    

user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object
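Older pandas code often selects rows with `.ix`, which was deprecated and later removed in favour of `.loc` (label based) and `.iloc` (position based). On the merged table the two give the same first row because the index is the default 0, 1, 2, ... sequence. The small sketch below, with a made-up frame whose labels differ from positions, shows where they diverge.

```python
import pandas as pd

# Illustrative frame whose index labels (10, 20, 30) differ from positions (0, 1, 2)
frame = pd.DataFrame({'rating': [5, 3, 4]}, index=[10, 20, 30])

first_by_position = frame.iloc[0]   # first row, regardless of its label
row_by_label = frame.loc[20]        # row whose index label is 20

print(first_by_position['rating'], row_by_label['rating'])
```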

Computing average movie ratings

    In [34]:
  

# Mean rating per title, with one column per gender (F and M)
mean_ratings = data.pivot_table(values='rating', index='title',
                                columns='gender', aggfunc='mean')

    In [38]:
  

mean_ratings[:5]

      Out[38]:
    

gender	F	M
title
$1,000,000 Duck (1971)	3.375000	2.761905
'Night Mother (1986)	3.388889	3.352941
'Til There Was You (1997)	2.675676	2.733333
'burbs, The (1989)	2.793478	2.962085
...And Justice for All (1979)	3.828571	3.689024
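To see what `pivot_table` computes here, it may help to compare it with the equivalent group-and-reshape steps. The sketch below uses a made-up five-row table (titles 'A' and 'B' are placeholders) and shows that grouping by title and gender, averaging, and unstacking the gender level gives the same table.

```python
import pandas as pd

# Illustrative ratings: two titles rated by female (F) and male (M) users
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'M', 'M', 'F', 'M'],
    'rating': [4, 2, 4, 5, 3],
})

# One row per title, one column per gender, cells hold the mean rating
by_pivot = toy.pivot_table(values='rating', index='title',
                           columns='gender', aggfunc='mean')

# Same result built step by step: group, average, then move gender into columns
by_groupby = toy.groupby(['title', 'gender'])['rating'].mean().unstack('gender')

print(by_pivot)
```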

    In [39]:
  

ratings_by_title = data.groupby('title').size()  # number of ratings per title

    In [40]:
  

ratings_by_title[:5]

      Out[40]:
    

title
$1,000,000 Duck (1971)            37
'Night Mother (1986)              70
'Til There Was You (1997)         52
'burbs, The (1989)               303
...And Justice for All (1979)    199
dtype: int64

    In [41]:
  

active_titles = ratings_by_title.index[ratings_by_title >= 250]  # titles with at least 250 ratings

    In [42]:
  

active_titles[:10]

      Out[42]:
    

Index([u"'burbs, The (1989)", u'10 Things I Hate About You (1999)',
       u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
       u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
       u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)',
       u'2001: A Space Odyssey (1968)', u'2010 (1984)'],
      dtype='object', name=u'title')

    In [43]:
  

mean_ratings = mean_ratings.loc[active_titles]  # keep only the actively rated titles
mean_ratings

      Out[43]:
    

gender	F	M
title
'burbs, The (1989)	2.793478	2.962085
10 Things I Hate About You (1999)	3.646552	3.311966
101 Dalmatians (1961)	3.791444	3.500000
101 Dalmatians (1996)	3.240000	2.911215
12 Angry Men (1957)	4.184397	4.328421
13th Warrior, The (1999)	3.112000	3.168000
2 Days in the Valley (1996)	3.488889	3.244813
20,000 Leagues Under the Sea (1954)	3.670103	3.709205
2001: A Space Odyssey (1968)	3.825581	4.129738
2010 (1984)	3.446809	3.413712
28 Days (2000)	3.209424	2.977707
39 Steps, The (1935)	3.965517	4.107692
54 (1998)	2.701754	2.782178
7th Voyage of Sinbad, The (1958)	3.409091	3.658879
8MM (1999)	2.906250	2.850962
About Last Night... (1986)	3.188679	3.140909
Absent Minded Professor, The (1961)	3.469388	3.446809
Absolute Power (1997)	3.469136	3.327759
Abyss, The (1989)	3.659236	3.689507
Ace Ventura: Pet Detective (1994)	3.000000	3.197917
Ace Ventura: When Nature Calls (1995)	2.269663	2.543333
Addams Family Values (1993)	3.000000	2.878531
Addams Family, The (1991)	3.186170	3.163498
Adventures in Babysitting (1987)	3.455782	3.208122
Adventures of Buckaroo Bonzai Across the 8th Dimension, The (1984)	3.308511	3.402321
Adventures of Priscilla, Queen of the Desert, The (1994)	3.989071	3.688811
Adventures of Robin Hood, The (1938)	4.166667	3.918367
African Queen, The (1951)	4.324232	4.223822
Age of Innocence, The (1993)	3.827068	3.339506
Agnes of God (1985)	3.534884	3.244898
...	...	...
White Men Can't Jump (1992)	3.028777	3.231061
Who Framed Roger Rabbit? (1988)	3.569378	3.713251
Who's Afraid of Virginia Woolf? (1966)	4.029703	4.096939
Whole Nine Yards, The (2000)	3.296552	3.404814
Wild Bunch, The (1969)	3.636364	4.128099
Wild Things (1998)	3.392000	3.459082
Wild Wild West (1999)	2.275449	2.131973
William Shakespeare's Romeo and Juliet (1996)	3.532609	3.318644
Willow (1988)	3.658683	3.453543
Willy Wonka and the Chocolate Factory (1971)	4.063953	3.789474
Witness (1985)	4.115854	3.941504
Wizard of Oz, The (1939)	4.355030	4.203138
Wolf (1994)	3.074074	2.899083
Women on the Verge of a Nervous Breakdown (1988)	3.934307	3.865741
Wonder Boys (2000)	4.043796	3.913649
Working Girl (1988)	3.606742	3.312500
World Is Not Enough, The (1999)	3.337500	3.388889
Wrong Trousers, The (1993)	4.588235	4.478261
Wyatt Earp (1994)	3.147059	3.283898
X-Files: Fight the Future, The (1998)	3.489474	3.493797
X-Men (2000)	3.682310	3.851702
Year of Living Dangerously (1982)	3.951220	3.869403
Yellow Submarine (1968)	3.714286	3.689286
You've Got Mail (1998)	3.542424	3.275591
Young Frankenstein (1974)	4.289963	4.239177
Young Guns (1988)	3.371795	3.425620
Young Guns II (1990)	2.934783	2.904025
Young Sherlock Holmes (1985)	3.514706	3.363344
Zero Effect (1998)	3.864407	3.723140
eXistenZ (1999)	3.098592	3.289086

1216 rows × 2 columns
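The count-then-filter steps can be checked end to end on a small table. In this sketch the data is made up and `min_ratings = 2` is a hypothetical threshold standing in for the 250 used in this post.

```python
import pandas as pd

# Illustrative ratings for three titles with 3, 1 and 2 ratings respectively
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'C', 'C'],
    'gender': ['F', 'M', 'M', 'F', 'F', 'M'],
    'rating': [4, 2, 4, 5, 3, 1],
})

min_ratings = 2  # hypothetical threshold; the post uses 250

counts = toy.groupby('title').size()            # ratings per title
active = counts.index[counts >= min_ratings]    # titles that pass the threshold

means = toy.pivot_table(values='rating', index='title',
                        columns='gender', aggfunc='mean')
active_means = means.loc[active]                # label-based row selection

print(active_means)
```

Title 'B' has a single rating, so it is dropped, while 'A' and 'C' remain.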

    In [44]:
  

mean_ratings = mean_ratings.rename(index={'Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)':
                           'Seven Samurai (Shichinin no samurai) (1954)'})

    In [45]:
  

top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)  # movies rated highest by female viewers
top_female_ratings[:10]

      Out[45]:
    

gender	F	M
title
Close Shave, A (1995)	4.644444	4.473795
Wrong Trousers, The (1993)	4.588235	4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)	4.572650	4.464589
Wallace & Gromit: The Best of Aardman Animation (1996)	4.563107	4.385075
Schindler's List (1993)	4.562602	4.491415
Shawshank Redemption, The (1994)	4.539075	4.560625
Grand Day Out, A (1992)	4.537879	4.293255
To Kill a Mockingbird (1962)	4.536667	4.372611
Creature Comforts (1990)	4.513889	4.272277
Usual Suspects, The (1995)	4.513317	4.518248
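One way to quantify the gender difference mentioned in the summary is to subtract the two columns and sort by the result. The sketch below is an illustration rather than the author's code: it reuses three rows from the table above, and the column name `diff` is an assumption.

```python
import pandas as pd

# Three rows copied from the top-10 table above
means = pd.DataFrame(
    {'F': [4.644444, 4.588235, 4.539075],
     'M': [4.473795, 4.478261, 4.560625]},
    index=pd.Index(['Close Shave, A (1995)',
                    'Wrong Trousers, The (1993)',
                    'Shawshank Redemption, The (1994)'], name='title'),
)

# Positive values mean men rated the movie higher, negative values mean women did
means['diff'] = means['M'] - means['F']

# Ascending order puts the movies most preferred by women first
sorted_by_diff = means.sort_values(by='diff')
print(sorted_by_diff)
```

Reversing the order with `ascending=False` would list the movies preferred by men first.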