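0. Theory
K-nearest neighbors (KNN) classifies a new point by finding its k nearest neighbors on the scatter plot and letting them vote: the point is assigned the class that accounts for the largest share of those k nearest observations.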
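The voting rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used later in this post: the function name `knn_classify`, the Euclidean distance metric, and the toy points are all hypothetical.

```python
import math
from collections import Counter

def knn_classify(points, labels, query, k):
    """Return the majority label among the k points closest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [(math.dist(p, query), label) for p, label in zip(points, labels)]
    # Keep only the k nearest neighbors.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Each neighbor casts one vote for its own label.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two clusters on a scatter plot.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(points, labels, (2.0, 2.0), k=3))
print(knn_classify(points, labels, (6.0, 6.5), k=3))
```

The sketch does not handle tied votes specially; choosing an odd k for a two-class problem is one common way to avoid them.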
1. Predicting movie ratings with KNN
import pandas as pd
import numpy as np
# u.data is tab-separated with no header row; read only the first three columns.
r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('E:/python/python数据科学与机器学习/《Python数据科学与机器学习:从入门到实践》源代码/ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3))
ratings.head()
  | user_id | movie_id | rating |
0 |       0 |       50 |      5 |
1 |       0 |      172 |      5 |
2 |       0 |      133 |      1 |
3 |     196 |      242 |      3 |
4 |     186 |      302 |      3 |
# For each movie, compute the number of ratings ('size') and the average rating ('mean').
movieProperties = ratings.groupby('movie_id')['rating'].agg(['size', 'mean']).reset_index()
movieProperties.head()
  | movie_id | size |     mean |
0 |        1 |  452 | 3.878319 |
1 |        2 |  131 | 3.206107 |
2 |        3 |   90 | 3.033333 |
3 |        4 |  209 | 3.550239 |
4 |        5 |   86 | 3.302326 |
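To see what the `groupby(...).agg(['size', 'mean'])` step produces, it can help to run it on a tiny frame where the answer is easy to check by hand. The sketch below uses made-up ratings (the `toy` and `toy_props` names and values are hypothetical and not taken from u.data).

```python
import pandas as pd

# Hypothetical toy ratings, not taken from u.data.
toy = pd.DataFrame({
    "movie_id": [1, 1, 1, 2, 2],
    "rating":   [5, 4, 3, 2, 4],
})

# 'size' counts the ratings per movie; 'mean' averages them.
toy_props = toy.groupby("movie_id")["rating"].agg(["size", "mean"]).reset_index()
print(toy_props)
```

Movie 1 has three ratings averaging 4.0 and movie 2 has two ratings averaging 3.0, which matches the one-row-per-movie layout of the table above.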
movieProperties['size'] = (movieProperties['size'