1.explode函数
将一行数据展开成为多行
user_id movie_ids ratings
0 1 [[3186, 1721, 1270, 1022, 2340, 1836, 3408, 12... [[4, 4, 5, 5, 3, 5, 4, 4], [4, 5, 5, 3, 5, 4, ...
1 10 [[597, 858, 743, 1210, 1948, 2312, 3751, 1282]... [[4, 3, 3, 4, 4, 5, 5, 5], [3, 3, 4, 4, 5, 5, ...
2 100 [[260, 1676, 1198, 541, 1210, 3948, 3536, 2567... [[4, 3, 4, 3, 4, 3, 1, 1], [3, 4, 3, 4, 3, 1, ...
3 1000 [[971, 260, 2990, 2973, 1210, 3068, 3153, 1198... [[4, 5, 4, 3, 5, 5, 2, 5], [5, 4, 3, 5, 5, 2, ...
4 1001 [[1198, 1617, 2885, 3909, 3555, 1479, 3903, 39... [[4, 4, 4, 2, 2, 1, 4, 5], [4, 4, 2, 2, 1, 4, ...
... ... ... ...
6035 995 [[1894, 260, 247, 433, 170, 74, 912, 3097], [2... [[2, 4, 5, 3, 3, 4, 4, 4], [4, 5, 3, 3, 4, 4, ...
6036 996 [[1347, 2146, 1961, 2741, 1210, 527, 1196, 121... [[4, 3, 5, 3, 5, 5, 5, 5], [3, 5, 3, 5, 5, 5, ...
6037 997 [[1196, 2082, 3247, 2447, 2633, 2028, 593, 318... [[4, 3, 3, 3, 2, 5, 5, 5], [3, 3, 3, 2, 5, 5, ...
6038 998 [[2266, 1264, 1097, 1641, 805, 1388, 1968, 375... [[3, 4, 5, 5, 4, 3, 4, 3], [4, 5, 5, 4, 3, 4, ...
6039 999 [[1196, 1015, 1600, 1097, 1432, 1221, 1198, 12... [[5, 3, 1, 4, 2, 4, 4, 5], [3, 1, 4, 2, 4, 4, ...
上述数据种movie_ids由多个数组组成,如下
[['3186', '1721', '1270', '1022', '2340', '1836', '3408', '1207'],
['1721', '1270', '1022', '2340', '1836', '3408', '1207', '2804'],
['1270', '1022', '2340', '1836', '3408', '1207', '2804', '260'],
['1022', '2340'.......
使用explode函数将其movie_ids展开为多行
ratings_data_movies = ratings_data[["user_id", "movie_ids"]].explode("movie_ids", ignore_index=True)
得到如下输出:
user_id movie_ids
0 1 [3186, 1721, 1270, 1022, 2340, 1836, 3408, 1207]
1 1 [1721, 1270, 1022, 2340, 1836, 3408, 1207, 2804]
2 1 [1270, 1022, 2340, 1836, 3408, 1207, 2804, 260]
3 1 [1022, 2340, 1836, 3408, 1207, 2804, 260, 720]
4 1 [2340, 1836, 3408, 1207, 2804, 260, 720, 1193]
... ... ...
963964 999 [1959, 2676, 2540, 1363, 765, 3565, 1410, 2269]
963965 999 [2676, 2540, 1363, 765, 3565, 1410, 2269, 2504]
963966 999 [2540, 1363, 765, 3565, 1410, 2269, 2504, 455]
963967 999 [1363, 765, 3565, 1410, 2269, 2504, 455, 193]
963968 999 [1363, 765, 3565, 1410, 2269, 2504, 455, 193]
2.concat函数
拼接两个dataframe数据
ratings_data_transformed = pd.concat([ratings_data_movies, ratings_data_rating], axis=1)
axis=1按行拼接,axis=0按列拼接
将上面的输出和ratings数据按行拼接得
user_id movie_ids ratings
0 1 [3186, 1721, 1270, 1022, 2340, 1836, 3408, 1207] [4, 4, 5, 5, 3, 5, 4, 4]
1 1 [1721, 1270, 1022, 2340, 1836, 3408, 1207, 2804] [4, 5, 5, 3, 5, 4, 4, 5]
2 1 [1270, 1022, 2340, 1836, 3408, 1207, 2804, 260] [5, 5, 3, 5, 4, 4, 5, 4]
3 1 [1022, 2340, 1836, 3408, 1207, 2804, 260, 720] [5, 3, 5, 4, 4, 5, 4, 3]
4 1 [2340, 1836, 3408, 1207, 2804, 260, 720, 1193] [3, 5, 4, 4, 5, 4, 3, 5]
... ... ... ...
963964 999 [1959, 2676, 2540, 1363, 765, 3565, 1410, 2269] [1, 3, 2, 3, 3, 4, 2, 3]
963965 999 [2676, 2540, 1363, 765, 3565, 1410, 2269, 2504] [3, 2, 3, 3, 4, 2, 3, 3]
963966 999 [2540, 1363, 765, 3565, 1410, 2269, 2504, 455] [2, 3, 3, 4, 2, 3, 3, 2]
963967 999 [1363, 765, 3565, 1410, 2269, 2504, 455, 193] [3, 3, 4, 2, 3, 3, 2, 2]
963968 999 [1363, 765, 3565, 1410, 2269, 2504, 455, 193] [3, 3, 4, 2, 3, 3, 2, 2]
3.join函数
通过索引或者指定的列连接两个DataFrame。
DataFrame.join(other, on=None, how=’left’, lsuffix=”, rsuffix=”, sort=False)
通过on参数指定连接的列,使用users的索引去连接ratings_data_transformed,把指定的列设置为users的索引,然后用on去指定ratings_data_transformed的连接列
ratings_data_transformed1 = ratings_data_transformed.join(users.set_index("user_id"), on="user_id")
ratings_data_transformed如下:
user_id movie_ids ratings
0 1 [3186, 1721, 1270, 1022, 2340, 1836, 3408, 1207] [4, 4, 5, 5, 3, 5, 4, 4]
1 1 [1721, 1270, 1022, 2340, 1836, 3408, 1207, 2804] [4, 5, 5, 3, 5, 4, 4, 5]
2 1 [1270, 1022, 2340, 1836, 3408, 1207, 2804, 260] [5, 5, 3, 5, 4, 4, 5, 4]
3 1 [1022, 2340, 1836, 3408, 1207, 2804, 260, 720] [5, 3, 5, 4, 4, 5, 4, 3]
4 1 [2340, 1836, 3408, 1207, 2804, 260, 720, 1193] [3, 5, 4, 4, 5, 4, 3, 5]
... ... ... ...
963964 999 [1959, 2676, 2540, 1363, 765, 3565, 1410, 2269] [1, 3, 2, 3, 3, 4, 2, 3]
963965 999 [2676, 2540, 1363, 765, 3565, 1410, 2269, 2504] [3, 2, 3, 3, 4, 2, 3, 3]
963966 999 [2540, 1363, 765, 3565, 1410, 2269, 2504, 455] [2, 3, 3, 4, 2, 3, 3, 2]
963967 999 [1363, 765, 3565, 1410, 2269, 2504, 455, 193] [3, 3, 4, 2, 3, 3, 2, 2]
963968 999 [1363, 765, 3565, 1410, 2269, 2504, 455, 193] [3, 3, 4, 2, 3, 3, 2, 2]
users如下:
users:
user_id sex age_group occupation zip_code
0 1 F 1 10 48067
1 2 M 56 16 70072
2 3 M 25 15 55117
3 4 M 45 7 02460
4 5 M 25 20 55455
... ... .. ... ... ...
6035 6036 F 25 15 32603
6036 6037 F 45 1 76006
6037 6038 F 56 1 14706
6038 6039 F 45 0 01060
6039 6040 M 25 6 11106
结果如下:
user_id movie_ids ratings sex age_group occupation zip_code
0 1 [3186, 1721, 1270, 1022, 2340, 1836, 3408, 1207] [4, 4, 5, 5, 3, 5, 4, 4] F 1 10 48067
1 1 [1721, 1270, 1022, 2340, 1836, 3408, 1207, 2804] [4, 5, 5, 3, 5, 4, 4, 5] F 1 10 48067
2 1 [1270, 1022, 2340, 1836, 3408, 1207, 2804, 260] [5, 5, 3, 5, 4, 4, 5, 4] F 1 10 48067
3 1 [1022, 2340, 1836, 3408, 1207, 2804, 260, 720] [5, 3, 5, 4, 4, 5, 4, 3] F 1 10 48067
4 1 [2340, 1836, 3408, 1207, 2804, 260, 720, 1193] [3, 5, 4, 4, 5, 4, 3, 5] F 1 10 48067
... ... ... ... .. ... ... ...
963964 999 [1959, 2676, 2540, 1363, 765, 3565, 1410, 2269] [1, 3, 2, 3, 3, 4, 2, 3] M 25 15 62558
963965 999 [2676, 2540, 1363, 765, 3565, 1410, 2269, 2504] [3, 2, 3, 3, 4, 2, 3, 3] M 25 15 62558
963966 999 [2540, 1363, 765, 3565, 1410, 2269, 2504, 455] [2, 3, 3, 4, 2, 3, 3, 2] M 25 15 62558
963967 999 [1363, 765, 3565, 1410, 2269, 2504, 455, 193] [3, 3, 4, 2, 3, 3, 2, 2] M 25 15 62558
963968 999 [1363, 765, 3565, 1410, 2269, 2504, 455, 193] [3, 3, 4, 2, 3, 3, 2, 2] M 25 15 62558
4.sort_values/ groupby函数
ratings_group = ratings.sort_values(by=["unix_timestamp"]).groupby("user_id")
按照userid分组并使用时间(unix_timestap)排序
ratings:
user_id movie_id rating unix_timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
... ... ... ... ...
1000204 6040 1091 1 956716541
1000205 6040 1094 5 956704887
1000206 6040 562 5 956704746
1000207 6040 1096 4 956715648
1000208 6040 1097 4 956715569
返回一个对象
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f88b2a8fe10>
转为list取出
print(ratings_group.apply(list))
user_id
1 [user_id, movie_id, rating, unix_timestamp]
10 [user_id, movie_id, rating, unix_timestamp]
100 [user_id, movie_id, rating, unix_timestamp]
1000 [user_id, movie_id, rating, unix_timestamp]
1001 [user_id, movie_id, rating, unix_timestamp]
...
995 [user_id, movie_id, rating, unix_timestamp]
996 [user_id, movie_id, rating, unix_timestamp]
997 [user_id, movie_id, rating, unix_timestamp]
998 [user_id, movie_id, rating, unix_timestamp]
999 [user_id, movie_id, rating, unix_timestamp]
该对象可以用key,movie_id,rating等取出
如print(ratings_group.movie_id.apply(list))
user_id
1 [3186, 1721, 1270, 1022, 2340, 1836, 3408, 120...
10 [597, 858, 743, 1210, 1948, 2312, 3751, 1282, ...
100 [260, 1676, 1198, 541, 1210, 3948, 3536, 2567,...
1000 [971, 260, 2990, 2973, 1210, 3068, 3153, 1198,...
1001 [1198, 1617, 2885, 3909, 3555, 1479, 3903, 394...
...
995 [1894, 260, 247, 433, 170, 74, 912, 3097, 1265...
996 [1347, 2146, 1961, 2741, 1210, 527, 1196, 1213...
997 [1196, 2082, 3247, 2447, 2633, 2028, 593, 318,...
998 [2266, 1264, 1097, 1641, 805, 1388, 1968, 3751...
999 [1196, 1015, 1600, 1097, 1432, 1221, 1198, 121...
取出了以userid分组的所有的movie_id信息