一、merge函数用途
pandas中的merge()函数类似于SQL中join的用法,可以将不同数据集依照某些字段(属性)进行合并操作,得到一个新的数据集。
二、merge()函数的具体参数
- 用法:
DataFrame1.merge(DataFrame2, how=‘inner’, on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=(’_x’, ‘_y’))
- 参数说明
| 参数 | 说明 |
| how | 默认为inner,可设为inner/outer/left/right |
| on | 根据某个字段进行连接,必须存在于两个DateFrame中(若未同时存在,则需要分别使用left_on和right_on来设置) |
| left_on | 左连接,以DataFrame1中用作连接键的列 |
| right_on | 右连接,以DataFrame2中用作连接键的列 |
| left_index | 将DataFrame1行索引用作连接键 |
| right_index | 将DataFrame2行索引用作连接键 |
| sort | 根据连接键对合并后的数据进行排列,默认为True |
| suffixes | 对两个数据集中出现的重复列,新数据集中加上后缀_x,_y进行区别 |
三、merge用法举例
- 创建两个df
#利用字典dict创建数据框 import pandas as pd df1 = pd.DataFrame({'lkey':['foo','bar','baz','foo'],'value':[1,2,3,4]}) df2 = pd.DataFrame({'rkey':['foo','bar','qux','bar'],'value':[5,6,7,8]}) print(df1) print(df2) lkey value 0 foo 1 1 bar 2 2 baz 3 3 foo 4 rkey value 0 foo 5 1 bar 6 2 qux 7 3 bar 8 - 内连接(Inner)
df1 = pd.DataFrame({'lkey':['foo','bar','baz','foo'],'value':[1,2,3,4]})
df2 = pd.DataFrame({'rkey':['foo','bar','qux','bar'],'value':[5,6,7,8]})
print(df1)
print(df2)
df3_leilianjie = df1.merge(df2,left_on='lkey',right_on='rkey')
print(df3_leilianjie)
lkey value
0 foo 1
1 bar 2
2 baz 3
3 foo 4
rkey value
0 foo 5
1 bar 6
2 qux 7
3 bar 8
lkey value_x rkey value_y
0 foo 1 foo 5
1 foo 4 foo 5
2 bar 2 bar 6
3 bar 2 bar 8
- 右链接(Right)
df1 = pd.DataFrame({'lkey':['foo','bar','baz','foo'],'value':[1,2,3,4]})
df2 = pd.DataFrame({'rkey':['foo','bar','qux','bar'],'value':[5,6,7,8]})
print(df1)
print(df2)
df4_youlianjie = df1.merge(df2,left_on='lkey',right_on='rkey',how='right')
print(df4_youlianjie)
lkey value
0 foo 1
1 bar 2
2 baz 3
3 foo 4
rkey value
0 foo 5
1 bar 6
2 qux 7
3 bar 8
lkey value_x rkey value_y
0 foo 1.0 foo 5
1 foo 4.0 foo 5
2 bar 2.0 bar 6
3 NaN NaN qux 7
4 bar 2.0 bar 8
- 全链接(Outer)
df1 = pd.DataFrame({'lkey':['foo','bar','baz','foo'],'value':[1,2,3,4]})
df2 = pd.DataFrame({'rkey':['foo','bar','qux','bar'],'value':[5,6,7,8]})
print(df1)
print(df2)
df5_quanlianjie = df1.merge(df2,left_on='lkey',right_on='rkey',how='outer')
print(df5_quanlianjie)
lkey value
0 foo 1
1 bar 2
2 baz 3
3 foo 4
rkey value
0 foo 5
1 bar 6
2 qux 7
3 bar 8
lkey value_x rkey value_y
0 foo 1.0 foo 5.0
1 foo 4.0 foo 5.0
2 bar 2.0 bar 6.0
3 bar 2.0 bar 8.0
4 baz 3.0 NaN NaN
5 NaN NaN qux 7.0
另一个实列
3张csv表如下:
ratings.csv

users.csv

movies.csv

分别读取数据,并查看df中的前3行
file_path_movie = r'C:\TELCEL_MEXICO_BOT\A\movie.csv'
file_path_user = r'C:\TELCEL_MEXICO_BOT\A\user.csv'
file_path_rating = r'C:\TELCEL_MEXICO_BOT\A\rating.csv'
df_movie = pd.read_csv(file_path_movie,encoding='utf-8')
df_user = pd.read_csv(file_path_user,encoding='utf-8')
df_rating = pd.read_csv(file_path_rating,encoding='utf-8')
# 分别查看df的前5行数据
print(df_movie.head(5))
print(df_user.head(5))
print(df_rating.head(5))
MovieId Title Genres
0 104 Nezha Comedy
1 120 danao Comedy
2 136 Feiyue Comedy
3 152 Yangren Fighting
4 168 Jinwumen Fighting
UserId Gender Age Ocupation Zip-code
0 张三 F 30 Doctor 2334539
1 张三 F 30 Doctor 2334539
2 张三 F 30 Doctor 2334539
3 张三 F 30 Doctor 2334539
4 李四 M 50 Engineer 1004039
UserId MovieId Rating Timestamp
0 张三 104 6 255
1 张三 120 7 294
2 张三 136 4 333
3 张三 152 7 372
4 李四 168 5 411
df_rating_user = pd.merge(df_rating,df_user,left_on='UserId',right_on='UserId', how='inner')
print(df_rating_user.head(10))
UserId MovieId Rating Timestamp Gender Age Ocupation Zip-code
0 张三 104 6 255 F 30 Doctor 2334539
1 张三 104 6 255 F 30 Doctor 2334539
2 张三 104 6 255 F 30 Doctor 2334539
3 张三 104 6 255 F 30 Doctor 2334539
4 张三 120 7 294 F 30 Doctor 2334539
5 张三 120 7 294 F 30 Doctor 2334539
6 张三 120 7 294 F 30 Doctor 2334539
7 张三 120 7 294 F 30 Doctor 2334539
8 张三 136 4 333 F 30 Doctor 2334539
9 张三 136 4 333 F 30 Doctor 2334539
# 两张表中,都以UserId进行join, how=inner表示这两边表中都有这个数据才会被保留,否则丢弃
df_rating_user_movie = pd.merge(df_rating_user,df_movie,left_on='MovieId',right_on='MovieId', how='inner')
print(df_rating_user_movie.head(20))
UserId MovieId Rating Timestamp ... Ocupation Zip-code Title Genres
0 张三 104 6 255 ... Doctor 2334539 Nezha Comedy
1 张三 104 6 255 ... Doctor 2334539 Nezha Comedy
2 张三 104 6 255 ... Doctor 2334539 Nezha Comedy
3 张三 104 6 255 ... Doctor 2334539 Nezha Comedy
4 张三 120 7 294 ... Doctor 2334539 danao Comedy
5 张三 120 7 294 ... Doctor 2334539 danao Comedy
6 张三 120 7 294 ... Doctor 2334539 danao Comedy
7 张三 120 7 294 ... Doctor 2334539 danao Comedy
8 张三 136 4 333 ... Doctor 2334539 Feiyue Comedy
9 张三 136 4 333 ... Doctor 2334539 Feiyue Comedy
10 张三 136 4 333 ... Doctor 2334539 Feiyue Comedy
11 张三 136 4 333 ... Doctor 2334539 Feiyue Comedy
12 张三 152 7 372 ... Doctor 2334539 Yangren Fighting
13 张三 152 7 372 ... Doctor 2334539 Yangren Fighting
14 张三 152 7 372 ... Doctor 2334539 Yangren Fighting
15 张三 152 7 372 ... Doctor 2334539 Yangren Fighting
16 李四 168 5 411 ... Engineer 1004039 Jinwumen Fighting
17 李四 168 5 411 ... Engineer 1004039 Jinwumen Fighting
18 李四 168 5 411 ... Engineer 1004039 Jinwumen Fighting
19 李四 168 5 411 ... Engineer 1004039 Jinwumen Fighting
[20 rows x 10 columns]


left = pd.DataFrame({"sno":[11,12,13,14],'name':['Aandy','Amy','Tome','Jack']})
right = pd.DataFrame({'sno':[11,12,13,14],'age':[21,22,23,24]})
df1 = pd.merge(left,right,on='sno')
print(left)
print(right)
print(df1)
sno name
0 11 Aandy
1 12 Amy
2 13 Tome
3 14 Jack
sno age
0 11 21
1 12 22
2 13 23
3 14 24
sno name age
0 11 Aandy 21
1 12 Amy 22
2 13 Tome 23
3 14 Jack 24
2.2 one - to - many 一对多关系的merge
注意,数据会被复制,数目以多的一边为准

left = pd.DataFrame({'sno':[11,12,13,14], '姓名':['张三','李四','王五','赵六']})
right = pd.DataFrame({'sno':[11,11,11,12,12,13],'age':['语文88','数学90','英语75','语文66','数学55','英语29']})
print(left)
print(right)
sno 姓名
0 11 张三
1 12 李四
2 13 王五
3 14 赵六
sno age
0 11 语文88
1 11 数学90
2 11 英语75
3 12 语文66
4 12 数学55
5 13 英语29
df2 = pd.merge(left,right,on='sno')
print(df2)
sno 姓名 age
0 11 张三 语文88
1 11 张三 数学90
2 11 张三 英语75
3 12 李四 语文66
4 12 李四 数学55
5 13 王五 英语29
2.3 many-to-many 多对多关系的merge
注意:结果数量会出现乘法

left = pd.DataFrame({"sno":[11,11,12,12,12],'爱好':['篮球','羽毛球','乒乓球','篮球','足球']})
right = pd.DataFrame({'sno':[11,11,11,12,12,13],'age':['语文88','数学90','英语75','语文66','数学55','英语29']})
df2 = pd.merge(left,right,on='sno')
print(left)
print(right)
print(df2)
sno 爱好
0 11 篮球
1 11 羽毛球
2 12 乒乓球
3 12 篮球
4 12 足球
sno age
0 11 语文88
1 11 数学90
2 11 英语75
3 12 语文66
4 12 数学55
5 13 英语29
sno 爱好 age
0 11 篮球 语文88
1 11 篮球 数学90
2 11 篮球 英语75
3 11 羽毛球 语文88
4 11 羽毛球 数学90
5 11 羽毛球 英语75
6 12 乒乓球 语文66
7 12 乒乓球 数学55
8 12 篮球 语文66
9 12 篮球 数学55
10 12 足球 语文66
11 12 足球 数学55
3. 理解left join, right join,inner join, outer join的区别

left = pd.DataFrame({'key':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K4','K5'],'C':['C0','C1','C4','C5'],'D':['D0','D1','D4','D5']})
print(left)
print(right)
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
key C D
0 K0 C0 D0
1 K1 C1 D1
2 K4 C4 D4
3 K5 C5 D5
# 3.1 inner join, 默认
# 左边和右边的key都有,才会出现在结果里
df3 = pd.merge(left,right,how='inner')
print(df3)
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
left = pd.DataFrame({'key':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K4','K5'],'C':['C0','C1','C4','C5'],'D':['D0','D1','D4','D5']})
print(left)
print(right)
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
key C D
0 K0 C0 D0
1 K1 C1 D1
2 K4 C4 D4
3 K5 C5 D5
# 3.2 left join
# 左边的都会出现在结果里,右边的如果无法匹配则为NaN
df4 = pd.merge(left,right,how='left')
print(df4)
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 NaN NaN
3 K3 A3 B3 NaN NaN
left = pd.DataFrame({'key':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K4','K5'],'C':['C0','C1','C4','C5'],'D':['D0','D1','D4','D5']})
print(left)
print(right)
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
key C D
0 K0 C0 D0
1 K1 C1 D1
2 K4 C4 D4
3 K5 C5 D5
#右边的都会出现在结果里,左边的如果无法匹配则为NaN
df5 = pd.merge(left,right,how='right')
print(df5)
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K4 NaN NaN C4 D4
3 K5 NaN NaN C5 D5
left = pd.DataFrame({'key':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K4','K5'],'C':['C0','C1','C4','C5'],'D':['D0','D1','D4','D5']})
print(left)
print(right)
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
key C D
0 K0 C0 D0
1 K1 C1 D1
2 K4 C4 D4
3 K5 C5 D5
# 左边,右边的都会出现在结果里,如果无法匹配则为NaN
df6 = pd.merge(left,right,how='outer')
print(df6)
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 NaN NaN
3 K3 A3 B3 NaN NaN
4 K4 NaN NaN C4 D4
5 K5 NaN NaN C5 D5
4. 如果出现非key的字段重名
left = pd.DataFrame({'key':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K4','K5'],'A':['A10','A11','A12','A13'],'D':['D0','D1','D4','D5']})
print(left)
print(right)
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
key A D
0 K0 A10 D0
1 K1 A11 D1
2 K4 A12 D4
3 K5 A13 D5
df7 = pd.merge(left,right,on='key')
print(df7)
key A_x B A_y D
0 K0 A0 B0 A10 D0
1 K1 A1 B1 A11 D1
left = pd.DataFrame({'key':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K4','K5'],'A':['A10','A11','A12','A13'],'D':['D0','D1','D4','D5']})
print(left)
print(right)
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
key A D
0 K0 A10 D0
1 K1 A11 D1
2 K4 A12 D4
3 K5 A13 D5
df8 = pd.merge(left,right,on='key',suffixes=('_left','_right')) #加了个元祖,换成指定后缀
print(df8)
key A_left B A_right D
0 K0 A0 B0 A10 D0
1 K1 A1 B1 A11 D1
1万+

被折叠的 条评论
为什么被折叠?



