Pandas的merge语法

一、merge函数用途

pandas中的merge()函数类似于SQL中join的用法,可以将不同数据集依照某些字段(属性)进行合并操作,得到一个新的数据集。

二、merge()函数的具体参数

  • 用法:

DataFrame1.merge(DataFrame2, how=‘inner’, on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=(’_x’, ‘_y’))

  • 参数说明
参数说明
how默认为inner,可设为inner/outer/left/right
on根据某个字段进行连接,必须存在于两个DateFrame中(若未同时存在,则需要分别使用left_on和right_on来设置)
left_on左连接,以DataFrame1中用作连接键的列
right_on右连接,以DataFrame2中用作连接键的列
left_index将DataFrame1行索引用作连接键
right_index将DataFrame2行索引用作连接键
sort根据连接键对合并后的数据进行排列,默认为True
suffixes对两个数据集中出现的重复列,新数据集中加上后缀_x,_y进行区别

三、merge用法举例

  • 创建两个df
    #利用字典dict创建数据框
    import pandas as pd
    
    df1 = pd.DataFrame({'lkey':['foo','bar','baz','foo'],'value':[1,2,3,4]})
    df2 = pd.DataFrame({'rkey':['foo','bar','qux','bar'],'value':[5,6,7,8]})
    print(df1)
    print(df2)
    
      lkey  value
    0  foo      1
    1  bar      2
    2  baz      3
    3  foo      4
      rkey  value
    0  foo      5
    1  bar      6
    2  qux      7
    3  bar      8

  • 内连接(Inner)
df1 = pd.DataFrame({'lkey':['foo','bar','baz','foo'],'value':[1,2,3,4]})
df2 = pd.DataFrame({'rkey':['foo','bar','qux','bar'],'value':[5,6,7,8]})
print(df1)
print(df2)

df3_leilianjie = df1.merge(df2,left_on='lkey',right_on='rkey')
print(df3_leilianjie)

  lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      4

  rkey  value
0  foo      5
1  bar      6
2  qux      7
3  bar      8

  lkey  value_x rkey  value_y
0  foo        1  foo        5
1  foo        4  foo        5
2  bar        2  bar        6
3  bar        2  bar        8
  • 右链接(Right)
df1 = pd.DataFrame({'lkey':['foo','bar','baz','foo'],'value':[1,2,3,4]})
df2 = pd.DataFrame({'rkey':['foo','bar','qux','bar'],'value':[5,6,7,8]})
print(df1)
print(df2)

df4_youlianjie = df1.merge(df2,left_on='lkey',right_on='rkey',how='right')
print(df4_youlianjie)

  lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      4
  rkey  value
0  foo      5
1  bar      6
2  qux      7
3  bar      8
  lkey  value_x rkey  value_y
0  foo      1.0  foo        5
1  foo      4.0  foo        5
2  bar      2.0  bar        6
3  NaN      NaN  qux        7
4  bar      2.0  bar        8
  • 全链接(Outer)
df1 = pd.DataFrame({'lkey':['foo','bar','baz','foo'],'value':[1,2,3,4]})
df2 = pd.DataFrame({'rkey':['foo','bar','qux','bar'],'value':[5,6,7,8]})
print(df1)
print(df2)

df5_quanlianjie = df1.merge(df2,left_on='lkey',right_on='rkey',how='outer')
print(df5_quanlianjie)

  lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      4
  rkey  value
0  foo      5
1  bar      6
2  qux      7
3  bar      8
  lkey  value_x rkey  value_y
0  foo      1.0  foo      5.0
1  foo      4.0  foo      5.0
2  bar      2.0  bar      6.0
3  bar      2.0  bar      8.0
4  baz      3.0  NaN      NaN
5  NaN      NaN  qux      7.0

另一个实列

3张csv表如下:

ratings.csv

users.csv

movies.csv

分别读取数据,并查看df中的前3行

file_path_movie = r'C:\TELCEL_MEXICO_BOT\A\movie.csv'
file_path_user = r'C:\TELCEL_MEXICO_BOT\A\user.csv'
file_path_rating = r'C:\TELCEL_MEXICO_BOT\A\rating.csv'
df_movie = pd.read_csv(file_path_movie,encoding='utf-8')
df_user = pd.read_csv(file_path_user,encoding='utf-8')
df_rating = pd.read_csv(file_path_rating,encoding='utf-8')
# 分别查看df的前5行数据
print(df_movie.head(5))
print(df_user.head(5))
print(df_rating.head(5))

   MovieId     Title    Genres
0      104     Nezha    Comedy
1      120     danao    Comedy
2      136    Feiyue    Comedy
3      152   Yangren  Fighting
4      168  Jinwumen  Fighting
  UserId Gender  Age Ocupation  Zip-code
0     张三      F   30    Doctor   2334539
1     张三      F   30    Doctor   2334539
2     张三      F   30    Doctor   2334539
3     张三      F   30    Doctor   2334539
4     李四      M   50  Engineer   1004039
  UserId  MovieId  Rating  Timestamp
0     张三      104       6        255
1     张三      120       7        294
2     张三      136       4        333
3     张三      152       7        372
4     李四      168       5        411
df_rating_user = pd.merge(df_rating,df_user,left_on='UserId',right_on='UserId', how='inner')
print(df_rating_user.head(10))

  UserId  MovieId  Rating  Timestamp Gender  Age Ocupation  Zip-code
0     张三      104       6        255      F   30    Doctor   2334539
1     张三      104       6        255      F   30    Doctor   2334539
2     张三      104       6        255      F   30    Doctor   2334539
3     张三      104       6        255      F   30    Doctor   2334539
4     张三      120       7        294      F   30    Doctor   2334539
5     张三      120       7        294      F   30    Doctor   2334539
6     张三      120       7        294      F   30    Doctor   2334539
7     张三      120       7        294      F   30    Doctor   2334539
8     张三      136       4        333      F   30    Doctor   2334539
9     张三      136       4        333      F   30    Doctor   2334539

# 两张表中,都以UserId进行join, how=inner表示这两边表中都有这个数据才会被保留,否则丢弃

df_rating_user_movie = pd.merge(df_rating_user,df_movie,left_on='MovieId',right_on='MovieId', how='inner')
print(df_rating_user_movie.head(20))

   UserId  MovieId  Rating  Timestamp  ... Ocupation  Zip-code     Title    Genres
0      张三      104       6        255  ...    Doctor   2334539     Nezha    Comedy
1      张三      104       6        255  ...    Doctor   2334539     Nezha    Comedy
2      张三      104       6        255  ...    Doctor   2334539     Nezha    Comedy
3      张三      104       6        255  ...    Doctor   2334539     Nezha    Comedy
4      张三      120       7        294  ...    Doctor   2334539     danao    Comedy
5      张三      120       7        294  ...    Doctor   2334539     danao    Comedy
6      张三      120       7        294  ...    Doctor   2334539     danao    Comedy
7      张三      120       7        294  ...    Doctor   2334539     danao    Comedy
8      张三      136       4        333  ...    Doctor   2334539    Feiyue    Comedy
9      张三      136       4        333  ...    Doctor   2334539    Feiyue    Comedy
10     张三      136       4        333  ...    Doctor   2334539    Feiyue    Comedy
11     张三      136       4        333  ...    Doctor   2334539    Feiyue    Comedy
12     张三      152       7        372  ...    Doctor   2334539   Yangren  Fighting
13     张三      152       7        372  ...    Doctor   2334539   Yangren  Fighting
14     张三      152       7        372  ...    Doctor   2334539   Yangren  Fighting
15     张三      152       7        372  ...    Doctor   2334539   Yangren  Fighting
16     李四      168       5        411  ...  Engineer   1004039  Jinwumen  Fighting
17     李四      168       5        411  ...  Engineer   1004039  Jinwumen  Fighting
18     李四      168       5        411  ...  Engineer   1004039  Jinwumen  Fighting
19     李四      168       5        411  ...  Engineer   1004039  Jinwumen  Fighting

[20 rows x 10 columns]

left = pd.DataFrame({"sno":[11,12,13,14],'name':['Aandy','Amy','Tome','Jack']})
right = pd.DataFrame({'sno':[11,12,13,14],'age':[21,22,23,24]})

df1 = pd.merge(left,right,on='sno')
print(left)
print(right)
print(df1)

   sno   name
0   11  Aandy
1   12    Amy
2   13   Tome
3   14   Jack
   sno  age
0   11   21
1   12   22
2   13   23
3   14   24
   sno   name  age
0   11  Aandy   21
1   12    Amy   22
2   13   Tome   23
3   14   Jack   24

2.2 one - to - many 一对多关系的merge

注意,数据会被复制,数目以多的一边为准

left = pd.DataFrame({'sno':[11,12,13,14], '姓名':['张三','李四','王五','赵六']})
right = pd.DataFrame({'sno':[11,11,11,12,12,13],'age':['语文88','数学90','英语75','语文66','数学55','英语29']})
print(left)
print(right)

   sno  姓名
0   11  张三
1   12  李四
2   13  王五
3   14  赵六
   sno   age
0   11  语文88
1   11  数学90
2   11  英语75
3   12  语文66
4   12  数学55
5   13  英语29

df2 = pd.merge(left,right,on='sno')
print(df2)

   sno  姓名   age
0   11  张三  语文88
1   11  张三  数学90
2   11  张三  英语75
3   12  李四  语文66
4   12  李四  数学55
5   13  王五  英语29

2.3  many-to-many 多对多关系的merge

注意:结果数量会出现乘法

left = pd.DataFrame({"sno":[11,11,12,12,12],'爱好':['篮球','羽毛球','乒乓球','篮球','足球']})
right = pd.DataFrame({'sno':[11,11,11,12,12,13],'age':['语文88','数学90','英语75','语文66','数学55','英语29']})
df2 = pd.merge(left,right,on='sno')
print(left)
print(right)
print(df2)

   sno   爱好
0   11   篮球
1   11  羽毛球
2   12  乒乓球
3   12   篮球
4   12   足球
   sno   age
0   11  语文88
1   11  数学90
2   11  英语75
3   12  语文66
4   12  数学55
5   13  英语29
    sno   爱好   age
0    11   篮球  语文88
1    11   篮球  数学90
2    11   篮球  英语75
3    11  羽毛球  语文88
4    11  羽毛球  数学90
5    11  羽毛球  英语75
6    12  乒乓球  语文66
7    12  乒乓球  数学55
8    12   篮球  语文66
9    12   篮球  数学55
10   12   足球  语文66
11   12   足球  数学55

3. 理解left join, right join,inner join, outer join的区别

left = pd.DataFrame({'key':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K4','K5'],'C':['C0','C1','C4','C5'],'D':['D0','D1','D4','D5']})

print(left)
print(right)

  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3
  key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K4  C4  D4
3  K5  C5  D5

# 3.1 inner join, 默认
# 左边和右边的key都有,才会出现在结果里

df3 = pd.merge(left,right,how='inner')
print(df3)
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
left = pd.DataFrame({'key':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K4','K5'],'C':['C0','C1','C4','C5'],'D':['D0','D1','D4','D5']})

print(left)
print(right)

  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3
  key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K4  C4  D4
3  K5  C5  D5

# 3.2 left join  
# 左边的都会出现在结果里,右边的如果无法匹配则为NaN
df4 = pd.merge(left,right,how='left')
print(df4)

  key   A   B    C    D
0  K0  A0  B0   C0   D0
1  K1  A1  B1   C1   D1
2  K2  A2  B2  NaN  NaN
3  K3  A3  B3  NaN  NaN

left = pd.DataFrame({'key':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K4','K5'],'C':['C0','C1','C4','C5'],'D':['D0','D1','D4','D5']})

print(left)
print(right)

  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3
  key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K4  C4  D4
3  K5  C5  D5

#右边的都会出现在结果里,左边的如果无法匹配则为NaN
df5 = pd.merge(left,right,how='right')
print(df5)

  key    A    B   C   D
0  K0   A0   B0  C0  D0
1  K1   A1   B1  C1  D1
2  K4  NaN  NaN  C4  D4
3  K5  NaN  NaN  C5  D5

left = pd.DataFrame({'key':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K4','K5'],'C':['C0','C1','C4','C5'],'D':['D0','D1','D4','D5']})

print(left)
print(right)

  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3
  key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K4  C4  D4
3  K5  C5  D5

# 左边,右边的都会出现在结果里,如果无法匹配则为NaN
df6 = pd.merge(left,right,how='outer')
print(df6)

  key    A    B    C    D
0  K0   A0   B0   C0   D0
1  K1   A1   B1   C1   D1
2  K2   A2   B2  NaN  NaN
3  K3   A3   B3  NaN  NaN
4  K4  NaN  NaN   C4   D4
5  K5  NaN  NaN   C5   D5

4. 如果出现非key的字段重名

left = pd.DataFrame({'key':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K4','K5'],'A':['A10','A11','A12','A13'],'D':['D0','D1','D4','D5']})

print(left)
print(right)

  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3
  key    A   D
0  K0  A10  D0
1  K1  A11  D1
2  K4  A12  D4
3  K5  A13  D5

df7 = pd.merge(left,right,on='key')
print(df7)

  key A_x   B  A_y   D
0  K0  A0  B0  A10  D0
1  K1  A1  B1  A11  D1
left = pd.DataFrame({'key':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K4','K5'],'A':['A10','A11','A12','A13'],'D':['D0','D1','D4','D5']})

print(left)
print(right)

  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3
  key    A   D
0  K0  A10  D0
1  K1  A11  D1
2  K4  A12  D4
3  K5  A13  D5

df8 = pd.merge(left,right,on='key',suffixes=('_left','_right'))  #加了个元祖,换成指定后缀
print(df8)

  key A_left   B A_right   D
0  K0     A0  B0     A10  D0
1  K1     A1  B1     A11  D1

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值