pandas dataframe的合并

1 准备工作

创建3个DataFrame:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df1 = pd.DataFrame(np.ones((4, 4)) * 1, columns=list('ABCD'), index=list('1234'))

In [4]: df2 = pd.DataFrame(np.ones((4, 4)) * 2, columns=list('ABEF'), index=list('3456'))

In [5]: df3 = pd.DataFrame(np.ones((4, 4)) * 3, columns=list('CDEF'), index=list('1256'))

In [6]: df1
Out[6]: 
     A    B    C    D
1  1.0  1.0  1.0  1.0
2  1.0  1.0  1.0  1.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0

In [7]: df2
Out[7]: 
     A    B    E    F
3  2.0  2.0  2.0  2.0
4  2.0  2.0  2.0  2.0
5  2.0  2.0  2.0  2.0
6  2.0  2.0  2.0  2.0

In [8]: df3
Out[8]: 
     C    D    E    F
1  3.0  3.0  3.0  3.0
2  3.0  3.0  3.0  3.0
5  3.0  3.0  3.0  3.0
6  3.0  3.0  3.0  3.0

2 concat 合并

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)

示例:

In [9]: pd.concat([df1, df2])
Out[9]: 
     A    B    C    D    E    F
1  1.0  1.0  1.0  1.0  NaN  NaN
2  1.0  1.0  1.0  1.0  NaN  NaN
3  1.0  1.0  1.0  1.0  NaN  NaN
4  1.0  1.0  1.0  1.0  NaN  NaN
3  2.0  2.0  NaN  NaN  2.0  2.0
4  2.0  2.0  NaN  NaN  2.0  2.0
5  2.0  2.0  NaN  NaN  2.0  2.0
6  2.0  2.0  NaN  NaN  2.0  2.0

2.1 axis

默认值:axis=0

axis=0:竖方向(index)合并,合并方向index作列表相加,非合并方向columns取并集

axis=1:横方向(columns)合并,合并方向columns作列表相加,非合并方向index取并集

axis=0:

In [10]: pd.concat([df1, df2], axis=0)
Out[10]: 
     A    B    C    D    E    F
1  1.0  1.0  1.0  1.0  NaN  NaN
2  1.0  1.0  1.0  1.0  NaN  NaN
3  1.0  1.0  1.0  1.0  NaN  NaN
4  1.0  1.0  1.0  1.0  NaN  NaN
3  2.0  2.0  NaN  NaN  2.0  2.0
4  2.0  2.0  NaN  NaN  2.0  2.0
5  2.0  2.0  NaN  NaN  2.0  2.0
6  2.0  2.0  NaN  NaN  2.0  2.0

axis=1:

In [11]: pd.concat([df1, df2], axis=1)
Out[11]: 
     A    B    C    D    A    B    E    F
1  1.0  1.0  1.0  1.0  NaN  NaN  NaN  NaN
2  1.0  1.0  1.0  1.0  NaN  NaN  NaN  NaN
3  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
4  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
5  NaN  NaN  NaN  NaN  2.0  2.0  2.0  2.0
6  NaN  NaN  NaN  NaN  2.0  2.0  2.0  2.0

注意:原df中,取并集的行/列名称不能有重复项,即axis=0时columns不能有重复项,axis=1时index不能有重复项。

In [12]: df1.columns = list('ABCC')

In [13]: pd.concat([df1, df2], axis=0)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-86fa31c57748> in <module>
----> 1 pd.concat([df1, df2], axis=0)

d:\python\python36\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, lev
els, names, verify_integrity, copy)
    205                        verify_integrity=verify_integrity,
    206                        copy=copy)
--> 207     return op.get_result()
    208
    209

d:\python\python36\lib\site-packages\pandas\core\reshape\concat.py in get_result(self)
    405             new_data = concatenate_block_managers(
    406                 mgrs_indexers, self.new_axes, concat_axis=self.axis,
--> 407                 copy=self.copy)
    408             if not self.copy:
    409                 new_data._consolidate_inplace()

d:\python\python36\lib\site-packages\pandas\core\internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, cop
y)
   4830     blocks = [make_block(
   4831         concatenate_join_units(join_units, concat_axis, copy=copy),
-> 4832         placement=placement) for placement, join_units in concat_plan]
   4833
   4834     return BlockManager(blocks, axes)

d:\python\python36\lib\site-packages\pandas\core\internals.py in <listcomp>(.0)
   4828          for mgr, indexers in mgrs_indexers], concat_axis)
   4829
-> 4830     blocks = [make_block(
   4831         concatenate_join_units(join_units, concat_axis, copy=copy),
   4832         placement=placement) for placement, join_units in concat_plan]

d:\python\python36\lib\site-packages\pandas\core\internals.py in combine_concat_plans(plans, concat_axis)
   5065         while num_ended[0] != len(next_items):
   5066             if num_ended[0] > 0:
-> 5067                 raise ValueError("Plan shapes are not aligned")
   5068
   5069             placements, units = zip(*next_items)

ValueError: Plan shapes are not aligned

2.2 join

默认值:join='outer'

非合并方向的行/列名称:取交集(inner),取并集(outer)。

axis=0时,join='inner',columns取交集:

In [15]: pd.concat([df1, df2], axis=0, join='inner')
Out[15]: 
     A    B
1  1.0  1.0
2  1.0  1.0
3  1.0  1.0
4  1.0  1.0
3  2.0  2.0
4  2.0  2.0
5  2.0  2.0
6  2.0  2.0

axis=1时join='inner',index取交集:

In [16]: pd.concat([df1, df2], axis=1, join='inner')
Out[16]: 
     A    B    C    D    A    B    E    F
3  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
4  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0

2.3 join_axes 

默认值:join_axes=None,取并集

合并后,可以设置非合并方向的行/列名称,使用某个df的行/列名称

axis=0时,join_axes=[df1.columns],合并后columns使用df1的:

In [17]: pd.concat([df1, df2], axis=0, join_axes=[df1.columns])
Out[17]: 
     A    B    C    D
1  1.0  1.0  1.0  1.0
2  1.0  1.0  1.0  1.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0
3  2.0  2.0  NaN  NaN
4  2.0  2.0  NaN  NaN
5  2.0  2.0  NaN  NaN
6  2.0  2.0  NaN  NaN

axis=1时,axes=[df1.index],合并后index使用df2的:

In [18]: pd.concat([df1, df2], axis=1, join_axes=[df1.index])
Out[18]: 
     A    B    C    D    A    B    E    F
1  1.0  1.0  1.0  1.0  NaN  NaN  NaN  NaN
2  1.0  1.0  1.0  1.0  NaN  NaN  NaN  NaN
3  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
4  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0

同时设置join和join_axes,以join_axes为准(相当于只设置了join_axes,没有设置join):

In [19]: pd.concat([df1, df2], axis=0, join='inner', join_axes=[df1.columns])
Out[19]: 
     A    B    C    D
1  1.0  1.0  1.0  1.0
2  1.0  1.0  1.0  1.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0
3  2.0  2.0  NaN  NaN
4  2.0  2.0  NaN  NaN
5  2.0  2.0  NaN  NaN
6  2.0  2.0  NaN  NaN

2.4 ignore_index

默认值:ignore_index=False

合并方向是否忽略原行/列名称,而采用系统默认的索引,即从0开始的int。

axis=0时ignore_index=True,index采用系统默认索引:

In [20]: pd.concat([df1, df2], axis=0, ignore_index=True)
Out[20]: 
     A    B    C    D    E    F
0  1.0  1.0  1.0  1.0  NaN  NaN
1  1.0  1.0  1.0  1.0  NaN  NaN
2  1.0  1.0  1.0  1.0  NaN  NaN
3  1.0  1.0  1.0  1.0  NaN  NaN
4  2.0  2.0  NaN  NaN  2.0  2.0
5  2.0  2.0  NaN  NaN  2.0  2.0
6  2.0  2.0  NaN  NaN  2.0  2.0
7  2.0  2.0  NaN  NaN  2.0  2.0

 axis=1时ignore_index=True,columns采用系统默认索引:

In [21]: pd.concat([df1, df2], axis=1, ignore_index=True)
Out[21]: 
     0    1    2    3    4    5    6    7
1  1.0  1.0  1.0  1.0  NaN  NaN  NaN  NaN
2  1.0  1.0  1.0  1.0  NaN  NaN  NaN  NaN
3  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
4  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
5  NaN  NaN  NaN  NaN  2.0  2.0  2.0  2.0
6  NaN  NaN  NaN  NaN  2.0  2.0  2.0  2.0

2.5 keys

默认值:keys=None

可以加一层标签,标识行/列名称属于原来哪个df。

axis=0时设置keys:

In [22]: pd.concat([df1, df2],  axis=0, keys=['x', 'y'])
Out[22]: 
       A    B    C    D    E    F
x 1  1.0  1.0  1.0  1.0  NaN  NaN
  2  1.0  1.0  1.0  1.0  NaN  NaN
  3  1.0  1.0  1.0  1.0  NaN  NaN
  4  1.0  1.0  1.0  1.0  NaN  NaN
y 3  2.0  2.0  NaN  NaN  2.0  2.0
  4  2.0  2.0  NaN  NaN  2.0  2.0
  5  2.0  2.0  NaN  NaN  2.0  2.0
  6  2.0  2.0  NaN  NaN  2.0  2.0

 axis=1时设置keys:

In [23]: pd.concat([df1, df2], axis=1, keys=['x', 'y'])
Out[23]: 
     x                   y
     A    B    C    D    A    B    E    F
1  1.0  1.0  1.0  1.0  NaN  NaN  NaN  NaN
2  1.0  1.0  1.0  1.0  NaN  NaN  NaN  NaN
3  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
4  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
5  NaN  NaN  NaN  NaN  2.0  2.0  2.0  2.0
6  NaN  NaN  NaN  NaN  2.0  2.0  2.0  2.0

也可以传字典取代keys:

In [24]: pd.concat({'x': df1, 'y': df2}, axis=0)
Out[24]: 
       A    B    C    D    E    F
x 1  1.0  1.0  1.0  1.0  NaN  NaN
  2  1.0  1.0  1.0  1.0  NaN  NaN
  3  1.0  1.0  1.0  1.0  NaN  NaN
  4  1.0  1.0  1.0  1.0  NaN  NaN
y 3  2.0  2.0  NaN  NaN  2.0  2.0
  4  2.0  2.0  NaN  NaN  2.0  2.0
  5  2.0  2.0  NaN  NaN  2.0  2.0
  6  2.0  2.0  NaN  NaN  2.0  2.0

2.6 levels

默认值:levels=None

明确行/列名称取值范围

In [27]: df = pd.concat([df1, df2], axis=0, keys=['x', 'y'], levels=[['x', 'y', 'z', 'w']])

In [28]: df
Out[28]: 
       A    B    C    D    E    F
x 1  1.0  1.0  1.0  1.0  NaN  NaN
  2  1.0  1.0  1.0  1.0  NaN  NaN
  3  1.0  1.0  1.0  1.0  NaN  NaN
  4  1.0  1.0  1.0  1.0  NaN  NaN
y 3  2.0  2.0  NaN  NaN  2.0  2.0
  4  2.0  2.0  NaN  NaN  2.0  2.0
  5  2.0  2.0  NaN  NaN  2.0  2.0
  6  2.0  2.0  NaN  NaN  2.0  2.0

In [29]: df.index.levels
Out[29]: FrozenList([['x', 'y', 'z', 'w'], ['1', '2', '3', '4', '5', '6']])

2.7 concat多个DataFrame

In [39]: pd.concat([df1, df2, df3], join_axes=[df1.columns])
Out[39]: 
     A    B    C    D
1  1.0  1.0  1.0  1.0
2  1.0  1.0  1.0  1.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0
3  2.0  2.0  NaN  NaN
4  2.0  2.0  NaN  NaN
5  2.0  2.0  NaN  NaN
6  2.0  2.0  NaN  NaN
1  NaN  NaN  3.0  3.0
2  NaN  NaN  3.0  3.0
5  NaN  NaN  3.0  3.0
6  NaN  NaN  3.0  3.0

3 append

append(self, other, ignore_index=False, verify_integrity=False)

竖方向合并df,没有axis属性,不会就地修改,而是会创建副本

示例:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df1 = pd.DataFrame(np.ones((4, 4)) * 1, columns=list('ABCD'), index=list('1234'))

In [4]: df2 = pd.DataFrame(np.ones((4, 4)) * 2, columns=list('ABEF'), index=list('3456'))

In [5]: df3 = pd.DataFrame(np.ones((4, 4)) * 3, columns=list('CDEF'), index=list('1256'))

In [6]: df1
Out[6]: 
     A    B    C    D
1  1.0  1.0  1.0  1.0
2  1.0  1.0  1.0  1.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0

In [7]: df2
Out[7]: 
     A    B    E    F
3  2.0  2.0  2.0  2.0
4  2.0  2.0  2.0  2.0
5  2.0  2.0  2.0  2.0
6  2.0  2.0  2.0  2.0

In [8]: df3
Out[8]: 
     C    D    E    F
1  3.0  3.0  3.0  3.0
2  3.0  3.0  3.0  3.0
5  3.0  3.0  3.0  3.0
6  3.0  3.0  3.0  3.0

In [9]: df1.append(df2)    # 相当于pd.concat([df1, df2])
Out[9]: 
     A    B    C    D    E    F
1  1.0  1.0  1.0  1.0  NaN  NaN
2  1.0  1.0  1.0  1.0  NaN  NaN
3  1.0  1.0  1.0  1.0  NaN  NaN
4  1.0  1.0  1.0  1.0  NaN  NaN
3  2.0  2.0  NaN  NaN  2.0  2.0
4  2.0  2.0  NaN  NaN  2.0  2.0
5  2.0  2.0  NaN  NaN  2.0  2.0
6  2.0  2.0  NaN  NaN  2.0  2.0

3.1 ignore_index属性

默认值:ignore_index=False

合并方向是否忽略原行名称,而采用系统默认的索引,即从0开始的int。

In [10]: df1.append(df2, ignore_index=True)
Out[10]: 
     A    B    C    D    E    F
0  1.0  1.0  1.0  1.0  NaN  NaN
1  1.0  1.0  1.0  1.0  NaN  NaN
2  1.0  1.0  1.0  1.0  NaN  NaN
3  1.0  1.0  1.0  1.0  NaN  NaN
4  2.0  2.0  NaN  NaN  2.0  2.0
5  2.0  2.0  NaN  NaN  2.0  2.0
6  2.0  2.0  NaN  NaN  2.0  2.0
7  2.0  2.0  NaN  NaN  2.0  2.0

3.2 append多个DataFrame

和concat相同,append也支持append多个DataFrame

In [11]: df1.append([df2, df3], ignore_index=True)
Out[11]: 
      A    B    C    D    E    F
0   1.0  1.0  1.0  1.0  NaN  NaN
1   1.0  1.0  1.0  1.0  NaN  NaN
2   1.0  1.0  1.0  1.0  NaN  NaN
3   1.0  1.0  1.0  1.0  NaN  NaN
4   2.0  2.0  NaN  NaN  2.0  2.0
5   2.0  2.0  NaN  NaN  2.0  2.0
6   2.0  2.0  NaN  NaN  2.0  2.0
7   2.0  2.0  NaN  NaN  2.0  2.0
8   NaN  NaN  3.0  3.0  3.0  3.0
9   NaN  NaN  3.0  3.0  3.0  3.0
10  NaN  NaN  3.0  3.0  3.0  3.0
11  NaN  NaN  3.0  3.0  3.0  3.0

4 merge

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

示例:

In [12]: import pandas as pd

In [13]: left = pd.DataFrame({'A': ['a0', 'a1', 'a2', 'a3'],
    ...:                      'B': ['b0', 'b1', 'b2', 'b3'],
    ...:                      'k1': ['x', 'x', 'y', 'y']})

In [14]: right = pd.DataFrame({'C': ['c1', 'c2', 'c3', 'c4'],
    ...:                       'D': ['d1', 'd2', 'd3', 'd4'],
    ...:                       'k1': ['y', 'y', 'z', 'z']})

In [15]: left
Out[15]: 
    A   B k1
0  a0  b0  x
1  a1  b1  x
2  a2  b2  y
3  a3  b3  y

In [16]: right
Out[16]: 
    C   D k1
0  c1  d1  y
1  c2  d2  y
2  c3  d3  z
3  c4  d4  z

对 left 和 right 进行merge:

In [17]: pd.merge(left, right)
Out[17]: 
    A   B k1   C   D
0  a2  b2  y  c1  d1
1  a2  b2  y  c2  d2
2  a3  b3  y  c1  d1
3  a3  b3  y  c2  d2

可以看到只有 left 和 right 的key1=y的行保留了下来,即默认合并后只保留有共同列项并且值相等行(即交集)。

本例中 left 和 right 的k1=y分别有2个,最终构成了2*2=4行。

如果没有共同列会报错:

In [18]: del left['k1']

In [19]: pd.merge(left, right)
---------------------------------------------------------------------------
MergeError                                Traceback (most recent call last)
<ipython-input-19-6264bf127cc8> in <module>
----> 1 pd.merge(left, right)

d:\python\python36\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, r
ight_index, sort, suffixes, copy, indicator)
     51                          right_on=right_on, left_index=left_index,
     52                          right_index=right_index, sort=sort, suffixes=suffixes,
---> 53                          copy=copy, indicator=indicator)
     54     return op.get_result()
     55

d:\python\python36\lib\site-packages\pandas\core\reshape\merge.py in __init__(self, left, right, how, on, left_on, right_on, axis
, left_index, right_index, sort, suffixes, copy, indicator)
    551             warnings.warn(msg, UserWarning)
    552
--> 553         self._validate_specification()
    554
    555         # note this function has side effects

d:\python\python36\lib\site-packages\pandas\core\reshape\merge.py in _validate_specification(self)
    926                     self.right.columns)
    927                 if len(common_cols) == 0:
--> 928                     raise MergeError('No common columns to perform merge on')
    929                 if not common_cols.is_unique:
    930                     raise MergeError("Data columns not unique: %s"

MergeError: No common columns to perform merge on

4.1 on属性

新增一个共同列,但没有相等的值,发现合并返回是空列表,因为默认只保留所有共同列都相等的行:

In [21]: left['k2'] = list('1234')

In [22]: right['k2'] = list('5678')

In [23]: pd.merge(left, right)
Out[23]: 
Empty DataFrame
Columns: [A, B, k1, k2, C, D]
Index: []

默认值:on的默认值是所有共同列,本例为:on=['k1', 'k2']

可以指定on,设定合并基准列,就可以根据k1进行合并,并且left和right共同列k2会同时变换名称后保留下来:

In [24]: pd.merge(left, right, on='k1')
Out[24]: 
    A   B k1 k2_x   C   D k2_y
0  a2  b2  y    3  c1  d1    5
1  a2  b2  y    3  c2  d2    6
2  a3  b3  y    4  c1  d1    5
3  a3  b3  y    4  c2  d2    6

4.2 how属性

how取值范围:'inner', 'outer', 'left', 'right'

默认值:how='inner'

‘inner’:共同列的值必须完全相等:

In [25]: pd.merge(left, right, on='k1', how='inner')
Out[25]: 
    A   B k1 k2_x   C   D k2_y
0  a2  b2  y    3  c1  d1    5
1  a2  b2  y    3  c2  d2    6
2  a3  b3  y    4  c1  d1    5
3  a3  b3  y    4  c2  d2    6

‘outer’:共同列的值都会保留,left或right在共同列上的差集,会对它们的缺失列项的值赋上NaN:

In [26]: pd.merge(left, right, on='k1', how='outer')
Out[26]: 
     A    B k1 k2_x    C    D k2_y
0   a0   b0  x    1  NaN  NaN  NaN
1   a1   b1  x    2  NaN  NaN  NaN
2   a2   b2  y    3   c1   d1    5
3   a2   b2  y    3   c2   d2    6
4   a3   b3  y    4   c1   d1    5
5   a3   b3  y    4   c2   d2    6
6  NaN  NaN  z  NaN   c3   d3    7
7  NaN  NaN  z  NaN   c4   d4    8

‘left’:根据左边的DataFrame确定共同列的保留值,右边缺失列项的值赋上NaN:

In [27]: pd.merge(left, right, on='k1', how='left')
Out[27]: 
    A   B k1 k2_x    C    D k2_y
0  a0  b0  x    1  NaN  NaN  NaN
1  a1  b1  x    2  NaN  NaN  NaN
2  a2  b2  y    3   c1   d1    5
3  a2  b2  y    3   c2   d2    6
4  a3  b3  y    4   c1   d1    5
5  a3  b3  y    4   c2   d2    6

‘right’:根据右边的DataFrame确定共同列的保留值,左边缺失列项的值赋上NaN:

In [28]: pd.merge(left, right, on='k1', how='right')
Out[28]: 
     A    B k1 k2_x   C   D k2_y
0   a2   b2  y    3  c1  d1    5
1   a3   b3  y    4  c1  d1    5
2   a2   b2  y    3  c2  d2    6
3   a3   b3  y    4  c2  d2    6
4  NaN  NaN  z  NaN  c3  d3    7
5  NaN  NaN  z  NaN  c4  d4    8

4.3 indicator

默认值:indicator=False,不显示合并方式

设置True表示显示合并方式,即left / right / both:

In [29]:  pd.merge(left, right, on='k1', how='outer', indicator=True)
Out[29]: 
     A    B k1 k2_x    C    D k2_y      _merge
0   a0   b0  x    1  NaN  NaN  NaN   left_only
1   a1   b1  x    2  NaN  NaN  NaN   left_only
2   a2   b2  y    3   c1   d1    5        both
3   a2   b2  y    3   c2   d2    6        both
4   a3   b3  y    4   c1   d1    5        both
5   a3   b3  y    4   c2   d2    6        both
6  NaN  NaN  z  NaN   c3   d3    7  right_only
7  NaN  NaN  z  NaN   c4   d4    8  right_only
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值