1 准备工作
创建3个DataFrame:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df1 = pd.DataFrame(np.ones((4, 4)) * 1, columns=list('ABCD'), index=list('1234'))
In [4]: df2 = pd.DataFrame(np.ones((4, 4)) * 2, columns=list('ABEF'), index=list('3456'))
In [5]: df3 = pd.DataFrame(np.ones((4, 4)) * 3, columns=list('CDEF'), index=list('1256'))
In [6]: df1
Out[6]:
A B C D
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
In [7]: df2
Out[7]:
A B E F
3 2.0 2.0 2.0 2.0
4 2.0 2.0 2.0 2.0
5 2.0 2.0 2.0 2.0
6 2.0 2.0 2.0 2.0
In [8]: df3
Out[8]:
C D E F
1 3.0 3.0 3.0 3.0
2 3.0 3.0 3.0 3.0
5 3.0 3.0 3.0 3.0
6 3.0 3.0 3.0 3.0
2 concat 合并
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
示例:
In [9]: pd.concat([df1, df2])
Out[9]:
A B C D E F
1 1.0 1.0 1.0 1.0 NaN NaN
2 1.0 1.0 1.0 1.0 NaN NaN
3 1.0 1.0 1.0 1.0 NaN NaN
4 1.0 1.0 1.0 1.0 NaN NaN
3 2.0 2.0 NaN NaN 2.0 2.0
4 2.0 2.0 NaN NaN 2.0 2.0
5 2.0 2.0 NaN NaN 2.0 2.0
6 2.0 2.0 NaN NaN 2.0 2.0
2.1 axis
默认值:axis=0
axis=0:竖方向(index)合并,合并方向index作列表相加,非合并方向columns取并集
axis=1:横方向(columns)合并,合并方向columns作列表相加,非合并方向index取并集
axis=0:
In [10]: pd.concat([df1, df2], axis=0)
Out[10]:
A B C D E F
1 1.0 1.0 1.0 1.0 NaN NaN
2 1.0 1.0 1.0 1.0 NaN NaN
3 1.0 1.0 1.0 1.0 NaN NaN
4 1.0 1.0 1.0 1.0 NaN NaN
3 2.0 2.0 NaN NaN 2.0 2.0
4 2.0 2.0 NaN NaN 2.0 2.0
5 2.0 2.0 NaN NaN 2.0 2.0
6 2.0 2.0 NaN NaN 2.0 2.0
axis=1:
In [11]: pd.concat([df1, df2], axis=1)
Out[11]:
A B C D A B E F
1 1.0 1.0 1.0 1.0 NaN NaN NaN NaN
2 1.0 1.0 1.0 1.0 NaN NaN NaN NaN
3 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
4 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
5 NaN NaN NaN NaN 2.0 2.0 2.0 2.0
6 NaN NaN NaN NaN 2.0 2.0 2.0 2.0
注意:原df中,取并集的行/列名称不能有重复项,即axis=0时columns不能有重复项,axis=1时index不能有重复项。
In [12]: df1.columns = list('ABCC')
In [13]: pd.concat([df1, df2], axis=0)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-86fa31c57748> in <module>
----> 1 pd.concat([df1, df2], axis=0)
d:\python\python36\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, lev
els, names, verify_integrity, copy)
205 verify_integrity=verify_integrity,
206 copy=copy)
--> 207 return op.get_result()
208
209
d:\python\python36\lib\site-packages\pandas\core\reshape\concat.py in get_result(self)
405 new_data = concatenate_block_managers(
406 mgrs_indexers, self.new_axes, concat_axis=self.axis,
--> 407 copy=self.copy)
408 if not self.copy:
409 new_data._consolidate_inplace()
d:\python\python36\lib\site-packages\pandas\core\internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, cop
y)
4830 blocks = [make_block(
4831 concatenate_join_units(join_units, concat_axis, copy=copy),
-> 4832 placement=placement) for placement, join_units in concat_plan]
4833
4834 return BlockManager(blocks, axes)
d:\python\python36\lib\site-packages\pandas\core\internals.py in <listcomp>(.0)
4828 for mgr, indexers in mgrs_indexers], concat_axis)
4829
-> 4830 blocks = [make_block(
4831 concatenate_join_units(join_units, concat_axis, copy=copy),
4832 placement=placement) for placement, join_units in concat_plan]
d:\python\python36\lib\site-packages\pandas\core\internals.py in combine_concat_plans(plans, concat_axis)
5065 while num_ended[0] != len(next_items):
5066 if num_ended[0] > 0:
-> 5067 raise ValueError("Plan shapes are not aligned")
5068
5069 placements, units = zip(*next_items)
ValueError: Plan shapes are not aligned
2.2 join
默认值:join='outer'
非合并方向的行/列名称:取交集(inner),取并集(outer)。
axis=0时,join='inner',columns取交集:
In [15]: pd.concat([df1, df2], axis=0, join='inner')
Out[15]:
A B
1 1.0 1.0
2 1.0 1.0
3 1.0 1.0
4 1.0 1.0
3 2.0 2.0
4 2.0 2.0
5 2.0 2.0
6 2.0 2.0
axis=1时join='inner',index取交集:
In [16]: pd.concat([df1, df2], axis=1, join='inner')
Out[16]:
A B C D A B E F
3 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
4 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
2.3 join_axes
默认值:join_axes=None,取并集
合并后,可以设置非合并方向的行/列名称,使用某个df的行/列名称
axis=0时,join_axes=[df1.columns],合并后columns使用df1的:
In [17]: pd.concat([df1, df2], axis=0, join_axes=[df1.columns])
Out[17]:
A B C D
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
3 2.0 2.0 NaN NaN
4 2.0 2.0 NaN NaN
5 2.0 2.0 NaN NaN
6 2.0 2.0 NaN NaN
axis=1时,axes=[df1.index],合并后index使用df2的:
In [18]: pd.concat([df1, df2], axis=1, join_axes=[df1.index])
Out[18]:
A B C D A B E F
1 1.0 1.0 1.0 1.0 NaN NaN NaN NaN
2 1.0 1.0 1.0 1.0 NaN NaN NaN NaN
3 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
4 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
同时设置join和join_axes,以join_axes为准(相当于只设置了join_axes,没有设置join):
In [19]: pd.concat([df1, df2], axis=0, join='inner', join_axes=[df1.columns])
Out[19]:
A B C D
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
3 2.0 2.0 NaN NaN
4 2.0 2.0 NaN NaN
5 2.0 2.0 NaN NaN
6 2.0 2.0 NaN NaN
2.4 ignore_index
默认值:ignore_index=False
合并方向是否忽略原行/列名称,而采用系统默认的索引,即从0开始的int。
axis=0时ignore_index=True,index采用系统默认索引:
In [20]: pd.concat([df1, df2], axis=0, ignore_index=True)
Out[20]:
A B C D E F
0 1.0 1.0 1.0 1.0 NaN NaN
1 1.0 1.0 1.0 1.0 NaN NaN
2 1.0 1.0 1.0 1.0 NaN NaN
3 1.0 1.0 1.0 1.0 NaN NaN
4 2.0 2.0 NaN NaN 2.0 2.0
5 2.0 2.0 NaN NaN 2.0 2.0
6 2.0 2.0 NaN NaN 2.0 2.0
7 2.0 2.0 NaN NaN 2.0 2.0
axis=1时ignore_index=True,columns采用系统默认索引:
In [21]: pd.concat([df1, df2], axis=1, ignore_index=True)
Out[21]:
0 1 2 3 4 5 6 7
1 1.0 1.0 1.0 1.0 NaN NaN NaN NaN
2 1.0 1.0 1.0 1.0 NaN NaN NaN NaN
3 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
4 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
5 NaN NaN NaN NaN 2.0 2.0 2.0 2.0
6 NaN NaN NaN NaN 2.0 2.0 2.0 2.0
2.5 keys
默认值:keys=None
可以加一层标签,标识行/列名称属于原来哪个df。
axis=0时设置keys:
In [22]: pd.concat([df1, df2], axis=0, keys=['x', 'y'])
Out[22]:
A B C D E F
x 1 1.0 1.0 1.0 1.0 NaN NaN
2 1.0 1.0 1.0 1.0 NaN NaN
3 1.0 1.0 1.0 1.0 NaN NaN
4 1.0 1.0 1.0 1.0 NaN NaN
y 3 2.0 2.0 NaN NaN 2.0 2.0
4 2.0 2.0 NaN NaN 2.0 2.0
5 2.0 2.0 NaN NaN 2.0 2.0
6 2.0 2.0 NaN NaN 2.0 2.0
axis=1时设置keys:
In [23]: pd.concat([df1, df2], axis=1, keys=['x', 'y'])
Out[23]:
x y
A B C D A B E F
1 1.0 1.0 1.0 1.0 NaN NaN NaN NaN
2 1.0 1.0 1.0 1.0 NaN NaN NaN NaN
3 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
4 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
5 NaN NaN NaN NaN 2.0 2.0 2.0 2.0
6 NaN NaN NaN NaN 2.0 2.0 2.0 2.0
也可以传字典取代keys:
In [24]: pd.concat({'x': df1, 'y': df2}, axis=0)
Out[24]:
A B C D E F
x 1 1.0 1.0 1.0 1.0 NaN NaN
2 1.0 1.0 1.0 1.0 NaN NaN
3 1.0 1.0 1.0 1.0 NaN NaN
4 1.0 1.0 1.0 1.0 NaN NaN
y 3 2.0 2.0 NaN NaN 2.0 2.0
4 2.0 2.0 NaN NaN 2.0 2.0
5 2.0 2.0 NaN NaN 2.0 2.0
6 2.0 2.0 NaN NaN 2.0 2.0
2.6 levels
默认值:levels=None
明确行/列名称取值范围
In [27]: df = pd.concat([df1, df2], axis=0, keys=['x', 'y'], levels=[['x', 'y', 'z', 'w']])
In [28]: df
Out[28]:
A B C D E F
x 1 1.0 1.0 1.0 1.0 NaN NaN
2 1.0 1.0 1.0 1.0 NaN NaN
3 1.0 1.0 1.0 1.0 NaN NaN
4 1.0 1.0 1.0 1.0 NaN NaN
y 3 2.0 2.0 NaN NaN 2.0 2.0
4 2.0 2.0 NaN NaN 2.0 2.0
5 2.0 2.0 NaN NaN 2.0 2.0
6 2.0 2.0 NaN NaN 2.0 2.0
In [29]: df.index.levels
Out[29]: FrozenList([['x', 'y', 'z', 'w'], ['1', '2', '3', '4', '5', '6']])
2.7 concat多个DataFrame
In [39]: pd.concat([df1, df2, df3], join_axes=[df1.columns])
Out[39]:
A B C D
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
3 2.0 2.0 NaN NaN
4 2.0 2.0 NaN NaN
5 2.0 2.0 NaN NaN
6 2.0 2.0 NaN NaN
1 NaN NaN 3.0 3.0
2 NaN NaN 3.0 3.0
5 NaN NaN 3.0 3.0
6 NaN NaN 3.0 3.0
3 append
append(self, other, ignore_index=False, verify_integrity=False)
竖方向合并df,没有axis属性,不会就地修改,而是会创建副本
示例:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df1 = pd.DataFrame(np.ones((4, 4)) * 1, columns=list('ABCD'), index=list('1234'))
In [4]: df2 = pd.DataFrame(np.ones((4, 4)) * 2, columns=list('ABEF'), index=list('3456'))
In [5]: df3 = pd.DataFrame(np.ones((4, 4)) * 3, columns=list('CDEF'), index=list('1256'))
In [6]: df1
Out[6]:
A B C D
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
In [7]: df2
Out[7]:
A B E F
3 2.0 2.0 2.0 2.0
4 2.0 2.0 2.0 2.0
5 2.0 2.0 2.0 2.0
6 2.0 2.0 2.0 2.0
In [8]: df3
Out[8]:
C D E F
1 3.0 3.0 3.0 3.0
2 3.0 3.0 3.0 3.0
5 3.0 3.0 3.0 3.0
6 3.0 3.0 3.0 3.0
In [9]: df1.append(df2) # 相当于pd.concat([df1, df2])
Out[9]:
A B C D E F
1 1.0 1.0 1.0 1.0 NaN NaN
2 1.0 1.0 1.0 1.0 NaN NaN
3 1.0 1.0 1.0 1.0 NaN NaN
4 1.0 1.0 1.0 1.0 NaN NaN
3 2.0 2.0 NaN NaN 2.0 2.0
4 2.0 2.0 NaN NaN 2.0 2.0
5 2.0 2.0 NaN NaN 2.0 2.0
6 2.0 2.0 NaN NaN 2.0 2.0
3.1 ignore_index属性
默认值:ignore_index=False
合并方向是否忽略原行名称,而采用系统默认的索引,即从0开始的int。
In [10]: df1.append(df2, ignore_index=True)
Out[10]:
A B C D E F
0 1.0 1.0 1.0 1.0 NaN NaN
1 1.0 1.0 1.0 1.0 NaN NaN
2 1.0 1.0 1.0 1.0 NaN NaN
3 1.0 1.0 1.0 1.0 NaN NaN
4 2.0 2.0 NaN NaN 2.0 2.0
5 2.0 2.0 NaN NaN 2.0 2.0
6 2.0 2.0 NaN NaN 2.0 2.0
7 2.0 2.0 NaN NaN 2.0 2.0
3.2 append多个DataFrame
和concat相同,append也支持append多个DataFrame
In [11]: df1.append([df2, df3], ignore_index=True)
Out[11]:
A B C D E F
0 1.0 1.0 1.0 1.0 NaN NaN
1 1.0 1.0 1.0 1.0 NaN NaN
2 1.0 1.0 1.0 1.0 NaN NaN
3 1.0 1.0 1.0 1.0 NaN NaN
4 2.0 2.0 NaN NaN 2.0 2.0
5 2.0 2.0 NaN NaN 2.0 2.0
6 2.0 2.0 NaN NaN 2.0 2.0
7 2.0 2.0 NaN NaN 2.0 2.0
8 NaN NaN 3.0 3.0 3.0 3.0
9 NaN NaN 3.0 3.0 3.0 3.0
10 NaN NaN 3.0 3.0 3.0 3.0
11 NaN NaN 3.0 3.0 3.0 3.0
4 merge
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
示例:
In [12]: import pandas as pd
In [13]: left = pd.DataFrame({'A': ['a0', 'a1', 'a2', 'a3'],
...: 'B': ['b0', 'b1', 'b2', 'b3'],
...: 'k1': ['x', 'x', 'y', 'y']})
In [14]: right = pd.DataFrame({'C': ['c1', 'c2', 'c3', 'c4'],
...: 'D': ['d1', 'd2', 'd3', 'd4'],
...: 'k1': ['y', 'y', 'z', 'z']})
In [15]: left
Out[15]:
A B k1
0 a0 b0 x
1 a1 b1 x
2 a2 b2 y
3 a3 b3 y
In [16]: right
Out[16]:
C D k1
0 c1 d1 y
1 c2 d2 y
2 c3 d3 z
3 c4 d4 z
对 left 和 right 进行merge:
In [17]: pd.merge(left, right)
Out[17]:
A B k1 C D
0 a2 b2 y c1 d1
1 a2 b2 y c2 d2
2 a3 b3 y c1 d1
3 a3 b3 y c2 d2
可以看到只有 left 和 right 的key1=y的行保留了下来,即默认合并后只保留有共同列项并且值相等行(即交集)。
本例中 left 和 right 的k1=y分别有2个,最终构成了2*2=4行。
如果没有共同列会报错:
In [18]: del left['k1']
In [19]: pd.merge(left, right)
---------------------------------------------------------------------------
MergeError Traceback (most recent call last)
<ipython-input-19-6264bf127cc8> in <module>
----> 1 pd.merge(left, right)
d:\python\python36\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, r
ight_index, sort, suffixes, copy, indicator)
51 right_on=right_on, left_index=left_index,
52 right_index=right_index, sort=sort, suffixes=suffixes,
---> 53 copy=copy, indicator=indicator)
54 return op.get_result()
55
d:\python\python36\lib\site-packages\pandas\core\reshape\merge.py in __init__(self, left, right, how, on, left_on, right_on, axis
, left_index, right_index, sort, suffixes, copy, indicator)
551 warnings.warn(msg, UserWarning)
552
--> 553 self._validate_specification()
554
555 # note this function has side effects
d:\python\python36\lib\site-packages\pandas\core\reshape\merge.py in _validate_specification(self)
926 self.right.columns)
927 if len(common_cols) == 0:
--> 928 raise MergeError('No common columns to perform merge on')
929 if not common_cols.is_unique:
930 raise MergeError("Data columns not unique: %s"
MergeError: No common columns to perform merge on
4.1 on属性
新增一个共同列,但没有相等的值,发现合并返回是空列表,因为默认只保留所有共同列都相等的行:
In [21]: left['k2'] = list('1234')
In [22]: right['k2'] = list('5678')
In [23]: pd.merge(left, right)
Out[23]:
Empty DataFrame
Columns: [A, B, k1, k2, C, D]
Index: []
默认值:on的默认值是所有共同列,本例为:on=['k1', 'k2']
可以指定on,设定合并基准列,就可以根据k1进行合并,并且left和right共同列k2会同时变换名称后保留下来:
In [24]: pd.merge(left, right, on='k1')
Out[24]:
A B k1 k2_x C D k2_y
0 a2 b2 y 3 c1 d1 5
1 a2 b2 y 3 c2 d2 6
2 a3 b3 y 4 c1 d1 5
3 a3 b3 y 4 c2 d2 6
4.2 how属性
how取值范围:'inner', 'outer', 'left', 'right'
默认值:how='inner'
‘inner’:共同列的值必须完全相等:
In [25]: pd.merge(left, right, on='k1', how='inner')
Out[25]:
A B k1 k2_x C D k2_y
0 a2 b2 y 3 c1 d1 5
1 a2 b2 y 3 c2 d2 6
2 a3 b3 y 4 c1 d1 5
3 a3 b3 y 4 c2 d2 6
‘outer’:共同列的值都会保留,left或right在共同列上的差集,会对它们的缺失列项的值赋上NaN:
In [26]: pd.merge(left, right, on='k1', how='outer')
Out[26]:
A B k1 k2_x C D k2_y
0 a0 b0 x 1 NaN NaN NaN
1 a1 b1 x 2 NaN NaN NaN
2 a2 b2 y 3 c1 d1 5
3 a2 b2 y 3 c2 d2 6
4 a3 b3 y 4 c1 d1 5
5 a3 b3 y 4 c2 d2 6
6 NaN NaN z NaN c3 d3 7
7 NaN NaN z NaN c4 d4 8
‘left’:根据左边的DataFrame确定共同列的保留值,右边缺失列项的值赋上NaN:
In [27]: pd.merge(left, right, on='k1', how='left')
Out[27]:
A B k1 k2_x C D k2_y
0 a0 b0 x 1 NaN NaN NaN
1 a1 b1 x 2 NaN NaN NaN
2 a2 b2 y 3 c1 d1 5
3 a2 b2 y 3 c2 d2 6
4 a3 b3 y 4 c1 d1 5
5 a3 b3 y 4 c2 d2 6
‘right’:根据右边的DataFrame确定共同列的保留值,左边缺失列项的值赋上NaN:
In [28]: pd.merge(left, right, on='k1', how='right')
Out[28]:
A B k1 k2_x C D k2_y
0 a2 b2 y 3 c1 d1 5
1 a3 b3 y 4 c1 d1 5
2 a2 b2 y 3 c2 d2 6
3 a3 b3 y 4 c2 d2 6
4 NaN NaN z NaN c3 d3 7
5 NaN NaN z NaN c4 d4 8
4.3 indicator
默认值:indicator=False,不显示合并方式
设置True表示显示合并方式,即left / right / both:
In [29]: pd.merge(left, right, on='k1', how='outer', indicator=True)
Out[29]:
A B k1 k2_x C D k2_y _merge
0 a0 b0 x 1 NaN NaN NaN left_only
1 a1 b1 x 2 NaN NaN NaN left_only
2 a2 b2 y 3 c1 d1 5 both
3 a2 b2 y 3 c2 d2 6 both
4 a3 b3 y 4 c1 d1 5 both
5 a3 b3 y 4 c2 d2 6 both
6 NaN NaN z NaN c3 d3 7 right_only
7 NaN NaN z NaN c4 d4 8 right_only