pandas
处理多组数据的时候往往会要用到数据的合并处理,其中有三种方式,concat、append和merge。
1、concat
用concat
是一种基本的合并方式。而且concat
中有很多参数可以调整,合并成你想要的数据形式。axis来指明合并方向。axis=0
是预设值,因此未设定任何参数时,函数默认axis=0
。(0表示上下合并,1表示左右合并)
-
import pandas
as pd
-
import numpy
as np
-
-
#定义资料集
-
df1 = pd.DataFrame(np.ones((
3,
4))*
0, columns=[
'a',
'b',
'c',
'd'])
-
df2 = pd.DataFrame(np.ones((
3,
4))*
1, columns=[
'a',
'b',
'c',
'd'])
-
df3 = pd.DataFrame(np.ones((
3,
4))*
2, columns=[
'a',
'b',
'c',
'd'])
-
-
#concat纵向合并
-
res = pd.concat([df1, df2, df3], axis=
0)
-
-
#打印结果
-
print(res)
-
'''
-
a b c d
-
0 0.0 0.0 0.0 0.0
-
1 0.0 0.0 0.0 0.0
-
2 0.0 0.0 0.0 0.0
-
0 1.0 1.0 1.0 1.0
-
1 1.0 1.0 1.0 1.0
-
2 1.0 1.0 1.0 1.0
-
0 2.0 2.0 2.0 2.0
-
1 2.0 2.0 2.0 2.0
-
2 2.0 2.0 2.0 2.0
-
'''
上述index为0,1,2,0,1,2形式。为什么会出现这样的情况,其实是仍然按照合并前的index组合起来的。若希望递增,请看下面示例:
ignore_index (重置 index)
重置后的index为0,1,……8
-
res = pd.concat([df1, df2, df3], axis=
0, ignore_index=
True)
# 将ignore_index设置为True
-
-
print(res)
#打印结果
-
'''
-
a b c d
-
0 0.0 0.0 0.0 0.0
-
1 0.0 0.0 0.0 0.0
-
2 0.0 0.0 0.0 0.0
-
3 1.0 1.0 1.0 1.0
-
4 1.0 1.0 1.0 1.0
-
5 1.0 1.0 1.0 1.0
-
6 2.0 2.0 2.0 2.0
-
7 2.0 2.0 2.0 2.0
-
8 2.0 2.0 2.0 2.0
-
'''
join (合并方式)
join='outer'
为预设值,因此未设定任何参数时,函数默认join='outer'
。此方式是依照column
来做纵向合并,有相同的column
上下合并在一起,其他独自的column
个自成列,原本没有值的位置皆以NaN
填充。
-
import pandas
as pd
-
import numpy
as np
-
-
#定义资料集
-
df1 = pd.DataFrame(np.ones((
3,
4))*
0, columns=[
'a',
'b',
'c',
'd'], index=[
1,
2,
3])
-
df2 = pd.DataFrame(np.ones((
3,
4))*
1, columns=[
'b',
'c',
'd',
'e'], index=[
2,
3,
4])
-
-
res = pd.concat([df1, df2], axis=
0, join=
'outer')
#纵向"外"合并df1与df2
-
-
print(res)
-
'''
-
a b c d e
-
1 0.0 0.0 0.0 0.0 NaN
-
2 0.0 0.0 0.0 0.0 NaN
-
3 0.0 0.0 0.0 0.0 NaN
-
2 NaN 1.0 1.0 1.0 1.0
-
3 NaN 1.0 1.0 1.0 1.0
-
4 NaN 1.0 1.0 1.0 1.0
-
'''
-
res = pd.concat([df1, df2], axis=
0, join=
'inner')
#纵向"内"合并df1与df2
-
-
#打印结果
-
print(res)
-
'''
-
b c d
-
1 0.0 0.0 0.0
-
2 0.0 0.0 0.0
-
3 0.0 0.0 0.0
-
2 1.0 1.0 1.0
-
3 1.0 1.0 1.0
-
4 1.0 1.0 1.0
-
'''
join_axes (依照 axes 合并)
-
import pandas
as pd
-
import numpy
as np
-
-
#定义资料集
-
df1 = pd.DataFrame(np.ones((
3,
4))*
0, columns=[
'a',
'b',
'c',
'd'], index=[
1,
2,
3])
-
df2 = pd.DataFrame(np.ones((
3,
4))*
1, columns=[
'b',
'c',
'd',
'e'], index=[
2,
3,
4])
-
-
#依照`df1.index`进行横向合并
-
res = pd.concat([df1, df2], axis=
1, join_axes=[df1.index])
-
-
#打印结果
-
print(res)
-
# a b c d b c d e
-
# 1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
-
# 2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
-
# 3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
上述脚本中,join_axes=[df1.index]表明按照df1的index来合并,可以看到结果中去掉了df2中出现但df1中没有的index=4这一行。
2、append (添加数据)
append
只有纵向合并,没有横向合并。
-
import pandas
as pd
-
import numpy
as np
-
-
#定义资料集
-
df1 = pd.DataFrame(np.ones((
3,
4))*
0, columns=[
'a',
'b',
'c',
'd'])
-
df2 = pd.DataFrame(np.ones((
3,
4))*
1, columns=[
'a',
'b',
'c',
'd'])
-
df3 = pd.DataFrame(np.ones((
3,
4))*
1, columns=[
'a',
'b',
'c',
'd'])
-
s1 = pd.Series([
1,
2,
3,
4], index=[
'a',
'b',
'c',
'd'])
-
-
#将df2合并到df1的下面,以及重置index,并打印出结果
-
res = df1.append(df2, ignore_index=
True)
-
print(res)
-
# a b c d
-
# 0 0.0 0.0 0.0 0.0
-
# 1 0.0 0.0 0.0 0.0
-
# 2 0.0 0.0 0.0 0.0
-
# 3 1.0 1.0 1.0 1.0
-
# 4 1.0 1.0 1.0 1.0
-
# 5 1.0 1.0 1.0 1.0
-
-
#合并多个df,将df2与df3合并至df1的下面,以及重置index,并打印出结果
-
res = df1.append([df2, df3], ignore_index=
True)
-
print(res)
-
# a b c d
-
# 0 0.0 0.0 0.0 0.0
-
# 1 0.0 0.0 0.0 0.0
-
# 2 0.0 0.0 0.0 0.0
-
# 3 1.0 1.0 1.0 1.0
-
# 4 1.0 1.0 1.0 1.0
-
# 5 1.0 1.0 1.0 1.0
-
# 6 1.0 1.0 1.0 1.0
-
# 7 1.0 1.0 1.0 1.0
-
# 8 1.0 1.0 1.0 1.0
-
-
#合并series,将s1合并至df1,以及重置index,并打印出结果
-
res = df1.append(s1, ignore_index=
True)
-
print(res)
-
# a b c d
-
# 0 0.0 0.0 0.0 0.0
-
# 1 0.0 0.0 0.0 0.0
-
# 2 0.0 0.0 0.0 0.0
-
# 3 1.0 2.0 3.0 4.0
3、merge
根据两组数据中的关键字key来合并(key在两组数据中是完全一致的)。
3.1依据一组key合并
-
import pandas
as pd
-
-
#定义资料集并打印出
-
left = pd.DataFrame({
'key': [
'K0',
'K1',
'K2',
'K3'],
-
'A': [
'A0',
'A1',
'A2',
'A3'],
-
'B': [
'B0',
'B1',
'B2',
'B3']})
-
right = pd.DataFrame({
'key': [
'K0',
'K1',
'K2',
'K3'],
-
'C': [
'C0',
'C1',
'C2',
'C3'],
-
'D': [
'D0',
'D1',
'D2',
'D3']})
-
-
print(left)
-
# A B key
-
# 0 A0 B0 K0
-
# 1 A1 B1 K1
-
# 2 A2 B2 K2
-
# 3 A3 B3 K3
-
-
print(right)
-
# C D key
-
# 0 C0 D0 K0
-
# 1 C1 D1 K1
-
# 2 C2 D2 K2
-
# 3 C3 D3 K3
-
-
#依据key column合并,并打印出
-
res = pd.merge(left, right, on=
'key')
-
-
print(res)
-
A B key C D
-
# 0 A0 B0 K0 C0 D0
-
# 1 A1 B1 K1 C1 D1
-
# 2 A2 B2 K2 C2 D2
-
# 3 A3 B3 K3 C3 D3
3.2 根据两组key合并
合并时有4种方法how = ['left', 'right', 'outer', 'inner']
,预设值how='inner'
。
- inner:按照关键字组合之后,去掉组合中有合并项为NaN的行。
- outer :保留所有组合
- left:仅保留左边合并项为NaN的行
- right:仅保留右边合并项为NaN的行
-
import pandas
as pd
-
import numpy
as np
-
-
#定义资料集并打印出
-
left = pd.DataFrame({
'key1': [
'K0',
'K0',
'K1',
'K2'],
-
'key2': [
'K0',
'K1',
'K0',
'K1'],
-
'A': [
'A0',
'A1',
'A2',
'A3'],
-
'B': [
'B0',
'B1',
'B2',
'B3']})
-
right = pd.DataFrame({
'key1': [
'K0',
'K1',
'K1',
'K2'],
-
'key2': [
'K0',
'K0',
'K0',
'K0'],
-
'C': [
'C0',
'C1',
'C2',
'C3'],
-
'D': [
'D0',
'D1',
'D2',
'D3']})
-
-
print(left)
-
'''
-
key1 key2 A B
-
0 K0 K0 A0 B0
-
1 K0 K1 A1 B1
-
2 K1 K0 A2 B2
-
3 K2 K1 A3 B3
-
'''
-
print(right)
-
'''
-
key1 key2 C D
-
0 K0 K0 C0 D0
-
1 K1 K0 C1 D1
-
2 K1 K0 C2 D2
-
3 K2 K0 C3 D3
-
'''
-
-
#依据key1与key2 columns进行合并,并打印出四种结果['left', 'right', 'outer', 'inner']
-
res = pd.merge(left, right, on=[
'key1',
'key2'], how=
'inner')
-
print(res)
-
'''
-
key1 key2 A B C D
-
0 K0 K0 A0 B0 C0 D0
-
1 K1 K0 A2 B2 C1 D1
-
2 K1 K0 A2 B2 C2 D2
-
'''
-
res = pd.merge(left, right, on=[
'key1',
'key2'], how=
'outer')
-
print(res)
-
'''
-
key1 key2 A B C D
-
0 K0 K0 A0 B0 C0 D0
-
1 K0 K1 A1 B1 NaN NaN
-
2 K1 K0 A2 B2 C1 D1
-
3 K1 K0 A2 B2 C2 D2
-
4 K2 K1 A3 B3 NaN NaN
-
5 K2 K0 NaN NaN C3 D3
-
'''
-
res = pd.merge(left, right, on=[
'key1',
'key2'], how=
'left')
-
print(res)
-
'''
-
key1 key2 A B C D
-
0 K0 K0 A0 B0 C0 D0
-
1 K0 K1 A1 B1 NaN NaN
-
2 K1 K0 A2 B2 C1 D1
-
3 K1 K0 A2 B2 C2 D2
-
4 K2 K1 A3 B3 NaN NaN
-
'''
-
res = pd.merge(left, right, on=[
'key1',
'key2'], how=
'right')
-
print(res)
-
'''
-
key1 key2 A B C D
-
0 K0 K0 A0 B0 C0 D0
-
1 K1 K0 A2 B2 C1 D1
-
2 K1 K0 A2 B2 C2 D2
-
3 K2 K0 NaN NaN C3 D3
-
'''
3.3 Indicator
indicator=True
会将合并的记录放在新的一列。
-
import pandas
as pd
-
-
#定义资料集并打印出
-
df1 = pd.DataFrame({
'col1':[
0,
1],
'col_left':[
'a',
'b']})
-
df2 = pd.DataFrame({
'col1':[
1,
2,
2],
'col_right':[
2,
2,
2]})
-
-
print(df1)
-
# col1 col_left
-
# 0 0 a
-
# 1 1 b
-
-
print(df2)
-
# col1 col_right
-
# 0 1 2
-
# 1 2 2
-
# 2 2 2
-
-
# 依据col1进行合并,并启用indicator=True,最后打印出
-
res = pd.merge(df1, df2, on=
'col1', how=
'outer', indicator=
True)
-
print(res)
-
# col1 col_left col_right _merge
-
# 0 0.0 a NaN left_only
-
# 1 1.0 b 2.0 both
-
# 2 2.0 NaN 2.0 right_only
-
# 3 2.0 NaN 2.0 right_only
-
-
# 自定indicator column的名称,并打印出
-
res = pd.merge(df1, df2, on=
'col1', how=
'outer', indicator=
'indicator_column')
-
print(res)
-
# col1 col_left col_right indicator_column
-
# 0 0.0 a NaN left_only
-
# 1 1.0 b 2.0 both
-
# 2 2.0 NaN 2.0 right_only
-
# 3 2.0 NaN 2.0 right_only
3.4 依据index合并
-
import pandas
as pd
-
-
#定义资料集并打印出
-
left = pd.DataFrame({
'A': [
'A0',
'A1',
'A2'],
-
'B': [
'B0',
'B1',
'B2']},
-
index=[
'K0',
'K1',
'K2'])
-
right = pd.DataFrame({
'C': [
'C0',
'C2',
'C3'],
-
'D': [
'D0',
'D2',
'D3']},
-
index=[
'K0',
'K2',
'K3'])
-
-
print(left)
-
# A B
-
# K0 A0 B0
-
# K1 A1 B1
-
# K2 A2 B2
-
-
print(right)
-
# C D
-
# K0 C0 D0
-
# K2 C2 D2
-
# K3 C3 D3
-
-
#依据左右资料集的index进行合并,how='outer',并打印出
-
res = pd.merge(left, right, left_index=
True, right_index=
True, how=
'outer')
-
print(res)
-
# A B C D
-
# K0 A0 B0 C0 D0
-
# K1 A1 B1 NaN NaN
-
# K2 A2 B2 C2 D2
-
# K3 NaN NaN C3 D3
-
-
#依据左右资料集的index进行合并,how='inner',并打印出
-
res = pd.merge(left, right, left_index=
True, right_index=
True, how=
'inner')
-
print(res)
-
# A B C D
-
# K0 A0 B0 C0 D0
-
# K2 A2 B2 C2 D2
3.5 解决overlapping的问题
下面脚本中,boys和girls均有属性age,但是两者值不同,因此需要在合并时加上后缀suffixes,以示区分。
-
import pandas
as pd
-
-
#定义资料集
-
boys = pd.DataFrame({
'k': [
'K0',
'K1',
'K2'],
'age': [
1,
2,
3]})
-
girls = pd.DataFrame({
'k': [
'K0',
'K0',
'K3'],
'age': [
4,
5,
6]})
-
-
#使用suffixes解决overlapping的问题
-
res = pd.merge(boys, girls, on=
'k', suffixes=[
'_boy',
'_girl'], how=
'inner')
-
print(res)
-
# age_boy k age_girl
-
# 0 1 K0 4
-
# 1 1 K0 5
以上是pandas中有关于合并的一些操作。当然,如果练习的多了,几个方法也是大同小异。