Combining Data with pandas

嘉嘉嘉Jessie

Tags: pandas, python, data analysis

Article link: https://blog.youkuaiyun.com/weixin_49588247/article/details/130919969

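This article explains how to combine data with pandas, covering several ways of joining data: adding rows, adding columns, and handling data whose row and column indexes differ. It covers the use of pd.concat, append, merge, and join, with concrete examples to help readers learn how to combine datasets.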

Combining Data

Learning objectives

- Concatenate data with pandas
- Merge datasets with pandas

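1 Introduction

Before starting any analysis, the data usually has to be cleaned up. The main goals of tidying are:
- each observation forms a row
- each variable forms a column
- each type of observational unit forms a table
Once the data is tidy, several tables may still need to be combined before a particular question can be answered:
- one table holds company names while another holds stock prices
- a single dataset may be split into several pieces, for example time-series data with each date stored in a separate file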

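2 Concatenating data

One way to combine data is concatenation:
- concatenation appends rows or columns to existing data
- when a dataset has been split into several pieces, concatenation stitches them back together
- concatenation can also append a computed result to an existing dataset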

2.1 Adding rows

pd.concat([df1, df2, ...], axis='index')

Load several pieces of data and concatenate them.

import pandas as pd
df1 = pd.read_csv('data/concat_1.csv')
df2 = pd.read_csv('data/concat_2.csv')
df3 = pd.read_csv('data/concat_3.csv')
print(df1)

Output:

    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3

print(df2)

Output:

    A   B   C   D
0  a4  b4  c4  d4
1  a5  b5  c5  d5
2  a6  b6  c6  d6
3  a7  b7  c7  d7

print(df3)

Output:

     A    B    C    D
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11
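
The CSV files are not included with the article. As a minimal sketch, assuming the files contain exactly the values printed above, the three frames can be rebuilt in memory. The helper name make_frame is hypothetical and exists only for this illustration.

```python
import pandas as pd


def make_frame(start, stop):
    """Build a frame with columns A-D whose cells read 'a4', 'b4', ... for rows start..stop-1.

    Hypothetical helper: it stands in for pd.read_csv('data/concat_N.csv').
    """
    return pd.DataFrame(
        {col: [f"{col.lower()}{i}" for i in range(start, stop)] for col in "ABCD"}
    )


df1 = make_frame(0, 4)
df2 = make_frame(4, 8)
df3 = make_frame(8, 12)

print(df1)
print(df2)
print(df3)
```

Each frame gets the default index 0 to 3, just as read_csv would assign, so the examples that follow behave the same way on these stand-ins.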

The concat function joins the three DataFrames above. The DataFrames must be passed together in a single list.

row_concat = pd.concat([df1,df2,df3])
print(row_concat)

Output:

     A    B    C    D
0   a0   b0   c0   d0
1   a1   b1   c1   d1
2   a2   b2   c2   d2
3   a3   b3   c3   d3
0   a4   b4   c4   d4
1   a5   b5   c5   d5
2   a6   b6   c6   d6
3   a7   b7   c7   d7
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11

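As the result shows, concat simply stacks the three DataFrames and keeps each frame's original index labels. Subsets of the concatenated data can be taken with iloc, loc, and similar accessors.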

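row_concat.iloc[3]  # iloc[row position]: selects by integer position

Output:

A    a3
B    b3
C    c3
D    d3
Name: 3, dtype: object

row_concat.loc[3]  # loc[row label]: the label 3 occurs three times, so three rows come back

Output:

     A    B    C    D
3   a3   b3   c3   d3
3   a7   b7   c7   d7
3  a11  b11  c11  d11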
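
The difference between the two accessors can be checked on a smaller case. The sketch below uses two hypothetical two-row frames (top and bottom are illustrative names, not from the article) and shows that a stacked index is no longer unique, so iloc returns one row while loc returns every row that carries the label.

```python
import pandas as pd

# Two small stand-in frames (hypothetical values), each with the default index 0..1
top = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})
bottom = pd.DataFrame({"A": ["a2", "a3"], "B": ["b2", "b3"]})

stacked = pd.concat([top, bottom])  # index labels become [0, 1, 0, 1]

by_position = stacked.iloc[1]  # the second physical row, returned as a Series
by_label = stacked.loc[1]      # every row labelled 1, returned as a DataFrame

print(stacked.index.is_unique)
print(type(by_position).__name__, type(by_label).__name__)
```

Checking index.is_unique after a concatenation is a quick way to find out whether label-based lookups may return more than one row.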

Concatenating a DataFrame and a Series with concat

new_series = pd.Series(['n1','n2','n3','n4'])
print(new_series)

Output:

0    n1
1    n2
2    n3
3    n4
dtype: object

pd.concat([df1,new_series])

Output:

     A    B    C    D    0
0   a0   b0   c0   d0  NaN
1   a1   b1   c1   d1  NaN
2   a2   b2   c2   d2  NaN
3   a3   b3   c3   d3  NaN
0  NaN  NaN  NaN  NaN   n1
1  NaN  NaN  NaN  NaN   n2
2  NaN  NaN  NaN  NaN   n3
3  NaN  NaN  NaN  NaN   n4

The result contains NaN, which pandas uses to mark missing values. By default concat appends rows (axis='index') and aligns its inputs on their column labels. A Series is one-dimensional, and this one has no name, so it is treated as a single column labelled 0. That label matches none of A, B, C, or D, so a new column 0 is added, the Series values land in new rows below df1, and every cell that has no data is filled with NaN.
To append ['n1','n2','n3','n4'] to df1 as a row instead, create a DataFrame and give it the same column names.

# note the double brackets: [['n1','n2','n3','n4']] is a list containing one row
new_row_df = pd.DataFrame([['n1','n2','n3','n4']],columns=['A','B','C','D'])
# tidy the data by giving it column labels
print(new_row_df)
# new_row_df is a DataFrame

Output:

    A   B   C   D
0  n1  n2  n3  n4

print(pd.concat([df1,new_row_df]))

The rows line up because the column labels of df1 and new_row_df (a DataFrame) are now identical.

Output:

    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
0  n1  n2  n3  n4
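
The label alignment described above can be verified directly. The following is a minimal sketch on a hypothetical two-column stand-in for df1. It first reproduces the mismatch caused by an unnamed Series, then shows one alternative to building the row DataFrame by hand: give the Series an index equal to the column names and transpose it with to_frame().T. This alternative is not from the article; it is shown only as another way to reach the same result.

```python
import pandas as pd

# Hypothetical two-column stand-in for df1
df1 = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})

# An unnamed Series becomes a column labelled 0, which matches neither A nor B
unnamed = pd.Series(["n1", "n2"])
mismatched = pd.concat([df1, unnamed])
print(mismatched.columns.tolist())

# A Series indexed by the column names, turned into a one-row DataFrame
labelled = pd.Series(["n1", "n2"], index=["A", "B"])
as_row = labelled.to_frame().T
appended = pd.concat([df1, as_row], ignore_index=True)
print(appended)
```

The first concat produces four rows and three columns with NaN wherever labels do not match; the second produces three rows and the original two columns.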

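df1.append(df2 / dict, ignore_index=...)

concat can join many objects at once. To append a single object to an existing DataFrame, the append method can be used. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the append calls in this part only run on older versions; on current versions pd.concat does the same job.

print(df1.append(df2))  # pandas < 2.0 only

Output:

    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
0  a4  b4  c4  d4
1  a5  b5  c5  d5
2  a6  b6  c6  d6
3  a7  b7  c7  d7

print(df1.append(new_row_df))  # pandas < 2.0 only
# after the tidying step above, new_row_df has the same column labels as df1

Output:

    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
0  n1  n2  n3  n4

The ignore_index parameter: True discards the old index labels and renumbers the rows from 0.

Adding a row from a Python dict

data_dict = {
   'A':'n1','B':'n2','C':'n3','D':'n4'}
df1.append(data_dict,ignore_index=True)  # pandas < 2.0 only
# ignore_index=True discards the old labels, so the row built from data_dict gets the next label in a fresh index

Output:

    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
4  n1  n2  n3  n4

In the example above, appending a dict to a DataFrame requires ignore_index=True; without it pandas raises an error.
When concatenating two or more DataFrames, ignore_index=True likewise discards the incoming index labels and numbers the rows consecutively.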

row_concat_ignore_index = pd.concat([df1,df2,df3],ignore_index=True)
print(row_concat_ignore_index)

Output:

      A    B    C    D
0    a0   b0   c0   d0
1    a1   b1   c1   d1
2    a2   b2   c2   d2
3    a3   b3   c3   d3
4    a4   b4   c4   d4
5    a5   b5   c5   d5
6    a6   b6   c6   d6
7    a7   b7   c7   d7
8    a8   b8   c8   d8
9    a9   b9   c9   d9
10  a10  b10  c10  d10
11  a11  b11  c11  d11
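
Because DataFrame.append is no longer available in pandas 2.0 and later, the append examples in this section can be rewritten with pd.concat. The following is a minimal sketch on hypothetical two-column stand-ins, not the article's own code. The dict is wrapped in a list so that pd.DataFrame turns it into a single row.

```python
import pandas as pd

# Hypothetical two-column stand-ins for the article's frames
df1 = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})
df2 = pd.DataFrame({"A": ["a2", "a3"], "B": ["b2", "b3"]})
data_dict = {"A": "n1", "B": "n2"}

# Counterpart of df1.append(df2): the original index labels are kept
appended_frame = pd.concat([df1, df2])

# Counterpart of df1.append(data_dict, ignore_index=True):
# [data_dict] is a list of one record, so the DataFrame has exactly one row
appended_dict = pd.concat([df1, pd.DataFrame([data_dict])], ignore_index=True)

print(appended_frame)
print(appended_dict)
```

When many rows are added in a loop, collecting the pieces in a list and calling pd.concat once is generally preferable to concatenating repeatedly, since every call copies the data.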

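2.2 Adding columns

pd.concat([df1, df2, ...], axis='columns')

The axis parameter defaults to 'index', which adds rows; 'columns' adds columns.

Adding columns with concat works like adding rows, with one extra argument. axis defaults to 0 ('index'), which appends rows; passing axis='columns' or axis=1 appends columns instead.

col_concat = pd.concat([df1,df2,df3],axis=1)
print(col_concat)

Output:

    A   B   C   D   A   B   C   D    A    B    C    D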
0  a0  b0  c0  d0  a4  b4  c4  d4   a8   b8   c8   d8
1  a1  b1  c1  d1  a5  b5  c5  d5   a9   b9   c9   d9
2  a2  b2  c2  d2  a6  b6  c6  d6  a10  b10  c10  d10
3  a3  b3  c3  d3  a7  b7  c7  d7  a11  b11  c11  d11

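Selecting a subset by column name. Three columns share the label 'A', so the result is a DataFrame with three columns rather than a single Series.

print(col_concat['A'])

Output:

    A   A    A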
0  a0  a4   a8
1  a1  a5   a9
2  a2  a6  a10
3  a3  a7  a11
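
Duplicate column labels can make later selection ambiguous. As a sketch of one way to keep the sources apart, concat accepts a keys argument that adds an outer level to the column labels. The frame names left and right and the key strings are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Hypothetical two-column stand-ins
left = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})
right = pd.DataFrame({"A": ["a2", "a3"], "B": ["b2", "b3"]})

# Plain column-wise concat: the label 'A' now appears twice
side_by_side = pd.concat([left, right], axis="columns")
both_a = side_by_side["A"]  # a DataFrame holding both 'A' columns

# With keys, each source frame gets its own outer column label
keyed = pd.concat([left, right], axis="columns", keys=["left", "right"])
right_a = keyed[("right", "A")]  # a single Series

print(both_a.shape)
print(right_a.tolist())
```

With keys, the columns form a MultiIndex, so a tuple such as ("right", "A") picks out exactly one column.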

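df['column name'] = [values]

Adding a column to a DataFrame needs no function call; an assignment of the form dataframe['column name'] = [values] is enough. A list of values must contain one entry per row.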

col_concat['new_col'] =
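
A minimal sketch of column assignment follows. The frame is a hypothetical stand-in for col_concat, and the assigned values and the column name flag are assumptions made for illustration, since the values used in the original are not shown.

```python
import pandas as pd

# Hypothetical stand-in for col_concat with four rows
col_concat = pd.DataFrame(
    {"A": ["a0", "a1", "a2", "a3"], "B": ["b0", "b1", "b2", "b3"]}
)

# A list must supply exactly one value per row (illustrative values)
col_concat["new_col"] = ["n1", "n2", "n3", "n4"]

# A scalar is broadcast to every row
col_concat["flag"] = "x"

print(col_concat)
```

Assigning a list whose length differs from the number of rows raises a ValueError, which makes length mistakes visible immediately.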
