ch8_02 Data Wrangling: Join, Combine, and Reshape

License: CC 4.0 BY-SA

Tags: #pandas #data-cleaning #data-aggregation

Part of the column "Python for Data Analysis"

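This article walks through pandas techniques for combining and reshaping data: concatenating along an axis, combining overlapping datasets, and pivoting between long and wide formats. The examples use concat, combine_first, stack, unstack, pivot and melt.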

Continued from the previous part.

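Concatenating Along an Axis

Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking. NumPy's concatenate function does this with NumPy arrays: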

import numpy as np
import pandas as pd

arr = np.arange(12).reshape((3,4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

np.concatenate([arr,arr],axis=1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

np.concatenate([arr,arr],axis=0)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
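
As a quick cross-check of what the axis argument means, the sketch below compares np.concatenate with np.vstack and np.hstack, which cover the same two cases for 2-D arrays. The variable names rows and cols are only illustrative.

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

# axis=0 appends rows (like np.vstack); axis=1 appends columns (like np.hstack)
rows = np.concatenate([arr, arr], axis=0)
cols = np.concatenate([arr, arr], axis=1)

print(rows.shape)  # (6, 4)
print(cols.shape)  # (3, 8)
print(np.array_equal(rows, np.vstack([arr, arr])))  # True
print(np.array_equal(cols, np.hstack([arr, arr])))  # True
```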

The pandas concat() function

s1 = pd.Series([0,1], index=['a','b'])
s2 = pd.Series([2,3,4],index= ['c','d','e'])
s3 = pd.Series([5,6],index=['f','g'])

# concat glues together the values and indexes of these Series
pd.concat([s1,s2,s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

By default concat works along axis=0. Passing axis=1 concatenates along the columns instead, and the result becomes a DataFrame:

pd.concat([s1,s2,s3],axis=1,sort=False)

	0	1	2
a	0.0	NaN	NaN
b	1.0	NaN	NaN
c	NaN	2.0	NaN
d	NaN	3.0	NaN
e	NaN	4.0	NaN
f	NaN	NaN	5.0
g	NaN	NaN	6.0

s4 = pd.concat([s1,s3])
s4

a    0
b    1
f    5
g    6
dtype: int64

pd.concat([s1,s4],axis=1,sort=False)

	0	1
a	0.0	0
b	1.0	1
f	NaN	5
g	NaN	6

pd.concat([s1,s4],axis=1,join='inner')

	0	1
a	0	0
b	1	1
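
The difference between the two join modes can be checked directly on the result index. The following is a minimal sketch that rebuilds s1 and s4 from the values shown above; outer and inner are illustrative names.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s4 = pd.Series([0, 1, 5, 6], index=['a', 'b', 'f', 'g'])

# join='outer' (the default) keeps the union of the indexes,
# join='inner' keeps only the labels present in every input
outer = pd.concat([s1, s4], axis=1)
inner = pd.concat([s1, s4], axis=1, join='inner')

print(list(outer.index))      # ['a', 'b', 'f', 'g']
print(list(inner.index))      # ['a', 'b']
print(outer[0].isna().sum())  # 2 (labels 'f' and 'g' are missing from s1)
```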

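Older pandas versions accepted a join_axes argument for specifying the index to use on the other axes. It was removed in pandas 1.0; reindex the result instead:

pd.concat([s1,s4],axis=1).reindex(['a','c','b','e'])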

	0	1
a	0.0	0.0
c	NaN	NaN
b	1.0	1.0
e	NaN	NaN

# One issue: the concatenated pieces cannot be told apart in the result.
# To create a hierarchical index on the concatenation axis, use the keys argument:
result = pd.concat([s1,s2,s3],keys = ['one','two','three'])
result

one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64

result.unstack()

	a	b	c	d	e	f	g
one	0.0	1.0	NaN	NaN	NaN	NaN	NaN
two	NaN	NaN	2.0	3.0	4.0	NaN	NaN
three	NaN	NaN	NaN	NaN	NaN	5.0	6.0
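
Because the keys become the outer index level, each original piece can be pulled back out by its key. A short sketch, assuming the same s1, s2 and s3 as above; piece is an illustrative name.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

result = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])

# Selecting on the outer level returns the piece that was passed in under that key
piece = result.loc['two']
print(list(piece.index))     # ['c', 'd', 'e']
print(piece.tolist())        # [2, 3, 4]
print(result.index.nlevels)  # 2
```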

When combining along axis=1, the keys become the DataFrame column headers:

pd.concat([s1,s2,s3],axis=1,keys=['one','two','three'],sort=False)

	one	two	three
a	0.0	NaN	NaN
b	1.0	NaN	NaN
c	NaN	2.0	NaN
d	NaN	3.0	NaN
e	NaN	4.0	NaN
f	NaN	NaN	5.0
g	NaN	NaN	6.0

The same logic extends to DataFrame objects:

df1 = pd.DataFrame(np.arange(6).reshape((3,2)),
                   index = ['a','b','c'],
                   columns=['one','two'])
df2 = pd.DataFrame(5+np.arange(4).reshape((2,2)),
                  index = ['a','c'],
                  columns = ['three','four'])
df1

	one	two
a	0	1
b	2	3
c	4	5

df2

	three	four
a	5	6
c	7	8

pd.concat([df1,df2],axis=1,keys=['level1','level2'],sort=False)

	level1		level2
	one	two	three	four
a	0	1	5.0	6.0
b	2	3	NaN	NaN
c	4	5	7.0	8.0

If you pass a dict instead of a list, the dict's keys are used for the keys option:

pd.concat({'level1':df1, 'level2':df2},axis=1,sort=False)

	level1		level2
	one	two	three	four
a	0	1	5.0	6.0
b	2	3	NaN	NaN
c	4	5	7.0	8.0

You can also pass the names argument to name the levels of the resulting hierarchical columns:

pd.concat([df1,df2],axis=1,keys=['level1','level2'],names=['upper','lower'],sort=False)

upper	level1		level2
lower	one	two	three	four
a	0	1	5.0	6.0
b	2	3	NaN	NaN
c	4	5	7.0	8.0
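
To see what keys and names leave behind, the sketch below inspects the resulting column index and selects one of the original frames by its key. It rebuilds df1 and df2 as defined above; combined and level1 are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape((3, 2)),
                   index=['a', 'b', 'c'], columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape((2, 2)),
                   index=['a', 'c'], columns=['three', 'four'])

combined = pd.concat([df1, df2], axis=1,
                     keys=['level1', 'level2'], names=['upper', 'lower'])

# names labels the two column levels; selecting a key returns that frame's columns
print(list(combined.columns.names))  # ['upper', 'lower']
level1 = combined['level1']
print(list(level1.columns))          # ['one', 'two']
print(level1.loc['b', 'two'])        # 3
```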

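The main concat arguments are:
- objs: the list or dict of pandas objects to concatenate (the only required argument)
- axis: the axis to concatenate along; defaults to 0
- join: 'outer' (the default, union of the indexes on the other axes) or 'inner' (intersection)
- keys: values used to build a hierarchical index along the concatenation axis
- names: names for the levels created by keys
- ignore_index: discard the labels along the concatenation axis and renumber from 0
- verify_integrity: raise an error if the result would contain duplicate labels
- sort: sort the non-concatenation axis when it is not already aligned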
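
Two concat arguments that the examples above do not exercise are ignore_index and verify_integrity. The sketch below shows both; s5 is a hypothetical Series introduced only because it shares the label 'b' with s1.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s5 = pd.Series([9, 8], index=['b', 'c'])  # hypothetical: overlaps s1 on label 'b'

# ignore_index=True discards the original labels and renumbers from 0
renumbered = pd.concat([s1, s5], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2, 3]

# verify_integrity=True raises ValueError if the result would have duplicate labels
try:
    pd.concat([s1, s5], verify_integrity=True)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```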

Combining Data with Overlap

a = pd.Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index=['f','e','d','c','b','a'])
b = pd.Series(np.arange(len(a),dtype=np.float64),index=a.index)
a

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

b.iloc[-1] = np.nan  # set the last element by position
b

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

np.where(pd.isnull(a),b,a)

array([0. , 2.5, 2. , 3.5, 4.5, nan])

b[:-2].combine_first(a[1:])

a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64
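
The two approaches differ in one important way: np.where pairs elements by position and returns a plain array, while combine_first aligns on index labels. The sketch below makes the difference visible by reordering b first; b_sorted, by_position and by_label are illustrative names.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64), index=a.index)
b.iloc[-1] = np.nan

# Same data as b, but with the rows in a different order
b_sorted = b.sort_index()

# np.where ignores labels: position 2 of a ('d') is filled from position 2 of b_sorted ('c')
by_position = np.where(pd.isnull(a), b_sorted, a)
print(by_position[2])  # 3.0

# combine_first matches labels: 'd' in a is filled from 'd' in b_sorted
by_label = a.combine_first(b_sorted)
print(by_label['d'])   # 2.0
```

When both objects already share the same index in the same order, as in the example above, the two give the same values.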

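Note: np.where(condition, x, y) returns x where condition holds and y otherwise. In the code above, wherever an element of a is NaN, the corresponding element of b is used.
For a DataFrame, combine_first does the same thing column by column: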

df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})
df1

	a	b	c
0	1.0	NaN	2
1	NaN	2.0	6
2	5.0	NaN	10
3	NaN	6.0	14

df2

	a	b
0	5.0	NaN
1	4.0	3.0
2	NaN	4.0
3	3.0	6.0
4	7.0	8.0

df1.combine_first(df2)

	a	b	c
0	1.0	NaN	2.0
1	4.0	2.0	6.0
2	5.0	4.0	10.0
3	3.0	6.0	14.0
4	7.0	8.0	NaN

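8.3 Reshaping and Pivoting

Reshaping with Hierarchical Indexing

The two main methods are:
- stack(): pivots the columns into rows
- unstack(): pivots the rows into columns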

data = pd.DataFrame(np.arange(6).reshape((2,3)),
                   index=pd.Index(['Ohio','Colordo'],name='state'),
                   columns=pd.Index(['one','two','three'],name='number'))

data

number	one	two	three
state
Ohio	0	1	2
Colordo	3	4	5

result = data.stack()
result

state    number
Ohio     one       0
         two       1
         three     2
Colordo  one       3
         two       4
         three     5
dtype: int32

result.unstack()

number	one	two	three
state
Ohio	0	1	2
Colordo	3	4	5

By default, unstack operates on the innermost level (as does stack). Pass a level number or name to unstack a different level:

result.unstack(0)

state	Ohio	Colordo
number
one	0	3
two	1	4
three	2	5

result.unstack('state')

state	Ohio	Colordo
number
one	0	3
two	1	4
three	2	5

Unstacking can introduce missing data when some values of the level are not found in every subgroup:

s1 = pd.Series([0,1,2,3],index=['a','b','c','d'])
s2 = pd.Series([4,5,6],index=['c','d','e'])
data = pd.concat([s1,s2],keys=['one','two'])
data

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

data.unstack()

	a	b	c	d	e
one	0.0	1.0	2.0	3.0	NaN
two	NaN	NaN	4.0	5.0	6.0

data.unstack().stack()

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

data.unstack().stack(dropna=False)

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
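
If NaN is not the desired placeholder for the missing combinations, unstack also accepts a fill_value argument. A minimal sketch using the same data as above; wide is an illustrative name and 0 is an arbitrary choice of fill value.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data = pd.concat([s1, s2], keys=['one', 'two'])

# Combinations absent from the long data receive fill_value instead of NaN
wide = data.unstack(fill_value=0)

print(wide.loc['two', 'a'])    # 0
print(wide.loc['one', 'c'])    # 2
print(wide.isna().sum().sum()) # 0
```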

When you unstack a DataFrame, the level being unstacked becomes the lowest level in the result:

df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))
df

	side	left	right
state	number
Ohio	one	0	5
	two	1	6
	three	2	7
Colordo	one	3	8
	two	4	9
	three	5	10

df.unstack('state')

side	left		right
state	Ohio	Colordo	Ohio	Colordo
number
one	0	3	5	8
two	1	4	6	9
three	2	5	7	10

# When calling stack(), you can give the name of the axis level to stack
df.unstack('state').stack('side')

	state	Colordo	Ohio
number	side
one	left	3	0
one	right	8	5
two	left	4	1
two	right	9	6
three	left	5	2
three	right	10	7

Pivoting "Long" to "Wide" Format

Multiple time series are often stored in databases and CSV files in so-called "long" or "stacked" format.

data = pd.read_csv('data/examples/macrodata.csv')
data.head()

	year	quarter	realgdp	realcons	realinv	realgovt	realdpi	cpi	m1	tbilrate	unemp	pop	infl	realint
0	1959.0	1.0	2710.349	1707.4	286.898	470.045	1886.9	28.98	139.7	2.82	5.8	177.146	0.00	0.00
1	1959.0	2.0	2778.801	1733.7	310.859	481.301	1919.7	29.15	141.7	3.08	5.1	177.830	2.34	0.74
2	1959.0	3.0	2775.488	1751.8	289.226	491.260	1916.4	29.35	140.5	3.82	5.3	178.657	2.74	1.09
3	1959.0	4.0	2785.204	1753.7	299.356	484.052	1931.3	29.37	140.0	4.33	5.6	179.386	0.27	4.06
4	1960.0	1.0	2847.699	1770.5	331.722	462.199	1955.5	29.54	139.6	3.50	5.2	180.007	2.31	1.19

periods = pd.PeriodIndex(year=data.year,quarter=data.quarter,name='data')
columns = pd.Index(['realgdp','infl','unemp'],name='item')
data = data.reindex(columns=columns)
data.index = periods.to_timestamp('D','end')
ldata = data.stack().reset_index().rename(columns={0:'value'})

ldata[:5]

	data	item	value
0	1959-03-31	realgdp	2710.349
1	1959-03-31	infl	0.000
2	1959-03-31	unemp	5.800
3	1959-06-30	realgdp	2778.801
4	1959-06-30	infl	2.340

pivoted = ldata.pivot(index='data', columns='item', values='value')  # keyword arguments are required in pandas 2.0+

pivoted[:5]

item	infl	realgdp	unemp
data
1959-03-31	0.00	2710.349	5.8
1959-06-30	2.34	2778.801	5.1
1959-09-30	2.74	2775.488	5.3
1959-12-31	0.27	2785.204	5.6
1960-03-31	2.31	2847.699	5.2
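
pivot can be read as a shortcut for building a hierarchical index with set_index and then calling unstack. The sketch below checks that equivalence on a small hypothetical long-format frame; the dates and values are made up for illustration and are not taken from macrodata.csv.

```python
import pandas as pd

# Hypothetical long-format data (made-up values, for illustration only)
ldata = pd.DataFrame({
    'date': ['2020-03-31', '2020-03-31', '2020-06-30', '2020-06-30'],
    'item': ['infl', 'unemp', 'infl', 'unemp'],
    'value': [1.0, 5.0, 2.0, 6.0],
})

pivoted = ldata.pivot(index='date', columns='item', values='value')

# Same reshaping in two explicit steps: hierarchical index, then unstack
unstacked = ldata.set_index(['date', 'item'])['value'].unstack('item')

print(pivoted.equals(unstacked))            # True
print(pivoted.loc['2020-06-30', 'unemp'])   # 6.0
```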

Pivoting "Wide" to "Long" Format

df = pd.DataFrame({'key':['foo','bar','baz'],
                  'A':[1,2,3],
                  'B':[4,5,6],
                  'C':[7,8,9]})
df

	key	A	B	C
0	foo	1	4	7
1	bar	2	5	8
2	baz	3	6	9

melted = pd.melt(df,['key'])
melted

	key	variable	value
0	foo	A	1
1	bar	A	2
2	baz	A	3
3	foo	B	4
4	bar	B	5
5	baz	B	6
6	foo	C	7
7	bar	C	8
8	baz	C	9
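
melt can also be restricted to a subset of columns and told what to call the two columns it generates. A short sketch on the same df; the names 'column' and 'reading' are arbitrary choices made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

# Melt only columns A and B, and name the generated columns explicitly
melted = pd.melt(df, id_vars=['key'], value_vars=['A', 'B'],
                 var_name='column', value_name='reading')

print(melted.shape)          # (6, 3)
print(list(melted.columns))  # ['key', 'column', 'reading']
```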

reshaped = melted.pivot(index='key', columns='variable', values='value')  # keyword arguments are required in pandas 2.0+
reshaped

variable	A	B	C
key
bar	2	5	8
baz	3	6	9
foo	1	4	7

reshaped.reset_index()

variable	key	A	B	C
0	bar	2	5	8
1	baz	3	6	9
2	foo	1	4	7
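
One caveat when pivoting back to wide format: pivot needs every (index, columns) pair to be unique and raises ValueError otherwise. If duplicates are expected, pivot_table with an aggregation function is one alternative. The frame dup below is hypothetical, and the choice of 'mean' is only an example.

```python
import pandas as pd

# Hypothetical long data where ('foo', 'A') appears twice
dup = pd.DataFrame({'key': ['foo', 'foo', 'bar'],
                    'variable': ['A', 'A', 'A'],
                    'value': [1, 3, 2]})

# pivot cannot decide which of the duplicate values to keep
try:
    dup.pivot(index='key', columns='variable', values='value')
    pivot_failed = False
except ValueError:
    pivot_failed = True
print(pivot_failed)  # True

# pivot_table aggregates the duplicates instead (here: their mean)
table = dup.pivot_table(index='key', columns='variable',
                        values='value', aggfunc='mean')
print(table.loc['foo', 'A'])  # 2.0
```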
