Combining and merging datasets
(1) Database-style DataFrame merges, based on keys
Datasets are merged or joined on one or more keys.
Suppose we have the following datasets:
In [15]: df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
In [16]: df2 = DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
In [17]: df1
Out[17]:
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b
In [18]: df2
Out[18]:
   data2 key
0      0   a
1      1   b
2      2   d
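If merge is called without extra arguments, it joins on the columns that have the same name in both datasets. The on parameter names the key column explicitly; the following code gives the same result as pd.merge(df1, df2).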
In [20]: pd.merge(df1, df2, on='key')
Out[20]: data1 key data2
0 2 a 0
1 4 a 0
2 5 a 0
3 0 b 1
4 1 b 1
5 6 b 1
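As a quick check of the statement above, the following sketch compares the implicit and the explicit form of the merge. The variable names implicit, explicit, same, and kept_keys are illustrative only, and the row order printed by a current pandas release may differ from the older output shown above.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# No `on` argument: merge joins on the column name the two frames share ('key').
implicit = pd.merge(df1, df2)
# Same join with the key column named explicitly.
explicit = pd.merge(df1, df2, on='key')

same = implicit.equals(explicit)

# The default inner join keeps only keys present in both frames,
# so 'c' (only in df1) and 'd' (only in df2) are dropped.
kept_keys = sorted(explicit['key'].unique())
print(same, kept_keys)
```

Naming the key with on is usually the safer habit, because the implicit form silently changes behavior when the frames later gain another shared column name.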
[Merging on differently named columns] Use left_on and right_on to name the key column of each object.
In [21]: df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
In [22]: df4 = DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})
In [23]: pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Out[23]:
data1 lkey data2 rkey
0 2 a 0 a
1 4 a 0 a
2 5 a 0 a
3 0 b 1 b
4 1 b 1 b
5 6 b 1 b
merge performs an inner join by default; the how parameter can be set to 'outer', 'left', or 'right'.
In [24]: pd.merge(df1, df2, how='outer')
Out[24]: data1 key data2
0 2 a 0
1 4 a 0
2 5 a 0
3 0 b 1
4 1 b 1
5 6 b 1
6 3 c NaN
7 NaN d 2
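The outer join above keeps the keys of both sides. The remaining two options can be sketched the same way; left_join and right_join are illustrative names, and df1 and df2 are rebuilt here so the block runs on its own.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# how='left': every row of df1 is kept; 'c' has no partner in df2,
# so its data2 value is NaN.
left_join = pd.merge(df1, df2, on='key', how='left')

# how='right': every key of df2 is kept; 'd' has no partner in df1,
# so its data1 value is NaN.
right_join = pd.merge(df1, df2, on='key', how='right')

print(left_join)
print(right_join)
```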
To merge on multiple keys, pass a list of column names to on.
In [31]: left = DataFrame({'key1': ['foo', 'foo', 'bar'], 'key2': ['one', 'two', 'one'],'lval': [1, 2, 3]})
In [32]: right = DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'], 'key2': ['one', 'one', 'one', 'two'], 'rval': [4, 5, 6, 7]})
In [33]: pd.merge(left, right, on=['key1', 'key2'], how='outer')
Out[33]:
key1 key2 lval rval
0 bar one 3 6
1 bar two NaN 7
2 foo one 1 4
3 foo one 1 5
4 foo two 2 NaN
(2) Merging datasets on the index
In [36]: left1 = DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
In [37]: right1 = DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
In [38]: left1
Out[38]:
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5
In [39]: right1
Out[39]:
   group_val
a        3.5
b        7.0
In the example above, right1 has an explicit index. Pass left_index=True or right_index=True to use an index as the merge key in place of left_on or right_on.
In [40]: pd.merge(left1, right1, left_on='key', right_index=True)
Out[40]: key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
For a [hierarchical index], the key columns must be passed as a list.
In [42]: lefth = DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'key2': [2000, 2001, 2002, 2001, 2002], 'data': np.arange(5.)})
In [43]: righth = DataFrame(np.arange(12).reshape((6, 2)), index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'], [2001, 2000, 2000, 2000, 2001, 2002]], columns=['event1', 'event2'])
In [44]: lefth
Out[44]:
   data    key1  key2
0     0    Ohio  2000
1     1    Ohio  2001
2     2    Ohio  2002
3     3  Nevada  2001
4     4  Nevada  2002
In [45]: righth
Out[45]:
             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11
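The text shows the two frames but not the merge call itself. A minimal sketch of what the call could look like, assuming the two key columns of lefth are matched in order against the two index levels of righth (merged is an illustrative name):

```python
import numpy as np
import pandas as pd

lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])

# The list in left_on lines up with the index levels of righth:
# key1 <-> first level (state), key2 <-> second level (year).
merged = pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
print(merged)
```

Because the default is an inner join, the (Nevada, 2002) row of lefth has no match in righth and is dropped, while (Ohio, 2000) matches two rows of righth and therefore appears twice.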
(3) The more convenient join method. It works on several DataFrame objects that have the same or similar indexes but no overlapping column names.
Given the following two datasets:
In [50]: left2
Out[50]:
   Ohio  Nevada
a     1       2
c     3       4
e     5       6
In [51]: right2
Out[51]:
   Missouri  Alabama
b         7        8
c         9       10
d        11       12
e        13       14
Call join and choose the join type with the how parameter (how='outer' gives an outer join).
In [53]: left2.join(right2, how='outer')
Out[53]: Ohio Nevada Missouri Alabama
a 1 2 NaN NaN
b NaN NaN 7 8
c 3 4 9 10
d NaN NaN 11 12
e 5 6 13 14
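The definitions of left2 and right2 are not given in the text. The sketch below rebuilds them from the displayed output, which is an assumption about how they were created, and then compares join with the equivalent index-based merge. The names joined, via_merge, and same are illustrative.

```python
import pandas as pd

# Reconstructed from the printed frames above (assumed definitions).
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'], columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13., 14.]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])

# join aligns the two frames on their indexes.
joined = left2.join(right2, how='outer')

# The same operation spelled out with merge.
via_merge = pd.merge(left2, right2, left_index=True, right_index=True,
                     how='outer')

# Sort both results first so the comparison does not depend on row order.
same = joined.sort_index().equals(via_merge.sort_index())
print(joined)
print(same)
```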
The on parameter names a column to use as the join key. The column must belong to the DataFrame that calls join, such as the "key" column of left1 in the code below.
In [54]: left1.join(right1, on='key')
Out[54]: key value group_val
0 a 0 3.5
1 b 1 7.0
2 a 2 3.5
3 a 3 3.5
4 b 4 7.0
5 c 5 NaN
To [join several DataFrame objects at once], pass them in a list.
In [55]: another = DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]], index=['a', 'c', 'e', 'f'], columns=['New York', 'Oregon'])
In [57]: left2.join([right2, another], how='outer')
Out[57]: Ohio Nevada Missouri Alabama New York Oregon
a 1 2 NaN NaN 7 8
b NaN NaN 7 8 NaN NaN
c 3 4 9 10 9 10
d NaN NaN 11 12 NaN NaN
e 5 6 13 14 11 12
f NaN NaN NaN NaN 16 17
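Joining a list of frames on their indexes produces the same table as concatenating them along the columns. The sketch below checks this, again assuming the reconstructed definitions of left2 and right2 (they are not shown in the text); joined, stacked, and same are illustrative names.

```python
import pandas as pd

# left2 and right2 are rebuilt from the printed output (assumed definitions).
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'], columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13., 14.]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                       index=['a', 'c', 'e', 'f'],
                       columns=['New York', 'Oregon'])

# Outer join of three frames on their indexes.
joined = left2.join([right2, another], how='outer')

# Column-wise concatenation; concat uses an outer join by default.
stacked = pd.concat([left2, right2, another], axis=1)

# Sort the rows before comparing so row order does not matter.
same = joined.sort_index().equals(stacked.sort_index())
print(same)
```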
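(4) Concatenating datasets along an axis (horizontally or vertically).
NumPy has a concatenate function, which joins along axis=0 by default; the example below passes axis=1 to join along the columns instead: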
In [58]: arr = np.arange(12).reshape((3, 4))
In [59]: arr
Out[59]: array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [60]: np.concatenate([arr, arr], axis=1)
Out[60]: array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
pandas has a concat function. Take the following three Series objects, whose indexes do not overlap:
In [61]: s1 = Series([0, 1], index=['a', 'b'])
In [62]: s2 = Series([2, 3, 4], index=['c', 'd', 'e'])
In [63]: s3 = Series([5, 6], index=['f', 'g'])
Call concat with the objects in a list:
In [64]: pd.concat([s1, s2, s3])
Out[64]: a 0
b 1
c 2
d 3
e 4
f 5
g 6
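When the objects only partly overlap on the other axis, the join parameter [changes how their indexes are combined]. The default keeps the union of the labels and join='inner' keeps the intersection: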
In [66]: s4 = pd.concat([s1 * 5, s3])
In [67]: pd.concat([s1, s4], axis=1)
Out[67]:
     0  1
a    0  0
b    1  5
f  NaN  5
g  NaN  6
In [68]: pd.concat([s1, s4], axis=1, join='inner')
Out[68]:
   0  1
a  0  0
b  1  5
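To [specify the labels to use on the other axis], older pandas versions accept the join_axes parameter. It was removed in pandas 1.0, where calling reindex on the concatenated result achieves the same thing.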
In [69]: pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
Out[69]: 0 1
a 0 0
c NaN NaN
b 1 5
e NaN NaN
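Since join_axes is no longer available in pandas 1.0 and later, the call above fails on a current installation. A minimal sketch of one replacement, assuming that reindexing the concatenated result is acceptable (result is an illustrative name):

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1 * 5, s3])

# Concatenate first, then select (and order) the wanted row labels.
# Labels that are absent from the data, such as 'c' and 'e', become NaN rows.
result = pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])
print(result)
```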
To [create a hierarchical index], use the keys parameter.
In [70]: result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])
In [71]: result
Out[71]:
one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
If axis=1 is passed together with keys, the keys become the column headers of the result.
In [73]: pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])
Out[73]: one two three
a 0 NaN NaN
b 1 NaN NaN
c NaN 2 NaN
d NaN 3 NaN
e NaN 4 NaN
f NaN NaN 5
g NaN NaN 6
(5) Combining overlapping data
NumPy's where function works like a vectorized if-else.
In [87]: a
Out[87]:
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
In [88]: b
Out[88]:
f      0
e      1
d      2
c      3
b      4
a    NaN
In [89]: np.where(pd.isnull(a), b, a)
Out[89]:
f    0.0
e    2.5
d    2.0
c    3.5
b    4.5
a    NaN
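The definitions of a and b are not shown in the text. The sketch below rebuilds them from the printed values, which is an assumption, and repeats the np.where call. Note that in current pandas np.where returns a plain NumPy array rather than a labeled Series, so the index has to be reattached by hand; filled and as_series are illustrative names.

```python
import numpy as np
import pandas as pd

labels = ['f', 'e', 'd', 'c', 'b', 'a']

# Reconstructed from the printed output above (assumed definitions).
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=labels)
b = pd.Series(np.arange(len(a), dtype=np.float64), index=labels)
b.iloc[-1] = np.nan

# Element-wise if-else: take b where a is missing, otherwise keep a.
filled = np.where(pd.isnull(a), b, a)

# Reattach the labels, since np.where works on the underlying values only.
as_series = pd.Series(filled, index=a.index)
print(as_series)
```

The last element stays NaN because both a and b are missing there.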
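For Series objects, use the combine_first method. The code below takes b without its last two elements and a from its third element onward. Where both have a value for a label, the value from b is used; a only fills the gaps.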
In [90]: b[:-2].combine_first(a[2:])
Out[90]: a NaN
b 4.5
c 3.0
d 2.0
e 1.0
f 0.0
For DataFrame objects, combine_first applies the same logic column by column.
In [91]: df1 = DataFrame({'a': [1., np.nan, 5., np.nan], 'b': [np.nan, 2., np.nan, 6.], 'c': range(2, 18, 4)})
In [92]: df2 = DataFrame({'a': [5., 4., np.nan, 3., 7.], 'b': [np.nan, 3., 4., 6., 8.]})
In [93]: df1.combine_first(df2)
Out[93]: a b c
0 1 NaN 2
1 4 2 6
2 5 4 10
3 3 6 14
4 7 8 NaN