05数据转换学习笔记

最新推荐文章于 2024-11-10 08:00:00 发布

达瓦里氏吨吨吨

最新推荐文章于 2024-11-10 08:00:00 发布

阅读量197

点赞数

CC 4.0 BY-SA版权

分类专栏： Python

本文链接：https://blog.youkuaiyun.com/fantacy10000/article/details/82056375

Python 专栏收录该内容

10 篇文章

订阅专栏

博客展示了食物相关信息，包含食物名称、盎司数，部分还列出了所属动物，如bacon对应pig，Pastrami对应cow等。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

一.合并数据集
1.数据库风格的DataFrame合并（merge和join）
import pandas as pd
import numpy as np
tr1=pd.DataFrame({'key':['b','b','a','c','a','a','b'],'data1':range(7)})
tr2=pd.DataFrame({'key':['a','b','d'],'data2':range(3)})
print(tr1)
print(tr2)
tr3=pd.merge(tr1,tr2,on='key',how='outer')
print(tr3)
默认情况下，merge做的是‘inner’连接，结果中的键是交集，而‘outer’连接是外连接，取的是键的并集。其他还有left，right
merge函数的参数P191

   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b

   data2 key
0      0   a
1      1   b
2      2   d

   data1 key  data2
0    0.0   b    1.0
1    1.0   b    1.0
2    6.0   b    1.0
3    2.0   a    0.0
4    4.0   a    0.0
5    5.0   a    0.0
6    3.0   c    NaN
7    NaN   d    2.0

2.索引上的合并
left1=pd.DataFrame({'key':['a','b','c'],'value':range(3)})
right1=pd.DataFrame({'group_val':[3,7]},index=['a','b'])
tr3=pd.merge(left1,right1,left_on='key',right_index=True)
关注这里right_index的使用，因为合并操作时，right1的索引用做了连接键，因此需要使用right_index来进行声明！！！
print(left1)
print(right1)
print(tr3)

  key  value
0   a      0
1   b      1
2   c      2

   group_val
a          3
b          7

  key  value  group_val
0   a      0          3
1   b      1          7

3.轴向连接（就是拼接啊）
3.1 numpy自带的合并原始numpy数组的函数(concatenate)
arr=np.arange(12).reshape((3,4))
print(arr)
arr1=np.concatenate([arr,arr],axis=1)#按列拼接
print(arr1)
arr2=np.concatenate([arr,arr],axis=0)#按行拼接
print(arr2)
3.2 pandas的concat函数
arr1=pd.Series([1,2,3,4],index=['a','b','c','d'])
arr2=pd.Series([1,4],index=['b','c'])
arr3=pd.Series([3,5,6],index=['d','e','f'])
arr4=pd.concat([arr1,arr2,arr3],axis=1)#按行拼接
print(arr4)
concat函数的参数见P198

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
 
[[ 0  1  2  3  0  1  2  3]
 [ 4  5  6  7  4  5  6  7]
 [ 8  9 10 11  8  9 10 11]]
 
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
 
     0    1    2
a  1.0  NaN  NaN
b  2.0  1.0  NaN
c  3.0  4.0  NaN
d  4.0  NaN  3.0
e  NaN  NaN  5.0
f  NaN  NaN  6.0

4.合并重叠数据
4.1 numpy的where函数(相当于if/else)
a=pd.Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index=['f','e','d','c','b','a'])
b=pd.Series(np.arange(len(a),dtype=np.float64),index=['f','e','d','c','b','a'])
c=np.where(pd.isnull(a),b,a)
4.2 Series的combine_first函数效果和上面的where函数一致，而且会实现数据对齐
print(b[:-2].combine_first(a[2:]))

a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64

二.重塑和轴向旋转
stack()方法：将数据的列“旋转”为行；
unstack()方法：将数据的行“旋转”为列；
result=data.stack()
result.unstack()
注：在进行unstack操作时，作为旋转轴的级别将会成为结果中的最低级别；

三.数据转换
1.移除重复数据（duplicated方法和drop_duplicates方法）
data=pd.DataFrame({'k1':['one']*3+['two']*4,'k2':[1,1,2,3,3,4,4]})
print(data.duplicated())         返回表示该行是否重复的布尔值
print(data.drop_duplicates())    返回移除重复行后的结果

0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4

2.利用函数或映射进行数据转换（针对Series）
就是说可以利用函数来实现根据数据值来实现相关的转换工作,Series的map方法可以接收一个函数或者有映射关系的字典

例1.字典
data=pd.DataFrame({'food':['bacon','pulled pork','bacon','Pastrami'],'ounces':[3, 4, 12, 23]})
print(data)
meat_to_animal={'bacon':'pig','pulled pork':'pig','Pastrami':'cow'}
data['animal']=data['food'].map(str.lower).map(meat_to_animal)
print(data)

food ounces
0 bacon 3
1 pulled pork 4
2 bacon 12
3 Pastrami 23

food ounces animal
0 bacon 3 pig
1 pulled pork 4 pig
2 bacon 12 pig
3 Pastrami 23 cow

例2.函数
data=pd.DataFrame({'food':['bacon','pulled pork','bacon','Pastrami'],'ounces':[3, 4, 12, 23]})
print(data)
meat_to_animal={'bacon':'pig','pulled pork':'pig','pastrami':'cow'}
data['animal']=data['food'].map(lambda x: meat_to_animal[x.lower()])
print(data)

food ounces
0 bacon 3
1 pulled pork 4
2 bacon 12
3 Pastrami 23

food ounces animal
0 bacon 3 pig
1 pulled pork 4 pig
2 bacon 12 pig
3 Pastrami 23 cow

3.替换值（replace）
data=pd.Series([1,2,3,4,-999,-1000,-999,1])
print(data.replace(-999,np.nan))
print(data.replace([-999,-1000],[np.nan,1]))

0       1.0
1       2.0
2       3.0
3       4.0
4       NaN
5   -1000.0
6       NaN
7       1.0
dtype: float64

0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
5    1.0
6    NaN
7    1.0
dtype: float64

4.重命名轴索引
data=pd.DataFrame(np.arange(12).reshape(3,4),index=['ohio','colorado','new york'],columns=['one','two','three','four'])
print(data)
4.1.使用rename方法，创建数据集的转换版（复制版）
print(data.rename(index=str.title,columns=str.upper))
print(data.rename(index={'Ohio':'invida'},columns={'four':'hello'}))
4.2.使用rename方法，就地修改某个数据集（原地版）
data=pd.DataFrame(np.arange(12).reshape(3,4),index=['ohio','colorado','new york'],columns=['one','two','three','four'])
_=data.rename(index={'ohio':'invida'},inplace=True)
print(data)

          one  two  three  four
ohio        0    1      2     3
colorado    4    5      6     7
new york    8    9     10    11

          ONE  TWO  THREE  FOUR
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

          one  two  three  hello
ohio        0    1      2      3
colorado    4    5      6      7
new york    8    9     10     11

          one  two  three  four
invida      0    1      2     3
colorado    4    5      6     7
new york    8    9     10    11

5.离散化和面元划分
直观上理解就是对数据进行区间划分，以获得不同的分组。
ages=[20,22,25,27,23,37,61,45,41,43]
bins=[18,25,35,60,100]
cats=pd.cut(ages,bins)
cut函数的返回值是一个特殊的Categorical对象，其含有一个表示不同分类名称的levels数组以及一个labels属性
cats.labels
cats.levels
前面是指定划分点的，当然也可以给定段数而不给定划分点，其可以自己确定：
data=np.random.rand(20)
pd.cut(data,4,precision=2)
qcut函数是一个类似于cut的函数，其可以根据样本分位数情况进行自动划分，使得划分后各部分的样本数量相同。
data=np.random.randn(1000)#正态分布
cats=pd.qcut(data,4)
同样的，也可以收到设定分位数：
pd.qcut(data,[0,0.1,0.5,0.9.1.0])

6.检测和过滤异常值
data=pd.DataFrame(np.random.randn(1000,4))
data.describe()
col=data[3]
col[np.abs(col)>3]#找出某列中绝对值大小超过3的值
data[(np.abs(data)>3).any(1)]#选出全部含有‘超过3或-3的值’的行

	0	1	2	3
18	0.609418	3.324428	-0.108789	0.171618
99	-1.066103	-0.353316	-0.365273	3.555021
329	-1.657900	-1.991298	0.901694	3.395590
438	-0.473641	1.164948	0.604448	3.638542
500	0.572468	-0.243033	0.293410	-3.010369
513	3.265122	-1.466344	0.467078	-0.578180
614	3.213056	0.014980	0.214329	-0.591175
764	1.928544	-0.904684	0.339387	-3.383201
858	3.284090	0.404298	0.687844	0.414698

7.排列和随机采样
#numpy.random.permutation函数实现对Series或DataFrame的列的排列工作，通过对需要排列的轴的长度调用permutation，可产生一个表示新顺序的整数数组
df=pd.DataFrame(np.arange(5*4).reshape(5,4))
sampler=np.random.permutation(5)
print(sampler)
df.take(sampler)
#相当于permutation基于给定长度产生一组序列，而后的数组基于该序列，采样take函数来对数组重新进行排列

[2 3 0 4 1]

	0	1	2	3
2	8	9	10	11
3	12	13	14	15
0	0	1	2	3
4	16	17	18	19
1	4	5	6	7

四.字符串操作
1.python自有的字符串处理方法：split，strip，+，join，in，index，find，count，replace

2.正则表达式

3.pandas中矢量化的字符串函数P224