Pandas介绍
Python长期以来一直非常适合数据整理和准备,但对于数据分析和建模则不那么重要。pandas有助于填补这一空白,使您能够在Python中执行整个数据分析工作流程,而无需切换到更像域的特定语言,如R. pandas是一个Python包,提供快速,灵活和富有表现力的数据结构
创建对象
导入包:
import numpy as np
import pandas as pd
创建Series
- 和 NumPy 数组不同,Pandas 的 Series 能存放各种不同类型的对象。
- 从字典创建Series,pandas 会自动把字典的键值设置成 Series 的 index,并将对应的 values 放在和索引对应的 data 里。
例子代码:
"""创建Series"""
a=pd.Series([12,4,5,np.NAN,23]) #可以添加index=,但是不能添加columns=
a2=pd.Series(1,index=[0,1,2,3],dtype="float32") #不够的自动补齐
b_dict={'a':50,'b':24,'c':89,'d':34}
b=pd.Series(b_dict)
type(b) #<class 'pandas.core.series.Series'>
print(a,a2,b)
"""
0 12.0
1 4.0
2 5.0
3 NaN
4 23.0
dtype: float64
0 1.0
1 1.0
2 1.0
3 1.0
dtype: float32
a 50
b 24
c 89
d 34
dtype: int64
"""
创建Dataframe
例子代码:
c=pd.DataFrame([[23,5,6],[2,3,5]])
d=pd.DataFrame([[23,5,6],[2,3,5]],index=["第一行","第二行"],columns=["第一列","第二列","第三列"])#或者用下面的np数组创建:
e=pd.DataFrame(np.array([[23,5,6],[2,3,5]]),index=["第一行","第二行"],columns=["第一列","第二列","第三列"])
print(c,"\n\n",d,"\n\n",e)
"""
0 1 2
0 23 5 6
1 2 3 5
第一列 第二列 第三列
第一行 23 5 6
第二行 2 3 5
第一列 第二列 第三列
第一行 23 5 6
第二行 2 3 5
"""
"""通过传递带有日期时间索引和标记列的NumPy数组来创建"""
dates = pd.date_range('20130101', periods=6)
f=pd.DataFrame(np.random.randn(6,4),index=dates,columns=list("ABCD"))
print(dates,f)
"""
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
A B C D
2013-01-01 1.176395 -0.566222 -0.995993 0.686690
2013-01-02 -0.161430 -0.020617 -1.256424 1.061952
2013-01-03 0.831867 -0.814536 1.846343 -1.006396
2013-01-04 0.912144 -0.007161 -0.344765 -0.567658
2013-01-05 0.289082 -0.173980 -0.410463 -0.895588
2013-01-06 -0.142269 0.380692 -0.691355 0.474019
"""
"""通过传递可以转换为类似系列的对象的dict来创建。"""
g=pd.DataFrame({'A':3.,'B':pd.Series(1,index=[0,1,2,3],dtype="float32"),
'C':np.array([4,23,2,2],dtype="int32")}) #A列数目不够,自动填充补齐
print(g,"\n",g.dtypes)
"""
A B C
0 3.0 1.0 4
1 3.0 1.0 23
2 3.0 1.0 2
3 3.0 1.0 2
A float64
B float32
C int32
dtype: object
"""
数据查看
- Dataframe.head(),Dataframe.tail(),输出头几行或后几行
- Dataframe.index,Dataframe.columns,显示index和columns
- Dataframe.describe(),显示数据的快速统计摘要(计数、均值、最大最小值、四分位数)
- Dataframe.T,转置
- Dataframe.sort_index(axis=1,ascending=False) ,按轴排序
- Dataframe.sort_values(by=‘B’) ,按照列值排序('B’列)
- Dataframe.to_numpy(),转成numpy数组
例子代码:
a=pd.DataFrame({'A':[1,2,4,5,6],'B':[3,4,6,2,5],'C':[3,5,24,1,2]},index=["first","second","third","fourth","fifth"])
a.head(1)
a.tail(2)
print(a.index,"\n",a.columns)
a.describe()
a.T
a.sort_index(axis=1,ascending=False)
a.sort_index(axis=0,ascending=True)
a.sort_values('B')
a.to_numpy()
"""
A B C
first 1 3 3
A B C
fourth 5 2 1
fifth 6 5 2
Index(['first', 'second', 'third', 'fourth', 'fifth'], dtype='object')
Index(['A', 'B', 'C'], dtype='object')
A B C
count 5.000000 5.000000 5.000000
mean 3.600000 4.000000 7.000000
std 2.073644 1.581139 9.617692
min 1.000000 2.000000 1.000000
25% 2.000000 3.000000 2.000000
50% 4.000000 4.000000 3.000000
75% 5.000000 5.000000 5.000000
max 6.000000 6.000000 24.000000
first second third fourth fifth
A 1 2 4 5 6
B 3 4 6 2 5
C 3 5 24 1 2
C B A
first 3 3 1
second 5 4 2
third 24 6 4
fourth 1 2 5
fifth 2 5 6
A B C
fifth 6 5 2
first 1 3 3
fourth 5 2 1
second 2 4 5
third 4 6 24
A B C
fourth 5 2 1
first 1 3 3
second 2 4 5
fifth 6 5 2
third 4 6 24
array([[ 1, 3, 3],
[ 2, 4, 5],
[ 4, 6, 24],
[ 5, 2, 1],
[ 6, 5, 2]], dtype=int64)
"""
数据选择
- Dataframe[‘A’],选择一个名为”A“的 列,产生一个Series
- Dataframe[0:2],对 行 进行切片
- DataFrame.loc, 通过标签或布尔数字访问行和列
- DataFrame.iloc, 按照位置选择
- DataFrame.iat,快速访问标量
- 布尔索引:
- DataFrame[DataFrame.A > 0],通过单个列的值来选择数据
- DataFrame[DataFrame[“A”].isin([0,1]) 选取A列值为0或1的
例子代码:
a=pd.DataFrame({'A':[1,2,4,5,6],'B':[3,4,6,2,5],'C':[3,5,24,1,2]},index=["first","second","third","fourth","fifth"])
b=a["A"]
c=a[0:2]
d=a["first":"third"]
print(b,c,d)
"""
first 1
second 2
third 4
fourth 5
fifth 6
Name: A, dtype: int64
A B C
first 1 3 3
second 2 4 5
A B C
first 1 3 3
second 2 4 5
third 4 6 24
"""
"""进行切片查找"""
e=a.loc["first"]
f=a.loc[["first","second"]]
g=a.loc[:,"A"]
h=a.loc[:,["A","B"]]
i=a.loc[["first","fifth"],["A","C"]]
print(e,f,g,h,i)
"""
A 1
B 3
C 3
Name: first, dtype: int64
A B C
first 1 3 3
second 2 4 5
first 1
second 2
third 4
fourth 5
fifth 6
Name: A, dtype: int64
A B
first 1 3
second 2 4
third 4 6
fourth 5 2
fifth 6 5
A C
first 1 3
fifth 6 2
"""
"""按照位置查找"""
j=a.iloc[3] #选取第4行
k=a.iloc[3:4] #左开右闭,还是选取第4行
l=a.iloc[[1,3,4],[0,2]]
m=a.iloc[:,[1,2]]
"""
A 5
B 2
C 1
Name: fourth, dtype: int64
A B C
fourth 5 2 1
A C
second 2 5
fourth 5 1
fifth 6 2
B C
first 3 3
second 4 5
third 6 24
fourth 2 1
fifth 5 2
"""
n=a.iat[2,2] #查找标量
o=a[a["A"].isin([0,1])] #选取A列值为0或者1的
p=a[a.A>0] #布尔值查找
"""
24
A B C
first 1 3 3
A B C
first 1 3 3
second 2 4 5
third 4 6 24
fourth 5 2 1
fifth 6 5 2
"""
设定值
例子代码:
a=pd.DataFrame({'A':[1,2,4,5,6],'B':[3,4,6,2,5],'C':[3,5,24,1,2]},index=["first","second","third","fourth","fifth"])
print(a)
"""
A B C
first 1 3 3
second 2 4 5
third 4 6 24
fourth 5 2 1
fifth 6 5 2"""
b=pd.Series([23,4,5,6,3],index=["second","third","fourth","fifth","six"])
a["D"]=b #会自动根据index对齐数据,其余的扔掉
a.at["first","A"]=0 """按照标签设定值"""
a.iat[0,1]=0 """按照位置设定值"""
a.loc[:,"F"]=np.arange(5) """使用numpy数组设定值"""
print(a)
"""
A B C D F
first 0 0 3 NaN 0
second 2 4 5 23.0 1
third 4 6 24 4.0 2
fourth 5 2 1 5.0 3
fifth 6 5 2 6.0 4"""
缺少数据处理
- pandas使用np.nan来表示缺失的数据
- Dataframe.dropna(how=“any”),删除任何缺少数据的行
- Dataframe.fillna(value=5),填写缺失的数据
- pandas.isna(Dataframe),判断是否为nan值,返回布尔值
例子代码:
a=pd.DataFrame([[12,23,52,np.nan],[23,np.nan,2,5],[32,5,3,2]],columns=["A","B","C","D"])
print(a)
"""
A B C D
0 12 23.0 52 NaN
1 23 NaN 2 5.0
2 32 5.0 3 2.0"""
b=a.dropna(how="any") #等价于b=a.dropna()
c=a.fillna(value=100)
d=pd.isna(a)
print(b,c,d)
"""
A B C D
2 32 5.0 3 2.0
A B C
first 1 3 3
second 2 4 5
A B C D
0 False False False True
1 False True False False
2 False False False False"""
一些操作(通常排除丢失的数据)
- 执行描述性统计:对列:df.mean()、对行:df.mean(1)
- 将函数应用于数据(申请),如:df.apply(lambda x: x.max() - x.min())
- 直方图:Dataframe.value_counts()
- 字符串方法,如:Dataframe.str.lower()/upper()/len()…
例子代码:
a=pd.DataFrame([[23,3,5,2,6],[2,3,5,6,3],[34,6,5,7,4]])
b=a.mean() #对列进行求均值
c=a.mean(1) #对行进行求均值
d=a.apply(lambda x:x.max()-x.min()) #可以加上axis=参数来改变函数应用方向
print(b,c,d)
"""
A 19.666667
B 4.000000
C 5.000000
D 5.000000
E 4.333333
dtype: float64
0 7.8
1 3.8
2 11.2
dtype: float64
A 32
B 3
C 0
D 5
E 3
dtype: int64"""
e=pd.Series([234,23,5,2,7,4,5,5,6,7,76,6,3,3])
f=e.value_counts() #直方图
print(f)
"""
5 3
6 2
3 2
7 2
76 1
234 1
23 1
4 1
2 1
dtype: int64"""
合并
- pd.concat((Dataframe1/Series1,Dataframe2/Series2…)),将pandas对象(Series、Dataframe)连接在一起
- pd.merge(),SQL样式合并
- df.append(),追加
例子代码:
a=pd.DataFrame([[23,3,5,2,6],[2,3,5,6,3],[34,6,5,7,4]],columns=["A","B","C","D","E"])
b=a.loc[0:1,"A"]
c=a.iloc[0:2,[1,2]]
print(a,"\n\n",b,"\n\n",c)
"""
A B C D E
0 23 3 5 2 6
1 2 3 5 6 3
2 34 6 5 7 4
0 23
1 2
Name: A, dtype: int64
B C
0 3 5
1 3 5"""
print(a.append(a.iloc[1,:],ignore_index=True))
"""
A B C D E
0 23 3 5 2 6
1 2 3 5 6 3
2 34 6 5 7 4
3 2 3 5 6 3"""
d1=pd.concat((b,c))
d2=pd.concat((b,c),axis=1)
print(d1,"\n\n",d2)
"""
0 B C
0 23.0 NaN NaN
1 2.0 NaN NaN
0 NaN 3.0 5.0
1 NaN 3.0 5.0
A B C
0 23 3 5
1 2 3 5"""
e1=pd.DataFrame({"key":["foo","foo"],"nba":[23,69]})
e2=pd.DataFrame({"key":["foo","foo"],"cba":[97,8]})
e=pd.merge(e1,e2,on="key")
f1=pd.DataFrame({"key":["foo","bar"],"nba":[23,69]})
f2=pd.DataFrame({"key":["foo","bar"],"cba":[97,8]})
f=pd.merge(f1,f2,on="key")
print(e,"\n\n",f)
"""
key nba cba
0 foo 23 97
1 foo 23 8
2 foo 69 97
3 foo 69 8
key nba cba
0 foo 23 97
1 bar 69 8"""
分组
例子代码:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
'C': np.random.randn(8),
'D': np.random.randn(8)})
df.groupby("A").sum()
"""
C D
A
bar -1.818018 0.920887
foo -1.083928 2.312676 """
df.groupby(["A","B"]).sum()
"""
C D
A B
bar one 0.211772 -0.468906
three -1.604844 0.910890
two -0.424946 0.478903
foo one -1.435132 1.477868
three -0.333705 0.011274
two 0.684909 0.823534 """
数据透视表
例子代码:
a = pd.DataFrame({"A": ["A", "B", "C"] * 4,
"B": ["one", "two", "three", "four"] * 3,
"C": ["foo", "bar"] * 6,
"D": np.random.randn(12) * 10,
"E": np.random.randn(12) * 10})
b1 = pd.pivot_table(a, values=["D"], index=["A", "B"], columns="C")
b2 = pd.pivot_table(a, values=["D","E"], index=["A", "B"], columns="C")
print(a,"\n\n",b1,"\n\n",b2)
"""
A B C D E
0 A one foo -4.914913 14.486079
1 B two bar -8.004826 -18.946289
2 C three foo -11.676869 -16.178788
3 A four bar 23.877449 -4.409281
4 B one foo 6.695333 -4.126402
5 C two bar 11.015809 -16.389535
6 A three foo 3.195449 2.012969
7 B four bar -0.299516 7.940081
8 C one foo 11.646412 12.938037
9 A two bar -1.989591 10.057017
10 B three foo -8.995876 -2.928453
11 C four bar -2.779193 -12.932097
D
C bar foo
A B
A four 23.877449 NaN
one NaN -4.914913
three NaN 3.195449
two -1.989591 NaN
B four -0.299516 NaN
one NaN 6.695333
three NaN -8.995876
two -8.004826 NaN
C four -2.779193 NaN
one NaN 11.646412
three NaN -11.676869
two 11.015809 NaN
D E
C bar foo bar foo
A B
A four 23.877449 NaN -4.409281 NaN
one NaN -4.914913 NaN 14.486079
three NaN 3.195449 NaN 2.012969
two -1.989591 NaN 10.057017 NaN
B four -0.299516 NaN 7.940081 NaN
one NaN 6.695333 NaN -4.126402
three NaN -8.995876 NaN -2.928453
two -8.004826 NaN -18.946289 NaN
C four -2.779193 NaN -12.932097 NaN
one NaN 11.646412 NaN 12.938037
three NaN -11.676869 NaN -16.178788
two 11.015809 NaN -16.389535 NaN
"""
时间序列
例子代码:
a=pd.date_range("1/2/2019",periods=30,freq="D")
print(a)
DatetimeIndex(['2019-01-02', '2019-01-03', '2019-01-04', '2019-01-05',
'2019-01-06', '2019-01-07', '2019-01-08', '2019-01-09',
'2019-01-10', '2019-01-11', '2019-01-12', '2019-01-13',
'2019-01-14', '2019-01-15', '2019-01-16', '2019-01-17',
'2019-01-18', '2019-01-19', '2019-01-20', '2019-01-21',
'2019-01-22', '2019-01-23', '2019-01-24', '2019-01-25',
'2019-01-26', '2019-01-27', '2019-01-28', '2019-01-29',
'2019-01-30', '2019-01-31'],
dtype='datetime64[ns]', freq='D')
数据读取/导出
pd.read_excel()
pd.read_csv()
Dataframe.to_excel("",sheet_name="")#导出excel
Dataframe.to_csv("") #导出csv