An Introduction to pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# IPython/Jupyter magic: display plots inline (only valid in a notebook or IPython session)
%matplotlib inline
Creating Objects
1. Create a Series by passing a list; pandas creates a default integer index:
s = pd.Series([1,3,5,np.nan,6,8])
s
'''
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
'''
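The float64 dtype in the output above comes from the np.nan entry. A minimal sketch of that effect, plus an explicit index in place of the default one; the values and the labels 'a' to 'c' are made up for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative values only; the labels 'a'..'c' are hypothetical.
ints = pd.Series([1, 3, 5])                            # no missing values: integer dtype
with_nan = pd.Series([1, 3, np.nan])                   # np.nan forces a float dtype
labeled = pd.Series([1, 3, 5], index=['a', 'b', 'c'])  # explicit index instead of 0..n-1

print(ints.dtype, with_nan.dtype)
print(list(labeled.index))
```

An explicit index is useful when later operations need to align values by label rather than by position.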
2. Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
dates = pd.date_range('20130101', periods=6)
dates
'''
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
'''
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
'''
A B C D
2013-01-01 -0.900524 -0.302515 -0.541762 1.562916
2013-01-02 -0.884117 -0.650741 0.217345 0.268915
2013-01-03 0.220822 0.790527 0.692172 0.723441
2013-01-04 1.260276 1.000297 0.809801 -0.389713
2013-01-05 1.679381 1.468609 0.360648 -0.240850
2013-01-06 0.567867 0.235352 1.117395 -0.604326
'''
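# A DataFrame can also be created from a dict of objects that can be converted to Series-like columns;
# scalars are broadcast to every row.
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})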
df2
'''
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
'''
3. The columns of the resulting DataFrame have different dtypes:
df2.dtypes
'''
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
'''
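Because each column keeps its own dtype, columns can be filtered by type. A small sketch, assuming a made-up frame whose column names 'x', 'n' and 'tag' are purely illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two scalars are broadcast to the length of the array column.
frame = pd.DataFrame({
    'x': 1.0,                                     # scalar, repeated on every row
    'n': np.array([10, 20, 30], dtype='int32'),   # sets the number of rows (3)
    'tag': 'foo',                                 # scalar, repeated on every row
})

# Keep only the numeric columns.
numeric = frame.select_dtypes(include='number')
print(frame.dtypes)
print(list(numeric.columns))
```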
Viewing Data
1. View the first and last rows of the frame:
df.head()  # shows the first five rows by default; use df.head(n) for a different number of rows
'''
A B C D
2013-01-01 -0.900524 -0.302515 -0.541762 1.562916
2013-01-02 -0.884117 -0.650741 0.217345 0.268915
2013-01-03 0.220822 0.790527 0.692172 0.723441
2013-01-04 1.260276 1.000297 0.809801 -0.389713
2013-01-05 1.679381 1.468609 0.360648 -0.240850
'''
df.tail(3)
'''
A B C D
2013-01-04 1.260276 1.000297 0.809801 -0.389713
2013-01-05 1.679381 1.468609 0.360648 -0.240850
2013-01-06 0.567867 0.235352 1.117395 -0.604326
'''
2. Display the index, the column labels, and the underlying NumPy data:
df.index
'''
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
'''
df.columns
'''
Index(['A', 'B', 'C', 'D'], dtype='object')
'''
df.values
'''
array([[-0.90052384, -0.30251543, -0.54176245, 1.56291588],
[-0.8841172 , -0.65074073, 0.21734508, 0.26891483],
[ 0.22082238, 0.79052719, 0.69217223, 0.72344092],
[ 1.2602764 , 1.0002968 , 0.80980141, -0.38971272],
[ 1.67938067, 1.46860938, 0.36064787, -0.24084994],
[ 0.56786654, 0.23535244, 1.117395 , -0.60432593]])
'''
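Recent pandas versions also offer DataFrame.to_numpy(), which the pandas documentation recommends over .values. A brief sketch on made-up data (the frames below are not the tutorial's df):

```python
import pandas as pd

# Hypothetical frames used only for this illustration.
floats = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})
arr = floats.to_numpy()          # plain 2-D array; index and column labels are dropped

mixed = pd.DataFrame({'n': [1, 2], 's': ['a', 'b']})
mixed_arr = mixed.to_numpy()     # columns of different types fall back to object dtype

print(arr.shape, arr.dtype)
print(mixed_arr.dtype)
```

A NumPy array holds a single dtype, so converting a frame with mixed column types can be comparatively expensive.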
3. describe() gives a quick statistical summary of the data:
df.describe()
'''
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.323951 0.423588 0.442600 0.220064
std 1.071708 0.809463 0.579465 0.815219
min -0.900524 -0.650741 -0.541762 -0.604326
25% -0.607882 -0.168048 0.253171 -0.352497
50% 0.394344 0.512940 0.526410 0.014032
75% 1.087174 0.947854 0.780394 0.609809
max 1.679381 1.468609 1.117395 1.562916
'''
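The rows of the summary correspond to ordinary Series methods. A minimal sketch on a tiny made-up column 'x', checking a few of them by hand:

```python
import pandas as pd

# Hypothetical single-column frame, small enough to verify by hand.
frame = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})
summary = frame.describe()

manual_mean = frame['x'].mean()
manual_std = frame['x'].std()             # sample standard deviation (ddof=1)
manual_median = frame['x'].quantile(0.5)  # the '50%' row

print(summary)
print(manual_mean, manual_std, manual_median)
```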
4. Transpose the data:
df.T
'''
2013-01-01 00:00:00 2013-01-02 00:00:00 2013-01-03 00:00:00 2013-01-04 00:00:00 2013-01-05 00:00:00 2013-01-06 00:00:00
A -0.900524 -0.884117 0.220822 1.260276 1.679381 0.567867
B -0.302515 -0.650741 0.790527 1.000297 1.468609 0.235352
C -0.541762 0.217345 0.692172 0.809801 0.360648 1.117395
D 1.562916 0.268915 0.723441 -0.389713 -0.240850 -0.604326
'''
5. Sort along an axis by its labels (here, the column labels in descending order):
df.sort_index(axis=1, ascending=False)
'''
D C B A
2013-01-01 1.562916 -0.541762 -0.302515 -0.900524
2013-01-02 0.268915 0.217345 -0.650741 -0.884117
2013-01-03 0.723441 0.692172 0.790527 0.220822
2013-01-04 -0.389713 0.809801 1.000297 1.260276
2013-01-05 -0.240850 0.360648 1.468609 1.679381
2013-01-06 -0.604326 1.117395 0.235352 0.567867
'''
6. Sort by the values of a column:
df.sort_values(by='B')
'''
A B C D
2013-01-02 -0.884117 -0.650741 0.217345 0.268915
2013-01-01 -0.900524 -0.302515 -0.541762 1.562916
2013-01-06 0.567867 0.235352 1.117395 -0.604326
2013-01-03 0.220822 0.790527 0.692172 0.723441
2013-01-04 1.260276 1.000297 0.809801 -0.389713
2013-01-05 1.679381 1.468609 0.360648 -0.240850
'''
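sort_values also accepts several columns and a sort direction per column. A sketch with made-up 'group' and 'score' columns:

```python
import pandas as pd

# Hypothetical data: sort by group ascending, then by score descending within each group.
frame = pd.DataFrame({'group': ['b', 'a', 'b', 'a'], 'score': [2, 4, 1, 3]})
ordered = frame.sort_values(by=['group', 'score'], ascending=[True, False])

print(ordered)
```

Sorting returns a new object; frame itself keeps its original row order.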
Selecting Data
Getting
1. Select a single column, which returns a Series; this is equivalent to df.A:
df['A']
'''
2013-01-01 -0.900524
2013-01-02 -0.884117
2013-01-03 0.220822
2013-01-04 1.260276
2013-01-05 1.679381
2013-01-06 0.567867
Freq: D, Name: A, dtype: float64
'''
2. Slice rows using []:
df[0:3]
'''
A B C D
2013-01-01 -0.900524 -0.302515 -0.541762 1.562916
2013-01-02 -0.884117 -0.650741 0.217345 0.268915
2013-01-03 0.220822 0.790527 0.692172 0.723441
'''
df['20130102':'20130104']
'''
A B C D
2013-01-02 -0.884117 -0.650741 0.217345 0.268915
2013-01-03 0.220822 0.790527 0.692172 0.723441
2013-01-04 1.260276 1.000297 0.809801 -0.389713
'''
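The two slices above follow different rules: an integer slice is positional and excludes its end, while a label slice includes both endpoints. A sketch on a made-up frame, with the explicit .iloc and .loc equivalents:

```python
import pandas as pd

# Hypothetical frame: one column holding the row position, indexed by six dates.
dates = pd.date_range('20130101', periods=6)
frame = pd.DataFrame({'A': list(range(6))}, index=dates)

by_position = frame[0:3]                     # rows 0, 1, 2 (end excluded)
by_label = frame['20130102':'20130104']      # Jan 2 through Jan 4 (end included)

same_position = frame.iloc[0:3]
same_label = frame.loc['20130102':'20130104']

print(by_position['A'].tolist(), by_label['A'].tolist())
```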
Selection by Label
1. Get a cross-section using a label:
df.loc[dates[0]]
'''
A -0.900524
B -0.302515
C -0.541762
D 1.562916
Name: 2013-01-01 00:00:00, dtype: float64
'''
2. Select on multiple axes by label:
df.loc[:,['A','B']]
'''
A B
2013-01-01 -0.900524 -0.302515
2013-01-02 -0.884117 -0.650741
2013-01-03 0.220822 0.790527
2013-01-04 1.260276 1.000297
2013-01-05 1.679381 1.468609
2013-01-06 0.567867 0.235352
'''
df.loc[:,['A','B']][:3]
'''
A B
2013-01-01 -0.900524 -0.302515
2013-01-02 -0.884117 -0.650741
2013-01-03 0.220822 0.790527
'''
3. Slice by label; both endpoints are included:
df.loc['20130102':'20130104',['A','B']]
'''
A B
2013-01-02 -0.884117 -0.650741
2013-01-03 0.220822 0.790527
2013-01-04 1.260276 1.000297
'''
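4. Reduce the dimensions of the returned object (a single row label instead of a slice yields a Series rather than a DataFrame):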
df.loc['20130102',['A','B']]
'''
A -0.884117
B -0.650741
Name: 2013-01-02 00:00:00, dtype: float64
'''
5. Get a scalar value:
df.loc[dates[0],'A']
'''
-0.9005238449408509
'''
6. Get a scalar value with fast access (equivalent to the previous method):
df.at[dates[0],'A']
'''
-0.9005238449408509
'''
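With .loc, the shape of the result depends on whether each selector is a single label or a list. A sketch on a made-up frame with row labels 'r1' and 'r2':

```python
import pandas as pd

# Hypothetical frame used only to show the result types.
frame = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]}, index=['r1', 'r2'])

row = frame.loc['r1']              # single label: a Series (1-D)
sub = frame.loc[['r1'], ['A']]     # lists on both axes: a DataFrame (2-D)
scalar_loc = frame.loc['r2', 'B']  # single label on both axes: a scalar
scalar_at = frame.at['r2', 'B']    # same scalar through the .at accessor

print(type(row).__name__, type(sub).__name__, scalar_loc, scalar_at)
```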
Selection by Position
Select by passing an integer position:
df.iloc[3]
'''
A 1.260276
B 1.000297
C 0.809801
D -0.389713
Name: 2013-01-04 00:00:00, dtype: float64
'''
Select with integer position slices, which behave as in Python/NumPy:
df.iloc[3:5,0:2]
'''
A B
2013-01-04 1.260276 1.000297
2013-01-05 1.679381 1.468609
'''
Slice rows only:
df.iloc[1:3,:]
'''
A B C D
2013-01-02 -0.884117 -0.650741 0.217345 0.268915
2013-01-03 0.220822 0.790527 0.692172 0.723441
'''
Slice columns only:
df.iloc[:,1:3]
'''
B C
2013-01-01 -0.302515 -0.541762
2013-01-02 -0.650741 0.217345
2013-01-03 0.790527 0.692172
2013-01-04 1.000297 0.809801
2013-01-05 1.468609 0.360648
2013-01-06 0.235352 1.117395
'''
Get a single value:
df.iloc[1,1]
'''
-0.6507407272837356
'''
df.iat[1,1]
'''
-0.6507407272837356
'''
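Position-based selection ignores the labels entirely, which is easiest to see on a frame whose values are known in advance. A sketch using made-up data from np.arange:

```python
import numpy as np
import pandas as pd

# Hypothetical 3x4 frame holding 0..11, so every selected value can be checked by hand.
frame = pd.DataFrame(np.arange(12).reshape(3, 4),
                     index=['x', 'y', 'z'], columns=list('ABCD'))

block = frame.iloc[0:2, 1:3]   # rows 0-1 and columns 1-2 (B and C); ends excluded
last_row = frame.iloc[-1]      # negative positions count from the end
cell = frame.iat[1, 1]         # row 1, column 1

print(block)
print(last_row.tolist(), cell)
```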
Boolean Indexing
Use the values of a single column to select rows:
df[df.A > 0]
'''
A B C D
2013-01-03 0.220822 0.790527 0.692172 0.723441
2013-01-04 1.260276 1.000297 0.809801 -0.389713
2013-01-05 1.679381 1.468609 0.360648 -0.240850
2013-01-06 0.567867 0.235352 1.117395 -0.604326
'''
Select values with a where operation (cells that fail the condition become NaN):
df[df > 0]
'''
A B C D
2013-01-01 NaN NaN NaN 1.562916
2013-01-02 NaN NaN 0.217345 0.268915
2013-01-03 0.220822 0.790527 0.692172 0.723441
2013-01-04 1.260276 1.000297 0.809801 NaN
2013-01-05 1.679381 1.468609 0.360648 NaN
2013-01-06 0.567867 0.235352 1.117395 NaN
'''
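Indexing a frame with a boolean frame gives the same result as calling DataFrame.where, which also accepts a replacement value. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical frame with one negative value per column.
frame = pd.DataFrame({'A': [-1.0, 2.0], 'B': [3.0, -4.0]})

masked = frame[frame > 0]               # failing cells become NaN
same = frame.where(frame > 0)           # explicit spelling of the same operation
filled = frame.where(frame > 0, 0.0)    # failing cells replaced by 0.0 instead

print(masked)
print(filled)
```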
Filter rows with the isin() method:
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
'''
A B C D E
2013-01-01 -0.900524 -0.302515 -0.541762 1.562916 one
2013-01-02 -0.884117 -0.650741 0.217345 0.268915 one
2013-01-03 0.220822 0.790527 0.692172 0.723441 two
2013-01-04 1.260276 1.000297 0.809801 -0.389713 three
2013-01-05 1.679381 1.468609 0.360648 -0.240850 four
2013-01-06 0.567867 0.235352 1.117395 -0.604326 three
'''
df2[df2['E'].isin(['two', 'four'])]
'''
A B C D E
2013-01-03 0.220822 0.790527 0.692172 0.723441 two
2013-01-05 1.679381 1.468609 0.360648 -0.240850 four
'''
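The boolean mask returned by isin() can be inverted with ~ to keep the rows that do not match. A sketch on a made-up single-column frame:

```python
import pandas as pd

# Hypothetical frame reusing the labels from the example above.
frame = pd.DataFrame({'E': ['one', 'one', 'two', 'three', 'four', 'three']})

mask = frame['E'].isin(['two', 'four'])
keep = frame[mask]      # rows whose E is 'two' or 'four'
drop = frame[~mask]     # all remaining rows

print(list(keep.index), list(drop.index))
```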
Setting
Assigning a new column automatically aligns the data by index:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102',periods=6))
s1
'''
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
'''
df['F'] = s1
df
'''
A B C D F
2013-01-01 -0.900524 -0.302515 -0.541762 1.562916 NaN
2013-01-02 -0.884117 -0.650741 0.217345 0.268915 1.0
2013-01-03 0.220822 0.790527 0.692172 0.723441 2.0
2013-01-04 1.260276 1.000297 0.809801 -0.389713 3.0
2013-01-05 1.679381 1.468609 0.360648 -0.240850 4.0
2013-01-06 0.567867 0.235352 1.117395 -0.604326 5.0
'''
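In the output above, 2013-01-01 has no matching entry in s1 and becomes NaN, while the 2013-01-07 entry of s1 is dropped because df has no such row. The same alignment on a smaller made-up frame:

```python
import pandas as pd

# Hypothetical labels: 'a' exists only in the frame, 'd' only in the Series.
frame = pd.DataFrame({'A': [10, 20, 30]}, index=['a', 'b', 'c'])
extra = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

frame['F'] = extra   # aligned on the index labels, not on position

print(frame)
```

The new column is float64 because the missing value is stored as NaN.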
Set a value by label:
df.at[dates[0], 'A'] = 0
df
'''
A B C D F
2013-01-01 0.000000 -0.302515 -0.541762 1.562916 NaN
2013-01-02 -0.884117 -0.650741 0.217345 0.268915 1.0
2013-01-03 0.220822 0.790527 0.692172 0.723441 2.0
2013-01-04 1.260276 1.000297 0.809801 -0.389713 3.0
2013-01-05 1.679381 1.468609 0.360648 -0.240850 4.0
2013-01-06 0.567867 0.235352 1.117395 -0.604326 5.0
'''
Set a value by position:
df.iat[0,1] = 0
df
'''
A B C D F
2013-01-01 0.000000 0.000000 -0.541762 1.562916 NaN
2013-01-02 -0.884117 -0.650741 0.217345 0.268915 1.0
2013-01-03 0.220822 0.790527 0.692172 0.723441 2.0
2013-01-04 1.260276 1.000297 0.809801 -0.389713 3.0
2013-01-05 1.679381 1.468609 0.360648 -0.240850 4.0
2013-01-06 0.567867 0.235352 1.117395 -0.604326 5.0
'''
Set values by assigning a NumPy array:
df.loc[:,'D'] = np.array([5] * len(df))
df
'''
A B C D F
2013-01-01 0.000000 0.000000 -0.541762 5 NaN
2013-01-02 -0.884117 -0.650741 0.217345 5 1.0
2013-01-03 0.220822 0.790527 0.692172 5 2.0
2013-01-04 1.260276 1.000297 0.809801 5 3.0
2013-01-05 1.679381 1.468609 0.360648 5 4.0
2013-01-06 0.567867 0.235352 1.117395 5 5.0
'''
Set values with a where operation:
df2 = df.copy()
df2[df2 > 0] = -df2
df2
'''
A B C D F
2013-01-01 0.000000 0.000000 -0.541762 -5 NaN
2013-01-02 -0.884117 -0.650741 -0.217345 -5 -1.0
2013-01-03 -0.220822 -0.790527 -0.692172 -5 -2.0
2013-01-04 -1.260276 -1.000297 -0.809801 -5 -3.0
2013-01-05 -1.679381 -1.468609 -0.360648 -5 -4.0
2013-01-06 -0.567867 -0.235352 -1.117395 -5 -5.0
'''
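The masked assignment above negates only the positive cells, so every non-missing value ends up at or below zero. A sketch on made-up data, compared against -frame.abs(), which gives the same values up to the sign of zero:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a positive, a negative, a zero and a missing value.
frame = pd.DataFrame({'A': [1.0, -2.0, 0.0], 'B': [np.nan, 3.0, -4.0]})

flipped = frame.copy()
flipped[flipped > 0] = -flipped   # only cells where the mask is True are overwritten

alt = -frame.abs()                # alternative spelling; NaN stays NaN

print(flipped)
```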