导包
In [1]: import numpy as np
In [2]: import pandas as pd
相关对象创建
传递一个值数组, 创建Series
In [3]: s = pd.Series([1,3,5,np.nan,6,8])
In [4]: s
Out[4]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
创建DataFrame, 传入NumPy array,datatime索引,列标签ABCD
In [5]: dates = pd.date_range('20190101',periods=6)
In [6]: dates
Out[6]:
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
'2019-01-05', '2019-01-06'],
dtype='datetime64[ns]', freq='D')
In [7]: df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
In [8]: df
Out[8]:
A B C D
2019-01-01 -1.040037 -1.655561 0.471708 -0.809498
2019-01-02 0.520758 -1.247085 -0.110527 -0.895074
2019-01-03 -0.608654 -0.236422 0.217145 -0.034633
2019-01-04 -0.473237 -0.809576 0.582323 0.193678
2019-01-05 0.686279 -0.501279 1.012377 0.487575
2019-01-06 0.355412 0.692951 0.194283 -0.666775
创建DataFrame,传入一个字典对象
In [9]: df2 = pd.DataFrame({'A': 1.,
...: 'B': pd.Timestamp('20130102'),
...: 'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...: 'D': np.array([3] * 4, dtype='int32'),
...: 'E': pd.Categorical(["test", "train", "test", "train"]),
...: 'F': 'foo'})
...:
In [10]: df2
Out[10]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
In [11]: df2.dtypes
Out[11]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
查看数据
# df.head默认获取顶部前五的数据,head(2)可以带参数
In [13]: df.head()
Out[13]:
A B C D
2019-01-01 -1.040037 -1.655561 0.471708 -0.809498
2019-01-02 0.520758 -1.247085 -0.110527 -0.895074
2019-01-03 -0.608654 -0.236422 0.217145 -0.034633
2019-01-04 -0.473237 -0.809576 0.582323 0.193678
2019-01-05 0.686279 -0.501279 1.012377 0.487575
# 底部三条数据
In [15]: df.tail(3)
Out[15]:
A B C D
2019-01-04 -0.473237 -0.809576 0.582323 0.193678
2019-01-05 0.686279 -0.501279 1.012377 0.487575
2019-01-06 0.355412 0.692951 0.194283 -0.666775
# 查看索引
In [16]: df.index
Out[16]:
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
'2019-01-05', '2019-01-06'],
dtype='datetime64[ns]', freq='D')
# 查看列
In [17]: df.columns
Out[17]: Index(['A', 'B', 'C', 'D'], dtype='object')
# DataFrame转换成Numpy arrays,如果DataFrame中有不同的types,这种转换成本较高
In [18]: df.to_numpy()
Out[18]:
array([[-1.04003744, -1.65556121, 0.47170848, -0.80949801],
[ 0.52075804, -1.24708519, -0.11052667, -0.8950744 ],
[-0.60865415, -0.23642174, 0.21714495, -0.03463295],
[-0.47323726, -0.80957612, 0.58232274, 0.19367774],
[ 0.68627916, -0.50127881, 1.01237733, 0.48757525],
[ 0.35541162, 0.69295063, 0.19428342, -0.66677523]])
In [20]: df
Out[20]:
A B C D
2019-01-01 -1.040037 -1.655561 0.471708 -0.809498
2019-01-02 0.520758 -1.247085 -0.110527 -0.895074
2019-01-03 -0.608654 -0.236422 0.217145 -0.034633
2019-01-04 -0.473237 -0.809576 0.582323 0.193678
2019-01-05 0.686279 -0.501279 1.012377 0.487575
2019-01-06 0.355412 0.692951 0.194283 -0.666775
# 行列 数据转换
In [21]: df.T
Out[21]:
2019-01-01 2019-01-02 2019-01-03 2019-01-04 2019-01-05 2019-01-06
A -1.040037 0.520758 -0.608654 -0.473237 0.686279 0.355412
B -1.655561 -1.247085 -0.236422 -0.809576 -0.501279 0.692951
C 0.471708 -0.110527 0.217145 0.582323 1.012377 0.194283
D -0.809498 -0.895074 -0.034633 0.193678 0.487575 -0.666775
# 数据排序
In [23]: df.sort_index(axis=1,ascending=False)
Out[23]:
D C B A
2019-01-01 -0.809498 0.471708 -1.655561 -1.040037
2019-01-02 -0.895074 -0.110527 -1.247085 0.520758
2019-01-03 -0.034633 0.217145 -0.236422 -0.608654
2019-01-04 0.193678 0.582323 -0.809576 -0.473237
2019-01-05 0.487575 1.012377 -0.501279 0.686279
2019-01-06 -0.666775 0.194283 0.692951 0.355412
# 通过值排序
In [31]: df.sort_values(by='B',ascending=False)
Out[31]:
A B C D
2019-01-06 0.355412 0.692951 0.194283 -0.666775
2019-01-03 -0.608654 -0.236422 0.217145 -0.034633
2019-01-05 0.686279 -0.501279 1.012377 0.487575
2019-01-04 -0.473237 -0.809576 0.582323 0.193678
2019-01-02 0.520758 -1.247085 -0.110527 -0.895074
2019-01-01 -1.040037 -1.655561 0.471708 -0.809498
选择
选择单一列
df[‘A’] 或者 df.A,返回的是 pandas.core.series.Series
In [32]: df['A']
Out[32]:
2019-01-01 -1.040037
2019-01-02 0.520758
2019-01-03 -0.608654
2019-01-04 -0.473237
2019-01-05 0.686279
2019-01-06 0.355412
Freq: D, Name: A, dtype: float64
In [33]: df.A
Out[33]:
2019-01-01 -1.040037
2019-01-02 0.520758
2019-01-03 -0.608654
2019-01-04 -0.473237
2019-01-05 0.686279
2019-01-06 0.355412
Freq: D, Name: A, dtype: float64
切片,选行
In [35]: df[0:3]
Out[35]:
A B C D
2019-01-01 -1.040037 -1.655561 0.471708 -0.809498
2019-01-02 0.520758 -1.247085 -0.110527 -0.895074
2019-01-03 -0.608654 -0.236422 0.217145 -0.034633
In [37]: df['20190103':'20190105']
Out[37]:
A B C D
2019-01-03 -0.608654 -0.236422 0.217145 -0.034633
2019-01-04 -0.473237 -0.809576 0.582323 0.193678
2019-01-05 0.686279 -0.501279 1.012377 0.487575
标签选择 df.loc
标签选行
In [38]: dates
Out[38]:
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
'2019-01-05', '2019-01-06'],
dtype='datetime64[ns]', freq='D')
In [39]: df.loc[dates[0]]
Out[39]:
A -1.040037
B -1.655561
C 0.471708
D -0.809498
Name: 2019-01-01 00:00:00, dtype: float64
# 通过标签选多轴数据
In [40]: df.loc[:,['A','B']]
Out[40]:
A B
2019-01-01 -1.040037 -1.655561
2019-01-02 0.520758 -1.247085
2019-01-03 -0.608654 -0.236422
2019-01-04 -0.473237 -0.809576
2019-01-05 0.686279 -0.501279
2019-01-06 0.355412 0.692951
In [41]: df.loc['20190102':'20190104',['A','B']]
Out[41]:
A B
2019-01-02 0.520758 -1.247085
2019-01-03 -0.608654 -0.236422
2019-01-04 -0.473237 -0.809576
In [42]: df.loc['20190102',['A','B']]
Out[42]:
A 0.520758
B -1.247085
Name: 2019-01-02 00:00:00, dtype: float64
# 获取标量值
In [43]: df.loc[dates[0],'A']
Out[43]: -1.0400374366456964
# 快速获取标量值
In [44]: df.at[dates[0],'A']
Out[44]: -1.0400374366456964
通过位置选取df.iloc
In [45]: df.iloc[3]
Out[45]:
A -0.473237
B -0.809576
C 0.582323
D 0.193678
Name: 2019-01-04 00:00:00, dtype: float64
# 整型切片
In [46]: df.iloc[3:5,0:2]
Out[46]:
A B
2019-01-04 -0.473237 -0.809576
2019-01-05 0.686279 -0.501279
In [47]: df.iloc[[1,2,4],[0,2]]
Out[47]:
A C
2019-01-02 0.520758 -0.110527
2019-01-03 -0.608654 0.217145
2019-01-05 0.686279 1.012377
In [48]: df.iloc[1:3,:]
Out[48]:
A B C D
2019-01-02 0.520758 -1.247085 -0.110527 -0.895074
2019-01-03 -0.608654 -0.236422 0.217145 -0.034633
In [49]: df.iloc[:,1:3]
Out[49]:
B C
2019-01-01 -1.655561 0.471708
2019-01-02 -1.247085 -0.110527
2019-01-03 -0.236422 0.217145
2019-01-04 -0.809576 0.582323
2019-01-05 -0.501279 1.012377
2019-01-06 0.692951 0.194283
In [50]: df.iloc[1,1]
Out[50]: -1.2470851898310984
# 快速获取标量值
In [51]: df.iat[1,1]
Out[51]: -1.2470851898310984
Boolean索引
In [52]: df[df>0]
Out[52]:
A B C D
2019-01-01 NaN NaN 0.471708 NaN
2019-01-02 0.520758 NaN NaN NaN
2019-01-03 NaN NaN 0.217145 NaN
2019-01-04 NaN NaN 0.582323 0.193678
2019-01-05 0.686279 NaN 1.012377 0.487575
2019-01-06 0.355412 0.692951 0.194283 NaN
In [53]: df2 = df.copy()
In [54]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
In [55]: df2
Out[55]:
A B C D E
2019-01-01 -1.040037 -1.655561 0.471708 -0.809498 one
2019-01-02 0.520758 -1.247085 -0.110527 -0.895074 one
2019-01-03 -0.608654 -0.236422 0.217145 -0.034633 two
2019-01-04 -0.473237 -0.809576 0.582323 0.193678 three
2019-01-05 0.686279 -0.501279 1.012377 0.487575 four
2019-01-06 0.355412 0.692951 0.194283 -0.666775 three
In [57]: df2[df2['E'].isin(['two','four'])]
Out[57]:
A B C D E
2019-01-03 -0.608654 -0.236422 0.217145 -0.034633 two
2019-01-05 0.686279 -0.501279 1.012377 0.487575 four
设置值
设置一个新的列,自动根据索引对齐数据
In [59]: df
Out[59]:
A B C D
2019-01-01 -1.040037 -1.655561 0.471708 -0.809498
2019-01-02 0.520758 -1.247085 -0.110527 -0.895074
2019-01-03 -0.608654 -0.236422 0.217145 -0.034633
2019-01-04 -0.473237 -0.809576 0.582323 0.193678
2019-01-05 0.686279 -0.501279 1.012377 0.487575
2019-01-06 0.355412 0.692951 0.194283 -0.666775
In [60]: s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20190102',periods=6))
In [61]: s1
Out[61]:
2019-01-02 1
2019-01-03 2
2019-01-04 3
2019-01-05 4
2019-01-06 5
2019-01-07 6
Freq: D, dtype: int64
In [62]: df['F'] = s1
In [63]: df
Out[63]:
A B C D F
2019-01-01 -1.040037 -1.655561 0.471708 -0.809498 NaN
2019-01-02 0.520758 -1.247085 -0.110527 -0.895074 1.0
2019-01-03 -0.608654 -0.236422 0.217145 -0.034633 2.0
2019-01-04 -0.473237 -0.809576 0.582323 0.193678 3.0
2019-01-05 0.686279 -0.501279 1.012377 0.487575 4.0
2019-01-06 0.355412 0.692951 0.194283 -0.666775 5.0
# 通过标签设置值
In [66]: df.at[dates[0],'A'] = 0
# 通过位置设置值
In [68]: df.iat[0,1] = 0
# 设置Numpy array
In [70]: df.loc[:,'D'] = np.array([5]*len(df))
In [71]: df
Out[71]:
A B C D F
2019-01-01 0.000000 0.000000 0.471708 5 NaN
2019-01-02 0.520758 -1.247085 -0.110527 5 1.0
2019-01-03 -0.608654 -0.236422 0.217145 5 2.0
2019-01-04 -0.473237 -0.809576 0.582323 5 3.0
2019-01-05 0.686279 -0.501279 1.012377 5 4.0
2019-01-06 0.355412 0.692951 0.194283 5 5.0
缺失数据np.nan
在pandas中主要是用np.nan标识缺失值,在默认情况下nan是不参与计算的。
缺失数据
reindex允许你在指定轴上改变索引,它返回的数据的拷贝。
# 根据索引的更改,返回df数据的拷贝
In [81]: df1=df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])
In [82]: df1
Out[82]:
A B C D F E
2019-01-01 0.000000 0.000000 0.471708 5 NaN NaN
2019-01-02 0.520758 -1.247085 -0.110527 5 1.0 NaN
2019-01-03 -0.608654 -0.236422 0.217145 5 2.0 NaN
2019-01-04 -0.473237 -0.809576 0.582323 5 3.0 NaN
In [84]: df1.loc[dates[0]:dates[1],'E']=1
In [85]: df1
Out[85]:
A B C D F E
2019-01-01 0.000000 0.000000 0.471708 5 NaN 1.0
2019-01-02 0.520758 -1.247085 -0.110527 5 1.0 1.0
2019-01-03 -0.608654 -0.236422 0.217145 5 2.0 NaN
2019-01-04 -0.473237 -0.809576 0.582323 5 3.0 NaN
# 删除任何缺失值的行
In [86]: df1.dropna(how='any')
Out[86]:
A B C D F E
2019-01-02 0.520758 -1.247085 -0.110527 5 1.0 1.0
# 填充缺失值
In [87]: df1.fillna(value=5)
Out[87]:
A B C D F E
2019-01-01 0.000000 0.000000 0.471708 5 5.0 1.0
2019-01-02 0.520758 -1.247085 -0.110527 5 1.0 1.0
2019-01-03 -0.608654 -0.236422 0.217145 5 2.0 5.0
2019-01-04 -0.473237 -0.809576 0.582323 5 3.0 5.0
# 返回boolean mask
In [90]: pd.isna(df1)
Out[90]:
A B C D F E
2019-01-01 False False False False True False
2019-01-02 False False False False False False
2019-01-03 False False False False False True
2019-01-04 False False False False False True
操作符
# 平均数
In [91]: df.mean()
Out[91]:
A 0.080093
B -0.350235
C 0.394552
D 5.000000
F 3.000000
dtype: float64
# 向下移2行
In [94]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
In [95]: s
Out[95]:
2019-01-01 NaN
2019-01-02 NaN
2019-01-03 1.0
2019-01-04 3.0
2019-01-05 5.0
2019-01-06 NaN
Freq: D, dtype: float64
In [96]: df
Out[96]:
A B C D F
2019-01-01 0.000000 0.000000 0.471708 5 NaN
2019-01-02 0.520758 -1.247085 -0.110527 5 1.0
2019-01-03 -0.608654 -0.236422 0.217145 5 2.0
2019-01-04 -0.473237 -0.809576 0.582323 5 3.0
2019-01-05 0.686279 -0.501279 1.012377 5 4.0
2019-01-06 0.355412 0.692951 0.194283 5 5.0
In [97]: df.sub(s,axis='index')
Out[97]:
A B C D F
2019-01-01 NaN NaN NaN NaN NaN
2019-01-02 NaN NaN NaN NaN NaN
2019-01-03 -1.608654 -1.236422 -0.782855 4.0 1.0
2019-01-04 -3.473237 -3.809576 -2.417677 2.0 0.0
2019-01-05 -4.313721 -5.501279 -3.987623 0.0 -1.0
2019-01-06 NaN NaN NaN NaN NaN
Apply
Applying functions to the data:
In [98]: df.apply(np.cumsum)
Out[98]:
A B C D F
2019-01-01 0.000000 0.000000 0.471708 5 NaN
2019-01-02 0.520758 -1.247085 0.361182 10 1.0
2019-01-03 -0.087896 -1.483507 0.578327 15 3.0
2019-01-04 -0.561133 -2.293083 1.160650 20 6.0
2019-01-05 0.125146 -2.794362 2.173027 25 10.0
2019-01-06 0.480557 -2.101411 2.367310 30 15.0
In [99]: df.apply(lambda x:x.max() - x.min())
Out[99]:
A 1.294933
B 1.940036
C 1.122904
D 0.000000
F 4.000000
dtype: float64
Join
类似于SQL风格的merges ,See the Database style joining
In [8]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [9]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [10]: left
Out[10]:
key lval
0 foo 1
1 foo 2
In [11]: right
Out[11]:
key rval
0 foo 4
1 foo 5
In [12]: pd.merge(left,right,on='key')
Out[12]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
In [13]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [14]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
In [15]: left
Out[15]:
key lval
0 foo 1
1 bar 2
In [16]: right
Out[16]:
key rval
0 foo 4
1 bar 5
In [17]: pd.merge(left,right,on='key')
Out[17]:
key lval rval
0 foo 1 4
1 bar 2 5
Append
添加行到 DataFrame ,see the Append
In [3]: df = pd.DataFrame(np.random.randn(8,4),columns=list('ABCD'))
In [4]: df
Out[4]:
A B C D
0 1.465956 -0.255432 0.000892 0.711403
1 -1.284182 0.475346 -0.842136 -0.016690
2 1.033265 -0.786159 -1.028520 -2.191931
3 0.704121 -0.191651 0.209047 0.606736
4 -0.282906 -0.095775 0.368705 -0.714109
5 -1.325880 -0.435477 0.850966 0.918388
6 -1.497890 2.836506 0.963163 0.073528
7 -0.226545 1.809605 1.025409 0.208270
In [5]: s = df.iloc[3]
In [6]: s
Out[6]:
A 0.704121
B -0.191651
C 0.209047
D 0.606736
Name: 3, dtype: float64
In [7]: df.append(s,ignore_index=True)
Out[7]:
A B C D
0 1.465956 -0.255432 0.000892 0.711403
1 -1.284182 0.475346 -0.842136 -0.016690
2 1.033265 -0.786159 -1.028520 -2.191931
3 0.704121 -0.191651 0.209047 0.606736
4 -0.282906 -0.095775 0.368705 -0.714109
5 -1.325880 -0.435477 0.850966 0.918388
6 -1.497890 2.836506 0.963163 0.073528
7 -0.226545 1.809605 1.025409 0.208270
8 0.704121 -0.191651 0.209047 0.606736
读写数据
写csv文件,读csv文件
HDFStores
MS Excel
# 往csv文件写数据
df.to_csv('foo.csv')
# 从csv文件读数据
df = pd.read_csv('foo.csv')
# 读写HDF5
df.to_hdf('foo.h5', 'df')
pd.read_hdf('foo.h5', 'df')
# 读写Excel
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])