Pandas入门教程

导包

In [1]: import numpy as np                                                      
In [2]: import pandas as pd 

相关对象创建

传递一个值数组, 创建Series

In [3]: s = pd.Series([1,3,5,np.nan,6,8])                                       
In [4]: s                                                                       
Out[4]: 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

创建DataFrame, 传入NumPy array,datatime索引,列标签ABCD

In [5]: dates = pd.date_range('20190101',periods=6)                             

In [6]: dates                                                                   
Out[6]: 
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06'],
              dtype='datetime64[ns]', freq='D')

In [7]: df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))                                                                        
In [8]: df                                                                      
Out[8]: 
                   A         B         C         D
2019-01-01 -1.040037 -1.655561  0.471708 -0.809498
2019-01-02  0.520758 -1.247085 -0.110527 -0.895074
2019-01-03 -0.608654 -0.236422  0.217145 -0.034633
2019-01-04 -0.473237 -0.809576  0.582323  0.193678
2019-01-05  0.686279 -0.501279  1.012377  0.487575
2019-01-06  0.355412  0.692951  0.194283 -0.666775

创建DataFrame,传入一个字典对象

In [9]: df2 = pd.DataFrame({'A': 1.,
   ...:                     'B': pd.Timestamp('20130102'),
   ...:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   ...:                     'D': np.array([3] * 4, dtype='int32'),
   ...:                     'E': pd.Categorical(["test", "train", "test", "train"]),
   ...:                     'F': 'foo'})
   ...: 

In [10]: df2
Out[10]: 
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

In [11]: df2.dtypes                                                               
Out[11]: 
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

查看数据

# df.head默认获取顶部前五的数据,head(2)可以带参数
In [13]: df.head()                                                                                
Out[13]: 
                   A         B         C         D
2019-01-01 -1.040037 -1.655561  0.471708 -0.809498
2019-01-02  0.520758 -1.247085 -0.110527 -0.895074
2019-01-03 -0.608654 -0.236422  0.217145 -0.034633
2019-01-04 -0.473237 -0.809576  0.582323  0.193678
2019-01-05  0.686279 -0.501279  1.012377  0.487575
# 底部三条数据
In [15]: df.tail(3)                                                                               
Out[15]: 
                   A         B         C         D
2019-01-04 -0.473237 -0.809576  0.582323  0.193678
2019-01-05  0.686279 -0.501279  1.012377  0.487575
2019-01-06  0.355412  0.692951  0.194283 -0.666775

# 查看索引
In [16]: df.index                                                                                 
Out[16]: 
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06'],
              dtype='datetime64[ns]', freq='D')
# 查看列
In [17]: df.columns                                                                               
Out[17]: Index(['A', 'B', 'C', 'D'], dtype='object')  

# DataFrame转换成Numpy arrays,如果DataFrame中有不同的types,这种转换成本较高
In [18]: df.to_numpy()                                                                            
Out[18]: 
array([[-1.04003744, -1.65556121,  0.47170848, -0.80949801],
       [ 0.52075804, -1.24708519, -0.11052667, -0.8950744 ],
       [-0.60865415, -0.23642174,  0.21714495, -0.03463295],
       [-0.47323726, -0.80957612,  0.58232274,  0.19367774],
       [ 0.68627916, -0.50127881,  1.01237733,  0.48757525],
       [ 0.35541162,  0.69295063,  0.19428342, -0.66677523]])  

In [20]: df                                                                                       
Out[20]: 
                   A         B         C         D
2019-01-01 -1.040037 -1.655561  0.471708 -0.809498
2019-01-02  0.520758 -1.247085 -0.110527 -0.895074
2019-01-03 -0.608654 -0.236422  0.217145 -0.034633
2019-01-04 -0.473237 -0.809576  0.582323  0.193678
2019-01-05  0.686279 -0.501279  1.012377  0.487575
2019-01-06  0.355412  0.692951  0.194283 -0.666775

# 行列 数据转换
In [21]: df.T                                                                                     
Out[21]: 
   2019-01-01  2019-01-02  2019-01-03  2019-01-04  2019-01-05  2019-01-06
A   -1.040037    0.520758   -0.608654   -0.473237    0.686279    0.355412
B   -1.655561   -1.247085   -0.236422   -0.809576   -0.501279    0.692951
C    0.471708   -0.110527    0.217145    0.582323    1.012377    0.194283
D   -0.809498   -0.895074   -0.034633    0.193678    0.487575   -0.666775  

# 数据排序
In [23]: df.sort_index(axis=1,ascending=False)                                                    
Out[23]: 
                   D         C         B         A
2019-01-01 -0.809498  0.471708 -1.655561 -1.040037
2019-01-02 -0.895074 -0.110527 -1.247085  0.520758
2019-01-03 -0.034633  0.217145 -0.236422 -0.608654
2019-01-04  0.193678  0.582323 -0.809576 -0.473237
2019-01-05  0.487575  1.012377 -0.501279  0.686279
2019-01-06 -0.666775  0.194283  0.692951  0.355412  
# 通过值排序
In [31]: df.sort_values(by='B',ascending=False)                                          
Out[31]: 
                   A         B         C         D
2019-01-06  0.355412  0.692951  0.194283 -0.666775
2019-01-03 -0.608654 -0.236422  0.217145 -0.034633
2019-01-05  0.686279 -0.501279  1.012377  0.487575
2019-01-04 -0.473237 -0.809576  0.582323  0.193678
2019-01-02  0.520758 -1.247085 -0.110527 -0.895074
2019-01-01 -1.040037 -1.655561  0.471708 -0.809498             

选择

选择单一列
df[‘A’] 或者 df.A,返回的是 pandas.core.series.Series

In [32]: df['A']                                                                         
Out[32]: 
2019-01-01   -1.040037
2019-01-02    0.520758
2019-01-03   -0.608654
2019-01-04   -0.473237
2019-01-05    0.686279
2019-01-06    0.355412
Freq: D, Name: A, dtype: float64

In [33]: df.A                                                                            
Out[33]: 
2019-01-01   -1.040037
2019-01-02    0.520758
2019-01-03   -0.608654
2019-01-04   -0.473237
2019-01-05    0.686279
2019-01-06    0.355412
Freq: D, Name: A, dtype: float64

切片,选行

In [35]: df[0:3]                                                                         
Out[35]: 
                   A         B         C         D
2019-01-01 -1.040037 -1.655561  0.471708 -0.809498
2019-01-02  0.520758 -1.247085 -0.110527 -0.895074
2019-01-03 -0.608654 -0.236422  0.217145 -0.034633

In [37]: df['20190103':'20190105']                                                       
Out[37]: 
                   A         B         C         D
2019-01-03 -0.608654 -0.236422  0.217145 -0.034633
2019-01-04 -0.473237 -0.809576  0.582323  0.193678
2019-01-05  0.686279 -0.501279  1.012377  0.487575

标签选择 df.loc
标签选行

In [38]: dates                                                                                     
Out[38]: 
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06'],
              dtype='datetime64[ns]', freq='D')

In [39]: df.loc[dates[0]]                                                                          
Out[39]: 
A   -1.040037
B   -1.655561
C    0.471708
D   -0.809498
Name: 2019-01-01 00:00:00, dtype: float64

# 通过标签选多轴数据
In [40]: df.loc[:,['A','B']]                                                                       
Out[40]: 
                   A         B
2019-01-01 -1.040037 -1.655561
2019-01-02  0.520758 -1.247085
2019-01-03 -0.608654 -0.236422
2019-01-04 -0.473237 -0.809576
2019-01-05  0.686279 -0.501279
2019-01-06  0.355412  0.692951

In [41]: df.loc['20190102':'20190104',['A','B']]                                                   
Out[41]: 
                   A         B
2019-01-02  0.520758 -1.247085
2019-01-03 -0.608654 -0.236422
2019-01-04 -0.473237 -0.809576

In [42]: df.loc['20190102',['A','B']]                                                              
Out[42]: 
A    0.520758
B   -1.247085
Name: 2019-01-02 00:00:00, dtype: float64
# 获取标量值
In [43]: df.loc[dates[0],'A']                                                                      
Out[43]: -1.0400374366456964
# 快速获取标量值
In [44]: df.at[dates[0],'A']                                                                       
Out[44]: -1.0400374366456964

通过位置选取df.iloc

In [45]: df.iloc[3]                                                                                
Out[45]: 
A   -0.473237
B   -0.809576
C    0.582323
D    0.193678
Name: 2019-01-04 00:00:00, dtype: float64

# 整型切片
In [46]: df.iloc[3:5,0:2]                                                                          
Out[46]: 
                   A         B
2019-01-04 -0.473237 -0.809576
2019-01-05  0.686279 -0.501279

In [47]: df.iloc[[1,2,4],[0,2]]                                                                    
Out[47]: 
                   A         C
2019-01-02  0.520758 -0.110527
2019-01-03 -0.608654  0.217145
2019-01-05  0.686279  1.012377

In [48]: df.iloc[1:3,:]                                                                            
Out[48]: 
                   A         B         C         D
2019-01-02  0.520758 -1.247085 -0.110527 -0.895074
2019-01-03 -0.608654 -0.236422  0.217145 -0.034633

In [49]: df.iloc[:,1:3]                                                                            
Out[49]: 
                   B         C
2019-01-01 -1.655561  0.471708
2019-01-02 -1.247085 -0.110527
2019-01-03 -0.236422  0.217145
2019-01-04 -0.809576  0.582323
2019-01-05 -0.501279  1.012377
2019-01-06  0.692951  0.194283

In [50]: df.iloc[1,1]                                                                              
Out[50]: -1.2470851898310984
# 快速获取标量值
In [51]: df.iat[1,1]                                                                               
Out[51]: -1.2470851898310984

Boolean索引

In [52]: df[df>0]                                                                                  
Out[52]: 
                   A         B         C         D
2019-01-01       NaN       NaN  0.471708       NaN
2019-01-02  0.520758       NaN       NaN       NaN
2019-01-03       NaN       NaN  0.217145       NaN
2019-01-04       NaN       NaN  0.582323  0.193678
2019-01-05  0.686279       NaN  1.012377  0.487575
2019-01-06  0.355412  0.692951  0.194283       NaN

In [53]: df2 = df.copy()                                                                           
In [54]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']                                
In [55]: df2                                                                                       
Out[55]: 
                   A         B         C         D      E
2019-01-01 -1.040037 -1.655561  0.471708 -0.809498    one
2019-01-02  0.520758 -1.247085 -0.110527 -0.895074    one
2019-01-03 -0.608654 -0.236422  0.217145 -0.034633    two
2019-01-04 -0.473237 -0.809576  0.582323  0.193678  three
2019-01-05  0.686279 -0.501279  1.012377  0.487575   four
2019-01-06  0.355412  0.692951  0.194283 -0.666775  three

In [57]: df2[df2['E'].isin(['two','four'])]                                                        
Out[57]: 
                   A         B         C         D     E
2019-01-03 -0.608654 -0.236422  0.217145 -0.034633   two
2019-01-05  0.686279 -0.501279  1.012377  0.487575  four

设置值

设置一个新的列,自动根据索引对齐数据

In [59]: df                                                                                        
Out[59]: 
                   A         B         C         D
2019-01-01 -1.040037 -1.655561  0.471708 -0.809498
2019-01-02  0.520758 -1.247085 -0.110527 -0.895074
2019-01-03 -0.608654 -0.236422  0.217145 -0.034633
2019-01-04 -0.473237 -0.809576  0.582323  0.193678
2019-01-05  0.686279 -0.501279  1.012377  0.487575
2019-01-06  0.355412  0.692951  0.194283 -0.666775

In [60]: s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20190102',periods=6))                   

In [61]: s1                                                                                        
Out[61]: 
2019-01-02    1
2019-01-03    2
2019-01-04    3
2019-01-05    4
2019-01-06    5
2019-01-07    6
Freq: D, dtype: int64

In [62]: df['F'] = s1
In [63]: df                                                                                        
Out[63]: 
                   A         B         C         D    F
2019-01-01 -1.040037 -1.655561  0.471708 -0.809498  NaN
2019-01-02  0.520758 -1.247085 -0.110527 -0.895074  1.0
2019-01-03 -0.608654 -0.236422  0.217145 -0.034633  2.0
2019-01-04 -0.473237 -0.809576  0.582323  0.193678  3.0
2019-01-05  0.686279 -0.501279  1.012377  0.487575  4.0
2019-01-06  0.355412  0.692951  0.194283 -0.666775  5.0
# 通过标签设置值
In [66]: df.at[dates[0],'A'] = 0
# 通过位置设置值
In [68]: df.iat[0,1] = 0 
# 设置Numpy array
In [70]: df.loc[:,'D'] = np.array([5]*len(df)) 
In [71]: df                                                                                        
Out[71]: 
                   A         B         C  D    F
2019-01-01  0.000000  0.000000  0.471708  5  NaN
2019-01-02  0.520758 -1.247085 -0.110527  5  1.0
2019-01-03 -0.608654 -0.236422  0.217145  5  2.0
2019-01-04 -0.473237 -0.809576  0.582323  5  3.0
2019-01-05  0.686279 -0.501279  1.012377  5  4.0
2019-01-06  0.355412  0.692951  0.194283  5  5.0

缺失数据np.nan

在pandas中主要是用np.nan标识缺失值,在默认情况下nan是不参与计算的。
缺失数据
reindex允许你在指定轴上改变索引,它返回的数据的拷贝。

# 根据索引的更改,返回df数据的拷贝
In [81]: df1=df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])                           
In [82]: df1                                                                                       
Out[82]: 
                   A         B         C  D    F   E
2019-01-01  0.000000  0.000000  0.471708  5  NaN NaN
2019-01-02  0.520758 -1.247085 -0.110527  5  1.0 NaN
2019-01-03 -0.608654 -0.236422  0.217145  5  2.0 NaN
2019-01-04 -0.473237 -0.809576  0.582323  5  3.0 NaN

In [84]: df1.loc[dates[0]:dates[1],'E']=1                                                          
In [85]: df1                                                                                       
Out[85]: 
                   A         B         C  D    F    E
2019-01-01  0.000000  0.000000  0.471708  5  NaN  1.0
2019-01-02  0.520758 -1.247085 -0.110527  5  1.0  1.0
2019-01-03 -0.608654 -0.236422  0.217145  5  2.0  NaN
2019-01-04 -0.473237 -0.809576  0.582323  5  3.0  NaN

# 删除任何缺失值的行
In [86]: df1.dropna(how='any')                                                                               
Out[86]: 
                   A         B         C  D    F    E
2019-01-02  0.520758 -1.247085 -0.110527  5  1.0  1.0

# 填充缺失值
In [87]: df1.fillna(value=5)                                                                                 
Out[87]: 
                   A         B         C  D    F    E
2019-01-01  0.000000  0.000000  0.471708  5  5.0  1.0
2019-01-02  0.520758 -1.247085 -0.110527  5  1.0  1.0
2019-01-03 -0.608654 -0.236422  0.217145  5  2.0  5.0
2019-01-04 -0.473237 -0.809576  0.582323  5  3.0  5.0

# 返回boolean mask
In [90]: pd.isna(df1)                                                                                        
Out[90]: 
                A      B      C      D      F      E
2019-01-01  False  False  False  False   True  False
2019-01-02  False  False  False  False  False  False
2019-01-03  False  False  False  False  False   True
2019-01-04  False  False  False  False  False   True

操作符

查看更多操作符

# 平均数
In [91]: df.mean()                                                                                           
Out[91]: 
A    0.080093
B   -0.350235
C    0.394552
D    5.000000
F    3.000000
dtype: float64

# 向下移2行
In [94]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)                                        
In [95]: s                                                                                                   
Out[95]: 
2019-01-01    NaN
2019-01-02    NaN
2019-01-03    1.0
2019-01-04    3.0
2019-01-05    5.0
2019-01-06    NaN
Freq: D, dtype: float64

In [96]: df                                                                                                  
Out[96]: 
                   A         B         C  D    F
2019-01-01  0.000000  0.000000  0.471708  5  NaN
2019-01-02  0.520758 -1.247085 -0.110527  5  1.0
2019-01-03 -0.608654 -0.236422  0.217145  5  2.0
2019-01-04 -0.473237 -0.809576  0.582323  5  3.0
2019-01-05  0.686279 -0.501279  1.012377  5  4.0
2019-01-06  0.355412  0.692951  0.194283  5  5.0

In [97]: df.sub(s,axis='index')                                                                              
Out[97]: 
                   A         B         C    D    F
2019-01-01       NaN       NaN       NaN  NaN  NaN
2019-01-02       NaN       NaN       NaN  NaN  NaN
2019-01-03 -1.608654 -1.236422 -0.782855  4.0  1.0
2019-01-04 -3.473237 -3.809576 -2.417677  2.0  0.0
2019-01-05 -4.313721 -5.501279 -3.987623  0.0 -1.0
2019-01-06       NaN       NaN       NaN  NaN  NaN

Apply

Applying functions to the data:

In [98]: df.apply(np.cumsum)                                                                                 
Out[98]: 
                   A         B         C   D     F
2019-01-01  0.000000  0.000000  0.471708   5   NaN
2019-01-02  0.520758 -1.247085  0.361182  10   1.0
2019-01-03 -0.087896 -1.483507  0.578327  15   3.0
2019-01-04 -0.561133 -2.293083  1.160650  20   6.0
2019-01-05  0.125146 -2.794362  2.173027  25  10.0
2019-01-06  0.480557 -2.101411  2.367310  30  15.0

In [99]: df.apply(lambda x:x.max() - x.min())                                                                
Out[99]: 
A    1.294933
B    1.940036
C    1.122904
D    0.000000
F    4.000000
dtype: float64

Join

类似于SQL风格的merges ,See the Database style joining

In [8]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})                 
In [9]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})                
In [10]: left                                                                        
Out[10]: 
   key  lval
0  foo     1
1  foo     2

In [11]: right                                                                       
Out[11]: 
   key  rval
0  foo     4
1  foo     5

In [12]: pd.merge(left,right,on='key')                                               
Out[12]: 
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

In [13]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [14]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
In [15]: left                                                                        
Out[15]: 
   key  lval
0  foo     1
1  bar     2

In [16]: right                                                                       
Out[16]: 
   key  rval
0  foo     4
1  bar     5

In [17]: pd.merge(left,right,on='key')                                               
Out[17]: 
   key  lval  rval
0  foo     1     4
1  bar     2     5

Append

添加行到 DataFrame ,see the Append

In [3]: df = pd.DataFrame(np.random.randn(8,4),columns=list('ABCD'))            

In [4]: df                                                                      
Out[4]: 
          A         B         C         D
0  1.465956 -0.255432  0.000892  0.711403
1 -1.284182  0.475346 -0.842136 -0.016690
2  1.033265 -0.786159 -1.028520 -2.191931
3  0.704121 -0.191651  0.209047  0.606736
4 -0.282906 -0.095775  0.368705 -0.714109
5 -1.325880 -0.435477  0.850966  0.918388
6 -1.497890  2.836506  0.963163  0.073528
7 -0.226545  1.809605  1.025409  0.208270

In [5]: s = df.iloc[3]                                                          

In [6]: s                                                                       
Out[6]: 
A    0.704121
B   -0.191651
C    0.209047
D    0.606736
Name: 3, dtype: float64

In [7]: df.append(s,ignore_index=True)                                          
Out[7]: 
          A         B         C         D
0  1.465956 -0.255432  0.000892  0.711403
1 -1.284182  0.475346 -0.842136 -0.016690
2  1.033265 -0.786159 -1.028520 -2.191931
3  0.704121 -0.191651  0.209047  0.606736
4 -0.282906 -0.095775  0.368705 -0.714109
5 -1.325880 -0.435477  0.850966  0.918388
6 -1.497890  2.836506  0.963163  0.073528
7 -0.226545  1.809605  1.025409  0.208270
8  0.704121 -0.191651  0.209047  0.606736

读写数据

写csv文件,读csv文件
HDFStores
MS Excel

# 往csv文件写数据
df.to_csv('foo.csv')
# 从csv文件读数据
df = pd.read_csv('foo.csv')
# 读写HDF5
df.to_hdf('foo.h5', 'df')
pd.read_hdf('foo.h5', 'df')

# 读写Excel
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Pandas是一种基于NumPy的数据分析工具,它可以帮助我们对数据进行清洗、编辑和分析等工作。掌握Pandas的常规用法是构建机器学习模型的第一步。首先,我们需要安装Pandas。如果已经安装了Anaconda,可以直接使用Anaconda自带的包管理工具来安装Pandas。如果没有安装Anaconda,可以使用Python自带的包管理工具pip来安装Pandas,命令为pip install pandas。安装完成后,我们可以导入Pandas库并查询相应的版本信息。通常,我们还会导入NumPy库,因为Pandas和NumPy常常结合在一起使用。导入Pandas库的命令为import pandas as pd,导入NumPy库的命令为import numpy as np。要查询Pandas的版本信息,可以使用print(pd.__version__)命令。接下来,我们可以学习Pandas的数据类型,包括Series和DataFrame。Series是一种一维的数据结构,类似于数组或列表,而DataFrame是一种二维的数据结构,类似于表格。在学习Pandas的过程中,我们可以通过导入Excel数据、输出Excel数据、数据概览、数据查看、数据清洗、数据选择、数据排序、数据分组、数据透视、数据合并和数据可视化等操作来熟悉Pandas的用法。\[1\]\[2\]\[3\] #### 引用[.reference_title] - *1* *2* [非常全面的Pandas入门教程](https://blog.youkuaiyun.com/weixin_44489066/article/details/89494395)[target="_blank" data-report-click={"spm":"1018.2226.3001.9630","extra":{"utm_source":"vip_chatgpt_common_search_pc_result","utm_medium":"distribute.pc_search_result.none-task-cask-2~all~insert_cask~default-1-null.142^v91^koosearch_v1,239^v3^insert_chatgpt"}} ] [.reference_item] - *3* [pandas 快速入门教程](https://blog.youkuaiyun.com/down_12345/article/details/105345429)[target="_blank" data-report-click={"spm":"1018.2226.3001.9630","extra":{"utm_source":"vip_chatgpt_common_search_pc_result","utm_medium":"distribute.pc_search_result.none-task-cask-2~all~insert_cask~default-1-null.142^v91^koosearch_v1,239^v3^insert_chatgpt"}} ] [.reference_item] [ .reference_list ]
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值