10 minutes of pandas

本文提供了Pandas库的快速入门教程,涵盖了数据结构Series和DataFrame的创建、索引、筛选、排序、数据处理等核心功能,以及如何进行数据读取、合并、分组和透视表操作。

这是网上十分钟入门 pandas 的教程,在此手敲一遍。

ps:这哪是十分钟,tm明明敲了好久,蓝瘦香菇。

首先导入库:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

创建对象

创建一个 series 通过传递值的列表,让 pandas 创建一个整数索引:

s = pd.Series([1,2,3,4,5,np.nan,6])
s

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    NaN
6    6.0
dtype: float64

创建空值一般用 np.nan 。

DataFrame 通过传递带有日期索引和标记列的 numpy 数组来创建:

datas = pd.date_range('20180501',periods=6)
datas
DatetimeIndex(['2018-05-01', '2018-05-02', '2018-05-03', '2018-05-04',
               '2018-05-05', '2018-05-06'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.rand(6,4),index=datas,columns=list('ABCD'))
df
ABCD
2018-05-010.2252790.7357100.0096390.408149
2018-05-020.7050070.9002390.5512070.165471
2018-05-030.6080680.3323450.5190190.181947
2018-05-040.9219580.6260030.9458280.357211
2018-05-050.3044230.8364940.7313510.678947
2018-05-060.3428600.0532110.6707770.186546
  • np.random.randn:正态分布生成随机数
  • np.random.rand:随机分布生成随机数(0-1)之间的数

通过字典类型创建 DataFrame 。

df2 = pd.DataFrame({'A':1.,
                   'B':pd.Timestamp('20180101'),
                   'C':pd.Series(1,index=list(range(4)),dtype='float32'),
                   'D':np.array([3] * 4,dtype='int32'),
                   'E' : pd.Categorical(["test","train","test","train"]),      
                   'F':'foo'})
df2
ABCDEF
01.02018-01-011.03testfoo
11.02018-01-011.03trainfoo
21.02018-01-011.03testfoo
31.02018-01-011.03trainfoo
  • pd.Timestamp:时间戳,相当于 python 中的 datetime
  • pd.Series:dataframe 中的每一列是由 series 组成的。
  • np.array:也可以用 numpy 的数组来生成。
  • pd.Categorical:分类的数值

结果有不同的类型:

df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

查看数据

df.head() # 数据框中的前五条数据
ABCD
2018-05-010.2252790.7357100.0096390.408149
2018-05-020.7050070.9002390.5512070.165471
2018-05-030.6080680.3323450.5190190.181947
2018-05-040.9219580.6260030.9458280.357211
2018-05-050.3044230.8364940.7313510.678947
df.tail(3) # 尾部的三条数据
ABCD
2018-05-040.9219580.6260030.9458280.357211
2018-05-050.3044230.8364940.7313510.678947
2018-05-060.3428600.0532110.6707770.186546

显示索引

df.index # 显示索引
DatetimeIndex(['2018-05-01', '2018-05-02', '2018-05-03', '2018-05-04',
               '2018-05-05', '2018-05-06'],
              dtype='datetime64[ns]', freq='D')
df.columns # 显示列名
Index(['A', 'B', 'C', 'D'], dtype='object')
df.values # 显示值
array([[0.22527927, 0.73570966, 0.00963855, 0.40814939],
       [0.70500688, 0.90023873, 0.55120699, 0.16547071],
       [0.60806789, 0.33234503, 0.51901872, 0.18194742],
       [0.92195751, 0.62600349, 0.94582788, 0.35721069],
       [0.304423  , 0.83649383, 0.73135067, 0.67894676],
       [0.34286019, 0.05321105, 0.67077744, 0.18654591]])

describe() 显示快速统计:

df.describe()
ABCD
count6.0000006.0000006.0000006.000000
mean0.5179320.5806670.5713030.329712
std0.2713820.3266620.3144470.199088
min0.2252790.0532110.0096390.165471
25%0.3140320.4057600.5270660.183097
50%0.4754640.6808570.6109920.271878
75%0.6807720.8112980.7162070.395415
max0.9219580.9002390.9458280.678947
  • count:计数
  • mean:平均值
  • std:标准差
  • min:最小值
  • 25%:较低的百分位数
  • 50%:中位数
  • 75%:较高的百分位数
  • max:最大值

数据转置:

df.T 
2018-05-01 00:00:002018-05-02 00:00:002018-05-03 00:00:002018-05-04 00:00:002018-05-05 00:00:002018-05-06 00:00:00
A0.2252790.7050070.6080680.9219580.3044230.342860
B0.7357100.9002390.3323450.6260030.8364940.053211
C0.0096390.5512070.5190190.9458280.7313510.670777
D0.4081490.1654710.1819470.3572110.6789470.186546

按轴排序:

df.sort_index(axis=1,ascending=False)
DCBA
2018-05-010.4081490.0096390.7357100.225279
2018-05-020.1654710.5512070.9002390.705007
2018-05-030.1819470.5190190.3323450.608068
2018-05-040.3572110.9458280.6260030.921958
2018-05-050.6789470.7313510.8364940.304423
2018-05-060.1865460.6707770.0532110.342860
  • axis:通常所说的按列排序,0,index;1,columns
  • ascending:布尔值,升序还是降序

按值排序:

df.sort_values('B')
ABCD
2018-05-060.3428600.0532110.6707770.186546
2018-05-030.6080680.3323450.5190190.181947
2018-05-040.9219580.6260030.9458280.357211
2018-05-010.2252790.7357100.0096390.408149
2018-05-050.3044230.8364940.7313510.678947
2018-05-020.7050070.9002390.5512070.165471

切片

选择一个列,产生一个 series ,相当于 df.A

df['A']
2018-05-01    0.225279
2018-05-02    0.705007
2018-05-03    0.608068
2018-05-04    0.921958
2018-05-05    0.304423
2018-05-06    0.342860
Freq: D, Name: A, dtype: float64
df.A
2018-05-01    0.225279
2018-05-02    0.705007
2018-05-03    0.608068
2018-05-04    0.921958
2018-05-05    0.304423
2018-05-06    0.342860
Freq: D, Name: A, dtype: float64
df[:3]
ABCD
2018-05-010.2252790.7357100.0096390.408149
2018-05-020.7050070.9002390.5512070.165471
2018-05-030.6080680.3323450.5190190.181947
按标签选择

使用标签获取横截面的数据:

df.loc[datas[0]]
A    0.225279
B    0.735710
C    0.009639
D    0.408149
Name: 2018-05-01 00:00:00, dtype: float64
df.loc[datas[2]]
A    0.608068
B    0.332345
C    0.519019
D    0.181947
Name: 2018-05-03 00:00:00, dtype: float64

按标签选择多轴:

df.loc[:,['A','B']]
AB
2018-05-010.2252790.735710
2018-05-020.7050070.900239
2018-05-030.6080680.332345
2018-05-040.9219580.626003
2018-05-050.3044230.836494
2018-05-060.3428600.053211

两个端点也包括:

df.loc['20180502':'20180504',['A','C']]
AC
2018-05-020.7050070.551207
2018-05-030.6080680.519019
2018-05-040.9219580.945828
df.loc['20180505',['A','C']]
A    0.304423
C    0.731351
Name: 2018-05-05 00:00:00, dtype: float64

获取某一个值:

df.loc[datas[0],'A']
0.22527926638468565

也可以用 at

df.at[datas[0],'A']
0.22527926638468565
按位置选择

通过传递的整数位置选择:

df.iloc[3]
A    0.921958
B    0.626003
C    0.945828
D    0.357211
Name: 2018-05-04 00:00:00, dtype: float64

切片

df.iloc[0:6,0:3]
ABC
2018-05-010.2252790.7357100.009639
2018-05-020.7050070.9002390.551207
2018-05-030.6080680.3323450.519019
2018-05-040.9219580.6260030.945828
2018-05-050.3044230.8364940.731351
2018-05-060.3428600.0532110.670777

通过整数位置位置列表

df.iloc[[0,2,4],[1,3]]
BD
2018-05-010.7357100.408149
2018-05-030.3323450.181947
2018-05-050.8364940.678947
df.iloc[:,1:3]
BC
2018-05-010.7357100.009639
2018-05-020.9002390.551207
2018-05-030.3323450.519019
2018-05-040.6260030.945828
2018-05-050.8364940.731351
2018-05-060.0532110.670777

获取某一个值:

df.iloc[1,1]
0.9002387294615217

布尔索引

df[df>0.2]
ABCD
2018-05-010.2252790.735710NaN0.408149
2018-05-020.7050070.9002390.551207NaN
2018-05-030.6080680.3323450.519019NaN
2018-05-040.9219580.6260030.9458280.357211
2018-05-050.3044230.8364940.7313510.678947
2018-05-060.342860NaN0.670777NaN

使用 isin() 过滤:

df2 = df.copy()
df2['E'] = ['one','two','three','four','five','six']
df2
ABCDE
2018-05-010.2252790.7357100.0096390.408149one
2018-05-020.7050070.9002390.5512070.165471two
2018-05-030.6080680.3323450.5190190.181947three
2018-05-040.9219580.6260030.9458280.357211four
2018-05-050.3044230.8364940.7313510.678947five
2018-05-060.3428600.0532110.6707770.186546six
df2['E'].isin(['one','five'])
2018-05-01     True
2018-05-02    False
2018-05-03    False
2018-05-04    False
2018-05-05     True
2018-05-06    False
Freq: D, Name: E, dtype: bool
df2[df2['E'].isin(['one','five'])]
ABCDE
2018-05-010.2252790.7357100.0096390.408149one
2018-05-050.3044230.8364940.7313510.678947five
自定义设置值
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20180501',periods=6))
s1
2018-05-01    1
2018-05-02    2
2018-05-03    3
2018-05-04    4
2018-05-05    5
2018-05-06    6
Freq: D, dtype: int64
df['F'] = s1

按照标签设置值

df.loc[datas[0],'A'] = 0

按照位置设置值

df.iloc[0,1] = 0

看一下操作的结果:

df
ABCDF
2018-05-010.0000000.0000000.0096390.4081491
2018-05-020.7050070.9002390.5512070.1654712
2018-05-030.6080680.3323450.5190190.1819473
2018-05-040.9219580.6260030.9458280.3572114
2018-05-050.3044230.8364940.7313510.6789475
2018-05-060.3428600.0532110.6707770.1865466

缺失的数据

pandas 主要使用 np.nan 来表示缺失的数据,默认不包含在计算中。

重建索引允许您更改/添加/删除指定轴上的索引。这将返回数据的副本。

df1 = df.reindex(index=datas[0:4],columns=list(df.columns) + ['E'])
df1.loc[datas[0]:datas[1],'E'] = 1
df1
ABCDFE
2018-05-010.0000000.0000000.0096390.40814911.0
2018-05-020.7050070.9002390.5512070.16547121.0
2018-05-030.6080680.3323450.5190190.1819473NaN
2018-05-040.9219580.6260030.9458280.3572114NaN

删除任何缺少数据的行:

df1.dropna()
ABCDFE
2018-05-010.0000000.0000000.0096390.40814911.0
2018-05-020.7050070.9002390.5512070.16547121.0

填充数据:

df1.fillna(value=5)
ABCDFE
2018-05-010.0000000.0000000.0096390.40814911.0
2018-05-020.7050070.9002390.5512070.16547121.0
2018-05-030.6080680.3323450.5190190.18194735.0
2018-05-040.9219580.6260030.9458280.35721145.0

判断是否为空,返回布尔值

df1.isna()
ABCDFE
2018-05-01FalseFalseFalseFalseFalseFalse
2018-05-02FalseFalseFalseFalseFalseFalse
2018-05-03FalseFalseFalseFalseFalseTrue
2018-05-04FalseFalseFalseFalseFalseTrue

一些函数的操作

df.mean() # 列统计
A    0.480386
B    0.458049
C    0.571303
D    0.329712
F    3.500000
dtype: float64
df.mean(1) # 行统计
2018-05-01    0.283558
2018-05-02    0.864385
2018-05-03    0.928276
2018-05-04    1.370200
2018-05-05    1.510243
2018-05-06    1.450679
Freq: D, dtype: float64
df.apply(np.cumsum)
ABCDF
2018-05-010.0000000.0000000.0096390.4081491
2018-05-020.7050070.9002390.5608460.5736203
2018-05-031.3130751.2325841.0798640.7555686
2018-05-042.2350321.8585872.0256921.11277810
2018-05-052.5394552.6950812.7570431.79172515
2018-05-062.8823152.7482923.4278201.97827121

这会应用到所有的数据框中。

直方图

s = pd.Series(np.random.randint(1,7,size=9))
s
0    6
1    1
2    1
3    5
4    1
5    3
6    2
7    1
8    1
dtype: int32
s.value_counts()
1    5
6    1
5    1
3    1
2    1
dtype: int64
字符串方法
s = pd.Series(['A','B','C','Al',np.nan,'dOg'])
s.str.lower()
0      a
1      b
2      c
3     al
4    NaN
5    dog
dtype: object
合并
s1 = pd.Series(['a','b'])
s2 = pd.Series(['c','d'])
pd.concat([s1,s2])

0    a
1    b
0    c
1    d
dtype: object

通过将 ignore_index 选项设置为,清除现有索引并在结果中重置它 True 。

pd.concat([s1,s2],ignore_index=True)
0    a
1    b
2    c
3    d
dtype: object

使用该 keys 选项在数据的最外层添加分层索引

pd.concat([s1,s2],keys=['s1','s2'])
s1  0    a
    1    b
s2  0    c
    1    d
dtype: object
pd.concat([s1,s2],keys=['s1','s2'],ignore_index=True)
0    a
1    b
2    c
3    d
dtype: object
合并
left = pd.DataFrame({'key':['foo','bar'],'lval':[1,2]})
right = pd.DataFrame({'key':['foo','bar'],'rval':[3,4]})
left
keylval
0foo1
1bar2
right
keyrval
0foo3
1bar4
pd.merge(left,right)
keylvalrval
0foo13
1bar24
追加

将行追加到数据框:

df = pd.DataFrame(np.random.rand(4,4),columns=['A','B','C','D'])
df
ABCD
00.3042660.5581590.6998050.964887
10.0832970.2289680.8256720.483591
20.4970660.2037180.8949970.830234
30.0011100.3232480.0663820.074556
s = df.iloc[3]
s
A    0.001110
B    0.323248
C    0.066382
D    0.074556
Name: 3, dtype: float64
df.append(s,ignore_index=True)
ABCD
00.3042660.5581590.6998050.964887
10.0832970.2289680.8256720.483591
20.4970660.2037180.8949970.830234
30.0011100.3232480.0663820.074556
40.0011100.3232480.0663820.074556

有点像列表添加元素。

分组

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ....:                           'foo', 'bar', 'foo', 'foo'],
   ....:                    'B' : ['one', 'one', 'two', 'three',
   ....:                           'two', 'two', 'one', 'three'],
   ....:                    'C' : np.random.randn(8),
   ....:                    'D' : np.random.randn(8)})
   ....: 
df
ABCD
0fooone0.015713-0.276890
1barone0.5635660.089973
2footwo-1.2036562.242553
3barthree-0.254199-1.358523
4footwo-0.6254210.252078
5bartwo0.461810-2.049906
6fooone-1.2721690.447615
7foothree-0.1007210.131472
df.groupby('A').sum()
CD
A
bar0.771176-3.318457
foo-3.1862542.796829
df.groupby(['A','B']).sum()
CD
AB
barone0.5635660.089973
three-0.254199-1.358523
two0.461810-2.049906
fooone-1.2564560.170726
three-0.1007210.131472
two-1.8290772.494631

数据透视表

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
   .....:                    'B' : ['A', 'B', 'C'] * 4,
   .....:                    'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
   .....:                    'D' : np.random.randn(12),
   .....:                    'E' : np.random.randn(12)})
df
ABCDE
0oneAfoo0.167128-0.501473
1oneBfoo-1.218322-0.875397
2twoCfoo-1.327522-1.608971
3threeAbar-0.917783-0.537453
4oneBbar-0.803415-0.129088
5oneCbar-0.6865710.123554
6twoAfoo0.0515451.138850
7threeBfoo0.1386660.396274
8oneCfoo0.8401120.820482
9oneAbar0.452267-1.411540
10twoBbar-1.0002971.037715
11threeCbar2.481947-1.184744
pd.pivot_table(df,values='D',index=['A','B'],columns=['C'])
Cbarfoo
AB
oneA0.4522670.167128
B-0.803415-1.218322
C-0.6865710.840112
threeA-0.917783NaN
BNaN0.138666
C2.481947NaN
twoANaN0.051545
B-1.000297NaN
CNaN-1.327522

数据读取

df.read_csv() # 从CSV读取数据

df.to_excel('foo.xlsx', sheet_name='Sheet1') # 从excel读取数据


评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值