pandas基础(3)

本文详细介绍Pandas库中数据操作的高级技巧,包括数据重塑、数据透视、时间序列处理及类别数据管理,通过实例演示如何高效地进行数据分析与处理。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

# 设置为 inline 风格
%matplotlib inline
# 包导入
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

数据整形

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
AB
firstsecond
barone0.0720260.422077
two-1.099181-0.354796
bazone1.285500-1.185525
two0.645316-0.660115
fooone0.696443-1.664527
two0.718399-0.154125
quxone-0.7400520.713089
two-0.672748-1.346843
df.loc['bar']
AB
second
one0.0720260.422077
two-1.099181-0.354796
df.loc['bar'].loc['one']
A    0.072026
B    0.422077
Name: one, dtype: float64
stacked = df.stack()
stacked
first  second   
bar    one     A    0.072026
               B    0.422077
       two     A   -1.099181
               B   -0.354796
baz    one     A    1.285500
               B   -1.185525
       two     A    0.645316
               B   -0.660115
foo    one     A    0.696443
               B   -1.664527
       two     A    0.718399
               B   -0.154125
qux    one     A   -0.740052
               B    0.713089
       two     A   -0.672748
               B   -1.346843
dtype: float64
stacked.loc['bar'].loc['one'].loc['A']
0.072026163089430537
stacked.unstack()
AB
firstsecond
barone0.0720260.422077
two-1.099181-0.354796
bazone1.285500-1.185525
two0.645316-0.660115
fooone0.696443-1.664527
two0.718399-0.154125
quxone-0.7400520.713089
two-0.672748-1.346843
stacked.unstack().unstack()
AB
secondonetwoonetwo
first
bar0.072026-1.0991810.422077-0.354796
baz1.2855000.645316-1.185525-0.660115
foo0.6964430.718399-1.664527-0.154125
qux-0.740052-0.6727480.713089-1.346843
stacked.unstack(1)
secondonetwo
first
barA0.072026-1.099181
B0.422077-0.354796
bazA1.2855000.645316
B-1.185525-0.660115
fooA0.6964430.718399
B-1.664527-0.154125
quxA-0.740052-0.672748
B0.713089-1.346843

数据透视表

pivot table/轴向旋转表:跟groupby差不多,是其升级版,不同的列里可以定义不同的函数。

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                    'B' : ['A', 'B', 'C'] * 4,
                    'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                    'D' : np.random.randn(12),
                    'E' : np.random.randn(12)})
df
ABCDE
0oneAfoo1.4775331.557713
1oneBfoo0.0195282.483014
2twoCfoo-0.9124520.409732
3threeAbar0.502807-0.462401
4oneBbar1.709597-1.739413
5oneCbar-0.6581551.302735
6twoAfoo0.0078060.782926
7threeBfoo-0.067922-0.193820
8oneCfoo0.8067130.383870
9oneAbar0.7940170.749756
10twoBbar-0.532554-0.811900
11threeCbar0.4647311.168423
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Cbarfoo
AB
oneA0.7940171.477533
B1.7095970.019528
C-0.6581550.806713
threeA0.502807NaN
BNaN-0.067922
C0.464731NaN
twoANaN0.007806
B-0.532554NaN
CNaN-0.912452
pd.pivot_table(df, values=['E'], index=['A'], columns=['C'])  #一个空里有多个值时显示平均值
E
Cbarfoo
A
one0.1043601.474866
three0.353011-0.193820
two-0.8119000.596329
df[df.A=='one'].groupby('C').mean()
DE
C
bar0.6151530.104360
foo0.7679251.474866

时间序列

pandas 提供了强大的时间序列功能,比如把秒级的股票数据转换为5分钟周期数据等。

rng = pd.date_range('20160301', periods=600, freq='s')
rng
DatetimeIndex(['2016-03-01 00:00:00', '2016-03-01 00:00:01',
               '2016-03-01 00:00:02', '2016-03-01 00:00:03',
               '2016-03-01 00:00:04', '2016-03-01 00:00:05',
               '2016-03-01 00:00:06', '2016-03-01 00:00:07',
               '2016-03-01 00:00:08', '2016-03-01 00:00:09',
               ...
               '2016-03-01 00:09:50', '2016-03-01 00:09:51',
               '2016-03-01 00:09:52', '2016-03-01 00:09:53',
               '2016-03-01 00:09:54', '2016-03-01 00:09:55',
               '2016-03-01 00:09:56', '2016-03-01 00:09:57',
               '2016-03-01 00:09:58', '2016-03-01 00:09:59'],
              dtype='datetime64[ns]', length=600, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts
2016-03-01 00:00:00     34
2016-03-01 00:00:01      4
2016-03-01 00:00:02    382
2016-03-01 00:00:03    164
2016-03-01 00:00:04    178
2016-03-01 00:00:05    421
2016-03-01 00:00:06     34
2016-03-01 00:00:07     71
2016-03-01 00:00:08    316
2016-03-01 00:00:09    201
2016-03-01 00:00:10    214
2016-03-01 00:00:11    443
2016-03-01 00:00:12    185
2016-03-01 00:00:13     79
2016-03-01 00:00:14     38
2016-03-01 00:00:15    465
2016-03-01 00:00:16    309
2016-03-01 00:00:17     93
2016-03-01 00:00:18     20
2016-03-01 00:00:19    338
2016-03-01 00:00:20    149
2016-03-01 00:00:21     34
2016-03-01 00:00:22    257
2016-03-01 00:00:23    462
2016-03-01 00:00:24     41
2016-03-01 00:00:25    471
2016-03-01 00:00:26    313
2016-03-01 00:00:27    224
2016-03-01 00:00:28     78
2016-03-01 00:00:29    498
                      ... 
2016-03-01 00:09:30     61
2016-03-01 00:09:31    315
2016-03-01 00:09:32    388
2016-03-01 00:09:33    391
2016-03-01 00:09:34    263
2016-03-01 00:09:35     11
2016-03-01 00:09:36     61
2016-03-01 00:09:37    400
2016-03-01 00:09:38    109
2016-03-01 00:09:39    135
2016-03-01 00:09:40    267
2016-03-01 00:09:41    248
2016-03-01 00:09:42    469
2016-03-01 00:09:43    155
2016-03-01 00:09:44    284
2016-03-01 00:09:45    168
2016-03-01 00:09:46    228
2016-03-01 00:09:47    244
2016-03-01 00:09:48    442
2016-03-01 00:09:49    450
2016-03-01 00:09:50    226
2016-03-01 00:09:51    370
2016-03-01 00:09:52    192
2016-03-01 00:09:53    325
2016-03-01 00:09:54     82
2016-03-01 00:09:55    154
2016-03-01 00:09:56    285
2016-03-01 00:09:57     22
2016-03-01 00:09:58     48
2016-03-01 00:09:59    171
Freq: S, dtype: int32
ts.resample('2Min').sum()
2016-03-01 00:00:00    28595
2016-03-01 00:02:00    29339
2016-03-01 00:04:00    28991
2016-03-01 00:06:00    30789
2016-03-01 00:08:00    30131
Freq: 2T, dtype: int32

时区表达与转换

rng = pd.date_range('20160301', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
2016-03-01    1.048036
2016-03-02   -1.232093
2016-03-03    0.519777
2016-03-04    0.213931
2016-03-05    0.184069
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC')
ts_utc
2016-03-01 00:00:00+00:00    1.048036
2016-03-02 00:00:00+00:00   -1.232093
2016-03-03 00:00:00+00:00    0.519777
2016-03-04 00:00:00+00:00    0.213931
2016-03-05 00:00:00+00:00    0.184069
Freq: D, dtype: float64
ts_utc.tz_convert('Asia/Shanghai')
2016-03-01 08:00:00+08:00    1.048036
2016-03-02 08:00:00+08:00   -1.232093
2016-03-03 08:00:00+08:00    0.519777
2016-03-04 08:00:00+08:00    0.213931
2016-03-05 08:00:00+08:00    0.184069
Freq: D, dtype: float64

在不同的时间表达方式间转换

rng = pd.date_range('20160301', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2016-03-31    1.869095
2016-04-30   -0.698419
2016-05-31   -0.796308
2016-06-30   -1.624937
2016-07-31    0.118491
Freq: M, dtype: float64
ps = ts.to_period()
ps
2016-03    1.869095
2016-04   -0.698419
2016-05   -0.796308
2016-06   -1.624937
2016-07    0.118491
Freq: M, dtype: float64
ps.to_timestamp()
2016-03-01    1.869095
2016-04-01   -0.698419
2016-05-01   -0.796308
2016-06-01   -1.624937
2016-07-01    0.118491
Freq: MS, dtype: float64
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
prng
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
             '1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
             '1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
             '1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
             '1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
             '1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
             '1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
             '2000Q3', '2000Q4'],
            dtype='int64', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts
1990Q1    0.403397
1990Q2    0.052607
1990Q3    0.683487
1990Q4   -0.871958
1991Q1    0.860426
1991Q2   -1.203006
1991Q3   -0.190948
1991Q4   -1.722338
1992Q1    0.093187
1992Q2   -1.732760
1992Q3   -0.039225
1992Q4    1.814737
1993Q1   -0.201548
1993Q2   -0.550538
1993Q3   -0.409734
1993Q4   -0.615150
1994Q1    1.207771
1994Q2   -0.002279
1994Q3    0.105491
1994Q4   -0.182737
1995Q1   -0.083805
1995Q2    0.174109
1995Q3    0.742054
1995Q4    0.620141
1996Q1   -0.471295
1996Q2   -1.926356
1996Q3   -0.631435
1996Q4   -0.218897
1997Q1   -1.792132
1997Q2    0.844161
1997Q3   -0.745867
1997Q4    0.887393
1998Q1    0.558465
1998Q2    0.523789
1998Q3    0.844993
1998Q4    1.329418
1999Q1   -1.554542
1999Q2    1.627259
1999Q3    1.569094
1999Q4    0.035025
2000Q1    1.668087
2000Q2   -0.845356
2000Q3    0.633963
2000Q4   -0.155322
Freq: Q-NOV, dtype: float64
ts.index
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
             '1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
             '1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
             '1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
             '1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
             '1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
             '1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
             '2000Q3', '2000Q4'],
            dtype='int64', freq='Q-NOV')
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts
1990-03-01 09:00    0.403397
1990-06-01 09:00    0.052607
1990-09-01 09:00    0.683487
1990-12-01 09:00   -0.871958
1991-03-01 09:00    0.860426
1991-06-01 09:00   -1.203006
1991-09-01 09:00   -0.190948
1991-12-01 09:00   -1.722338
1992-03-01 09:00    0.093187
1992-06-01 09:00   -1.732760
1992-09-01 09:00   -0.039225
1992-12-01 09:00    1.814737
1993-03-01 09:00   -0.201548
1993-06-01 09:00   -0.550538
1993-09-01 09:00   -0.409734
1993-12-01 09:00   -0.615150
1994-03-01 09:00    1.207771
1994-06-01 09:00   -0.002279
1994-09-01 09:00    0.105491
1994-12-01 09:00   -0.182737
1995-03-01 09:00   -0.083805
1995-06-01 09:00    0.174109
1995-09-01 09:00    0.742054
1995-12-01 09:00    0.620141
1996-03-01 09:00   -0.471295
1996-06-01 09:00   -1.926356
1996-09-01 09:00   -0.631435
1996-12-01 09:00   -0.218897
1997-03-01 09:00   -1.792132
1997-06-01 09:00    0.844161
1997-09-01 09:00   -0.745867
1997-12-01 09:00    0.887393
1998-03-01 09:00    0.558465
1998-06-01 09:00    0.523789
1998-09-01 09:00    0.844993
1998-12-01 09:00    1.329418
1999-03-01 09:00   -1.554542
1999-06-01 09:00    1.627259
1999-09-01 09:00    1.569094
1999-12-01 09:00    0.035025
2000-03-01 09:00    1.668087
2000-06-01 09:00   -0.845356
2000-09-01 09:00    0.633963
2000-12-01 09:00   -0.155322
Freq: H, dtype: float64

类别数据

Categorical 是 pandas 0.15 版本才加入的新功能。用来表达类别数据。

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df
idraw_grade
01a
12b
23b
34a
45a
56e
df["grade"] = df["raw_grade"].astype("category")
df
idraw_gradegrade
01aa
12bb
23bb
34aa
45aa
56ee
df["grade"].cat.categories
Index([u'a', u'b', u'e'], dtype='object')
df["grade"].cat.categories = ["very good", "good", "very bad"]
df
idraw_gradegrade
01avery good
12bgood
23bgood
34avery good
45avery good
56every bad
df.sort_values(by='grade', ascending=True)
idraw_gradegrade
01avery good
34avery good
45avery good
12bgood
23bgood
56every bad
df.groupby("grade").size()
grade
very good    3
good         2
very bad     1
dtype: int64

数据读写

df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df.head()
ABCD
0-1.052421-0.1649923.098604-0.966960
11.1941770.0868800.4960950.265308
20.2977241.284297-0.130855-0.229570
3-0.7870630.5536800.546853-0.322599
40.033174-1.2222810.320090-1.749333

100 rows × 4 columns

df.to_csv('data.csv')
%ls
 Volume in drive C has no label.
 Volume Serial Number is 78AE-8B3A

 Directory of C:\Users\CNJOHUA10\kamidox\work\pandas_tutor

2016-03-14  11:19 AM    <DIR>          .
2016-03-14  11:19 AM    <DIR>          ..
2015-11-16  09:24 AM               746 .gitignore
2016-03-14  09:53 AM    <DIR>          .ipynb_checkpoints
2016-03-14  11:20 AM             6,466 data.csv
2016-03-02  09:50 AM            15,291 ipython_intro.ipynb
2016-03-02  09:50 AM            32,590 numpy_intro.ipynb
2016-03-14  09:50 AM            80,919 pandas_intro_p1.ipynb
2016-03-14  09:50 AM            67,298 pandas_intro_p2.ipynb
2016-03-14  11:19 AM           108,841 pandas_intro_p3.ipynb
2016-02-24  13:20 PM               111 README.md
               8 File(s)        312,262 bytes
               3 Dir(s)  88,233,422,848 bytes free
# pd.read_csv('data.csv')
pd.read_csv('data.csv', index_col=0)  #显示前10行
ABCD
0-1.052421-0.1649923.098604-0.966960
11.1941770.0868800.4960950.265308
20.2977241.284297-0.130855-0.229570
3-0.7870630.5536800.546853-0.322599
40.033174-1.2222810.320090-1.749333
50.1095750.3106841.620296-0.928869
60.761408-0.0276300.458341-0.785370
7-1.150479-0.7185841.0288660.419026
8-2.906881-0.295700-0.342306-0.765172
90.916363-1.181429-1.559657-1.171191
100.5786590.8047261.2994960.176843

100 rows × 4 columns


评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值