Pandas Learning

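This article introduces basic usage of the pandas library, including creating, viewing, selecting from, and operating on the Series and DataFrame data structures, as well as handling missing data and grouped operations. It is a practical introductory guide to data analysis in Python.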

Link to the pandas website

Object Creation

import pandas as pd
import numpy as np
s = pd.Series([1, 3, 5, np.nan, 6, 8])

A Series is similar to a Python list, but every value carries an index label and all values share one dtype.

s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
s[1]
3.0
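
The list comparison only goes so far. A minimal sketch of where a Series behaves differently from a list; the names `doubled`, `total`, and `labeled` are illustrative and not part of the original post:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# With the default 0..5 index, s[1] looks up the label 1
second = s[1]

# Unlike a list, arithmetic is element-wise, and NaN stays NaN
doubled = s * 2

# Reductions skip NaN by default
total = s.sum()

# A Series can carry arbitrary labels instead of integer positions
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])
b_value = labeled["b"]
```

With a list, `s * 2` would repeat the elements; with a Series it doubles each value.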

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
                   A         B         C         D
2013-01-01 -1.815560 -2.066970  0.083446  0.457541
2013-01-02  0.878679 -1.279713 -2.170213 -0.317302
2013-01-03 -2.504659  0.727344 -0.062560  1.473940
2013-01-04  0.245661  1.158579  0.431938 -0.511382
2013-01-05  0.159338  1.616634 -0.567605 -0.221172
2013-01-06 -0.758477 -1.364958 -0.398516 -0.241077
df2 = pd.DataFrame({'A': 1.,
                     'B': pd.Timestamp('20130102'),
                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                     'D': np.array([3] * 4, dtype='int32'),
                     'E': pd.Categorical(["test", "train", "test", "train"]),
                     'F': 'foo'})
df2
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

View Data

df.head()
                   A         B         C         D
2013-01-01 -1.815560 -2.066970  0.083446  0.457541
2013-01-02  0.878679 -1.279713 -2.170213 -0.317302
2013-01-03 -2.504659  0.727344 -0.062560  1.473940
2013-01-04  0.245661  1.158579  0.431938 -0.511382
2013-01-05  0.159338  1.616634 -0.567605 -0.221172
df.tail(2)
                   A         B         C         D
2013-01-05  0.159338  1.616634 -0.567605 -0.221172
2013-01-06 -0.758477 -1.364958 -0.398516 -0.241077
df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
# Without parentheses, this only displays the bound method rather than converting the data
df.to_xarray
<bound method NDFrame.to_xarray of                    A         B         C         D
2013-01-01 -1.815560 -2.066970  0.083446  0.457541
2013-01-02  0.878679 -1.279713 -2.170213 -0.317302
2013-01-03 -2.504659  0.727344 -0.062560  1.473940
2013-01-04  0.245661  1.158579  0.431938 -0.511382
2013-01-05  0.159338  1.616634 -0.567605 -0.221172
2013-01-06 -0.758477 -1.364958 -0.398516 -0.241077>
df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.632503 -0.201514 -0.447252  0.106758
std    1.311665  1.550118  0.915211  0.746014
min   -2.504659 -2.066970 -2.170213 -0.511382
25%   -1.551289 -1.343647 -0.525333 -0.298246
50%   -0.299569 -0.276185 -0.230538 -0.231124
75%    0.224080  1.050770  0.046944  0.287863
max    0.878679  1.616634  0.431938  1.473940
df.T
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -1.815560    0.878679   -2.504659    0.245661    0.159338   -0.758477
B   -2.066970   -1.279713    0.727344    1.158579    1.616634   -1.364958
C    0.083446   -2.170213   -0.062560    0.431938   -0.567605   -0.398516
D    0.457541   -0.317302    1.473940   -0.511382   -0.221172   -0.241077
df.sort_values(by = 'B')
                   A         B         C         D
2013-01-01 -1.815560 -2.066970  0.083446  0.457541
2013-01-06 -0.758477 -1.364958 -0.398516 -0.241077
2013-01-02  0.878679 -1.279713 -2.170213 -0.317302
2013-01-03 -2.504659  0.727344 -0.062560  1.473940
2013-01-04  0.245661  1.158579  0.431938 -0.511382
2013-01-05  0.159338  1.616634 -0.567605 -0.221172
df.sort_index(axis=1, ascending=False)
                   D         C         B         A
2013-01-01  0.457541  0.083446 -2.066970 -1.815560
2013-01-02 -0.317302 -2.170213 -1.279713  0.878679
2013-01-03  1.473940 -0.062560  0.727344 -2.504659
2013-01-04 -0.511382  0.431938  1.158579  0.245661
2013-01-05 -0.221172 -0.567605  1.616634  0.159338
2013-01-06 -0.241077 -0.398516 -1.364958 -0.758477
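
To get the underlying values as a NumPy array, `DataFrame.to_numpy()` is the usual call. A small sketch on a hypothetical frame named `small` with fixed values (not the random data above), which also repeats the two sorts so the ordering is easy to check by eye:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
small = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [6.0, 5.0, 4.0]}, index=dates)

# 2-D float64 array; the index and column labels are dropped
values = small.to_numpy()

# Ascending by column B, so the last date comes first
by_b = small.sort_values(by="B")

# axis=1 sorts the column labels; descending gives B, A
desc_cols = small.sort_index(axis=1, ascending=False)
```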

Selection

df['A']
2013-01-01   -1.815560
2013-01-02    0.878679
2013-01-03   -2.504659
2013-01-04    0.245661
2013-01-05    0.159338
2013-01-06   -0.758477
Freq: D, Name: A, dtype: float64
df[0:3]
                   A         B         C         D
2013-01-01 -1.815560 -2.066970  0.083446  0.457541
2013-01-02  0.878679 -1.279713 -2.170213 -0.317302
2013-01-03 -2.504659  0.727344 -0.062560  1.473940
df['20130102':'20130104']
                   A         B         C         D
2013-01-02  0.878679 -1.279713 -2.170213 -0.317302
2013-01-03 -2.504659  0.727344 -0.062560  1.473940
2013-01-04  0.245661  1.158579  0.431938 -0.511382
df.loc[dates[0]]
A   -1.815560
B   -2.066970
C    0.083446
D    0.457541
Name: 2013-01-01 00:00:00, dtype: float64
df.loc['20130102':'20130104', ['A', 'B']]
                   A         B
2013-01-02  0.878679 -1.279713
2013-01-03 -2.504659  0.727344
2013-01-04  0.245661  1.158579
df.loc[dates[0], 'A']
-1.8155597588741252
df.iloc[3]
A    0.245661
B    1.158579
C    0.431938
D   -0.511382
Name: 2013-01-04 00:00:00, dtype: float64
df[df.A > 0]
                   A         B         C         D
2013-01-02  0.878679 -1.279713 -2.170213 -0.317302
2013-01-04  0.245661  1.158579  0.431938 -0.511382
2013-01-05  0.159338  1.616634 -0.567605 -0.221172
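
One detail the selections above rely on: label slices with `.loc` include both endpoints, while positional slices with `.iloc` exclude the end. A minimal sketch on a hypothetical frame `demo` with fixed values:

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
demo = pd.DataFrame({"A": [-1, 2, -3, 4, 5, -6], "B": list(range(6))}, index=dates)

# Label slice: both 2013-01-02 and 2013-01-04 are included
by_label = demo.loc["20130102":"20130104"]

# Position slice: rows 1, 2, 3 (the end position 4 is excluded)
by_position = demo.iloc[1:4]

# Boolean mask on the rows combined with a column list
positive = demo.loc[demo["A"] > 0, ["B"]]
```

Here `by_label` and `by_position` select the same three rows even though the slice bounds look different.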

Missing Data

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1
                   A         B         C         D   E
2013-01-01 -1.815560 -2.066970  0.083446  0.457541 NaN
2013-01-02  0.878679 -1.279713 -2.170213 -0.317302 NaN
2013-01-03 -2.504659  0.727344 -0.062560  1.473940 NaN
2013-01-04  0.245661  1.158579  0.431938 -0.511382 NaN
df1.loc[dates[0]:dates[1], 'E'] = 1
df1
                   A         B         C         D    E
2013-01-01 -1.815560 -2.066970  0.083446  0.457541  1.0
2013-01-02  0.878679 -1.279713 -2.170213 -0.317302  1.0
2013-01-03 -2.504659  0.727344 -0.062560  1.473940  NaN
2013-01-04  0.245661  1.158579  0.431938 -0.511382  NaN
df1.fillna(value=5)
                   A         B         C         D    E
2013-01-01 -1.815560 -2.066970  0.083446  0.457541  1.0
2013-01-02  0.878679 -1.279713 -2.170213 -0.317302  1.0
2013-01-03 -2.504659  0.727344 -0.062560  1.473940  5.0
2013-01-04  0.245661  1.158579  0.431938 -0.511382  5.0
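
`fillna` is one of several ways to deal with missing values. A short sketch, assuming a hypothetical frame `frame` with two NaN entries, that also shows `isna` and `dropna`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                      "E": [1.0, 1.0, np.nan, np.nan]})

# Boolean frame, True where a value is missing
mask = frame.isna()

# Drop every row that contains at least one NaN
complete = frame.dropna(how="any")

# fillna returns a new frame; `frame` itself is left unchanged
filled = frame.fillna(value=5)
```

Neither `dropna` nor `fillna` modifies the original here, so the result has to be assigned to a name to be kept.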

Operation

df.mean()
A   -0.632503
B   -0.201514
C   -0.447252
D    0.106758
dtype: float64
df.mean(axis=1)
2013-01-01   -0.835386
2013-01-02   -0.722137
2013-01-03   -0.091484
2013-01-04    0.331199
2013-01-05    0.246799
2013-01-06   -0.690757
Freq: D, dtype: float64
df.apply(lambda x: x.max() - x.min())
A    3.383339
B    3.683604
C    2.602150
D    1.985322
dtype: float64

Grouping

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                       'foo', 'bar', 'foo', 'foo'],
                    'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                    'C': np.random.randn(8),
                    'D': np.random.randn(8)})
df
     A      B         C         D
0  foo    one  0.445369 -0.374983
1  bar    one  1.147522  0.078017
2  foo    two  0.411960 -2.394559
3  bar  three -1.254990 -0.817442
4  foo    two  0.421721 -0.667637
5  bar    two -1.153841  0.159370
6  foo    one  0.170444 -0.078069
7  foo  three -0.302964  0.584654
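# Select the numeric columns explicitly; recent pandas versions would otherwise also "sum" the strings in B
df.groupby('A')[['C', 'D']].sum()
            C         D
A
bar -1.261309 -0.580056
foo  1.146529 -2.930594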
df.groupby(['A', 'B']).sum()
                  C         D
A   B
bar one    1.147522  0.078017
    three -1.254990 -0.817442
    two   -1.153841  0.159370
foo one    0.615813 -0.453052
    three -0.302964  0.584654
    two    0.833681 -3.062196
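
Grouping follows a split, apply, combine pattern: rows are split by key, a function is applied to each group, and the results are combined into one object. A small sketch on a hypothetical frame `sales` with fixed values:

```python
import pandas as pd

sales = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "two"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# One total per distinct value of A
per_a = sales.groupby("A")["C"].sum()

# Grouping by two keys gives a result indexed by (A, B) pairs
per_ab = sales.groupby(["A", "B"])["C"].sum()

# Several aggregations at once, one output column per function
stats = sales.groupby("A")["C"].agg(["sum", "mean"])
```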

In addition, pandas can read CSV and Excel files, which is very convenient; it feels like a Python version of Excel.
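
A minimal sketch of the CSV case. To keep it self-contained, an in-memory string stands in for a file on disk; the column names and values are made up for illustration. `pd.read_excel` is used in much the same way but needs an optional engine such as openpyxl installed, so it is not run here.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a path such as "data.csv"
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv accepts a path or any file-like object
table = pd.read_csv(io.StringIO(csv_text))

# With no path given, to_csv returns the CSV text; index=False omits the row numbers
round_trip = table.to_csv(index=False)
```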
