Official "10 Minutes to pandas" documentation (http://pandas.pydata.org/pandas-docs/stable/10min.html)
This is a short introduction to pandas, geared mainly toward new users. You can find more complex recipes in the Cookbook (http://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook).
[toc]
Environment
- pandas 0.21.0
- Python 3.6
- jupyter notebook
Getting Started
Customarily, we import as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Object Creation
See the Data Structure Intro section.
Create a Series by passing a list of values, letting pandas create a default integer index:
In [4]: s = pd.Series([1,3,5,np.nan,6,8])

In [5]: s
Out[5]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
Create a DataFrame by passing a numpy array, with a datetime index and labeled columns:
In [6]: dates = pd.date_range('20130101', periods=6)

In [7]: dates
Out[7]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [8]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

In [9]: df
Out[9]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
Create a DataFrame by passing a dict of objects that can be converted to series-like.
In [10]: df2 = pd.DataFrame({ 'A' : 1.,
   ....:                      'B' : pd.Timestamp('20130102'),
   ....:                      'C' : pd.Series(1, index=list(range(4)), dtype='float32'),
   ....:                      'D' : np.array([3] * 4, dtype='int32'),
   ....:                      'E' : pd.Categorical(["test", "train", "test", "train"]),
   ....:                      'F' : 'foo' })
   ....:

In [11]: df2
Out[11]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
The columns have specific dtypes:

In [12]: df2.dtypes
Out[12]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
If you're using IPython, pressing TAB will offer completions. Here's a subset of the attributes that will be completed:
In [13]: df2.<TAB>
df2.A                  df2.bool
df2.abs                df2.boxplot
df2.add                df2.C
df2.add_prefix         df2.clip
df2.add_suffix         df2.clip_lower
df2.align              df2.clip_upper
df2.all                df2.columns
df2.any                df2.combine
df2.append             df2.combine_first
df2.apply              df2.compound
df2.applymap           df2.consolidate
df2.D
As you can see, the columns A, B, C, and D are automatically tab completed. E is there as well; the rest of the attributes have been truncated for brevity.
Viewing Data
See the Basics section (http://pandas.pydata.org/pandas-docs/stable/basics.html#basics).
View the top and bottom rows of the frame:
In [14]: df.head()
Out[14]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

In [15]: df.tail(3)
Out[15]:
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
Display the index, the columns, and the underlying numpy data:

In [16]: df.index
Out[16]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [17]: df.columns
Out[17]: Index(['A', 'B', 'C', 'D'], dtype='object')

In [18]: df.values
Out[18]:
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])
describe() shows a quick statistic summary of your data (std is the standard deviation):

In [19]: df.describe()
Out[19]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804
Transposing your data:

In [20]: df.T
Out[20]:
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988
Sorting by an axis:

In [21]: df.sort_index(axis=1, ascending=False)
Out[21]:
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690
Sorting by values:

In [22]: df.sort_values(by='B')
Out[22]:
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
Selection
See the indexing documentation: Indexing and Selecting Data (http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing) and MultiIndex / Advanced Indexing (http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced).
Getting
Selecting a single column, which yields a Series, equivalent to df.A:
In [23]: df['A']
Out[23]:
2013-01-01    0.469112
2013-01-02    1.212112
2013-01-03   -0.861849
2013-01-04    0.721555
2013-01-05   -0.424972
2013-01-06   -0.673690
Freq: D, Name: A, dtype: float64
Selecting via [], which slices the rows:
In [24]: df[0:3]
Out[24]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804

In [25]: df['20130102':'20130104']
Out[25]:
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
Selection by Label
See Selection by Label (http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-label).
Getting a cross section (a whole row) using a label:
In [26]: df.loc[dates[0]]
Out[26]:
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64
Selecting multiple columns by label:

In [27]: df.loc[:,['A','B']]
Out[27]:
                   A         B
2013-01-01  0.469112 -0.282863
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
2013-01-06 -0.673690  0.113648
Label slicing, with both endpoints included:

In [28]: df.loc['20130102':'20130104',['A','B']]
Out[28]:
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
Reducing the dimensions of the returned object:

In [29]: df.loc['20130102',['A','B']]
Out[29]:
A    1.212112
B   -0.173215
Name: 2013-01-02 00:00:00, dtype: float64
Getting a scalar value:

In [30]: df.loc[dates[0],'A']
Out[30]: 0.46911229990718628

For fast access to a scalar (equivalent to the prior method):

In [31]: df.at[dates[0],'A']
Out[31]: 0.46911229990718628
Selection by Position
See Selection by Position (http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer).
Select via the position of the passed integers:
In [32]: df.iloc[3]
Out[32]:
A    0.721555
B   -0.706771
C   -1.039575
D    0.271860
Name: 2013-01-04 00:00:00, dtype: float64
By integer slices, acting similarly to numpy/Python:

In [33]: df.iloc[3:5,0:2]
Out[33]:
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
By lists of integer position locations, similar to the numpy/Python style:

In [34]: df.iloc[[1,2,4],[0,2]]
Out[34]:
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232
For slicing rows explicitly:

In [35]: df.iloc[1:3,:]
Out[35]:
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
For slicing columns explicitly:

In [36]: df.iloc[:,1:3]
Out[36]:
                   B         C
2013-01-01 -0.282863 -1.509059
2013-01-02 -0.173215  0.119209
2013-01-03 -2.104569 -0.494929
2013-01-04 -0.706771 -1.039575
2013-01-05  0.567020  0.276232
2013-01-06  0.113648 -1.478427
For getting a value explicitly:

In [37]: df.iloc[1,1]
Out[37]: -0.17321464905330858
For fast access to a scalar (equivalent to the prior method):

In [38]: df.iat[1,1]
Out[38]: -0.17321464905330858
Boolean Indexing
Using a single column's values to select data:
In [39]: df[df.A > 0]
Out[39]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
Selecting values from a DataFrame where a boolean condition is met:

In [40]: df[df > 0]
Out[40]:
                   A         B         C         D
2013-01-01  0.469112       NaN       NaN       NaN
2013-01-02  1.212112       NaN  0.119209       NaN
2013-01-03       NaN       NaN       NaN  1.071804
2013-01-04  0.721555       NaN       NaN  0.271860
2013-01-05       NaN  0.567020  0.276232       NaN
2013-01-06       NaN  0.113648       NaN  0.524988
Using the isin() method for filtering:

In [41]: df2 = df.copy()

In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']

In [43]: df2
Out[43]:
                   A         B         C         D      E
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three

In [44]: df2[df2['E'].isin(['two','four'])]
Out[44]:
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four
Setting
Setting a new column automatically aligns the data by the indexes:
In [45]: s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))

In [46]: s1
Out[46]:
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [47]: df['F'] = s1
Setting values by label:

In [48]: df.at[dates[0],'A'] = 0
Setting values by position:

In [49]: df.iat[0,1] = 0
Setting by assigning with a numpy array:

In [50]: df.loc[:,'D'] = np.array([5] * len(df))
The result of the prior setting operations:

In [51]: df
Out[51]:
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059  5  NaN
2013-01-02  1.212112 -0.173215  0.119209  5  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0
2013-01-05 -0.424972  0.567020  0.276232  5  4.0
2013-01-06 -0.673690  0.113648 -1.478427  5  5.0
A where operation combined with setting:
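The code for this step is missing from the source article; a minimal sketch, assuming (purely for illustration) that we want to negate every positive value:

df2 = df.copy()        # work on a copy so the original df is left untouched
df2[df2 > 0] = -df2    # wherever the boolean mask holds, assign the negated value
df2                    # no entry is positive any more; NaN entries stay NaN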
Missing Data
pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section (http://pandas.pydata.org/pandas-docs/stable/missing_data.html#missing-data).
Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.
In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])

In [56]: df1.loc[dates[0]:dates[1],'E'] = 1

In [57]: df1
Out[57]:
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  NaN  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  NaN
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  NaN
To drop any rows that have missing data:

In [58]: df1.dropna(how='any')
Out[58]:
                   A         B         C  D    F    E
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
Filling missing data:

In [59]: df1.fillna(value=5)
Out[59]:
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  5.0  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  5.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  5.0
To get the boolean mask where values are nan:

In [60]: pd.isna(df1)
Out[60]:
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True
Operations
See the Basic section on Binary Ops (http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-binop).
Stats
Operations in general exclude missing data.
Performing a descriptive statistic:
In [61]: df.mean()
Out[61]:
A   -0.004474
B   -0.383981
C   -0.687758
D    5.000000
F    3.000000
dtype: float64
The same operation on the other axis:

In [62]: df.mean(1)
Out[62]:
2013-01-01    0.872735
2013-01-02    1.431621
2013-01-03    0.707731
2013-01-04    1.395042
2013-01-05    1.883656
2013-01-06    1.592306
Freq: D, dtype: float64
Operating with objects that have a different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.

In [63]: s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)

In [64]: s
Out[64]:
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [65]: df.sub(s, axis='index')
Out[65]:
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03 -1.861849 -3.104569 -1.494929  4.0  1.0
2013-01-04 -2.278445 -3.706771 -4.039575  2.0  0.0
2013-01-05 -5.424972 -4.432980 -4.723768  0.0 -1.0
2013-01-06       NaN       NaN       NaN  NaN  NaN
Apply
Applying functions to the data:
In [66]: df.apply(np.cumsum)
Out[66]:
                   A         B         C   D     F
2013-01-01  0.000000  0.000000 -1.509059   5   NaN
2013-01-02  1.212112 -0.173215 -1.389850  10   1.0
2013-01-03  0.350263 -2.277784 -1.884779  15   3.0
2013-01-04  1.071818 -2.984555 -2.924354  20   6.0
2013-01-05  0.646846 -2.417535 -2.648122  25  10.0
2013-01-06 -0.026844 -2.303886 -4.126549  30  15.0

In [67]: df.apply(lambda x: x.max() - x.min())
Out[67]:
A    2.073961
B    2.671590
C    1.785291
D    0.000000
F    4.000000
dtype: float64
Histogramming
See Histogramming and Discretization (http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-discretization).
In [68]: s = pd.Series(np.random.randint(0, 7, size=10))

In [69]: s
Out[69]:
0    4
1    2
2    1
3    2
4    6
5    4
6    4
7    6
8    4
9    4
dtype: int64

In [70]: s.value_counts()
Out[70]:
4    5
6    2
2    2
1    1
dtype: int64
String Methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern matching in str generally uses regular expressions by default (and in some cases always uses them); a small regex example follows the snippet. See more at Vectorized String Methods (http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods).
In [71]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In [72]: s.str.lower()
Out[72]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
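To illustrate the regex point above, here is a small sketch that is not part of the original tutorial; it simply reuses the Series s defined in In [71]:

# '^[ab]' is a regular expression: it matches elements that start with 'a' or 'b'.
# The result is an element-wise boolean Series ('A', 'B', 'Aaba', 'Baca' are True),
# and the missing value stays NaN rather than raising an error.
s.str.contains('^[ab]', case=False)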
Merge
Concat
pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
See the Merging section (http://pandas.pydata.org/pandas-docs/stable/merging.html#merging).
Concatenating pandas objects together with concat():
In [73]: df = pd.DataFrame(np.random.randn(10, 4))

In [74]: df
Out[74]:
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495

# break it into pieces
In [75]: pieces = [df[:3], df[3:7], df[7:]]

In [76]: pd.concat(pieces)
Out[76]:
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495
Join
SQL style merges. See the Database style joining section.
In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [79]: left
Out[79]:
   key  lval
0  foo     1
1  foo     2

In [80]: right
Out[80]:
   key  rval
0  foo     4
1  foo     5

In [81]: pd.merge(left, right, on='key')
Out[81]:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
Another example that can be given is:

In [82]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})

In [83]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [84]: left
Out[84]:
   key  lval
0  foo     1
1  bar     2

In [85]: right
Out[85]:
   key  rval
0  foo     4
1  bar     5

In [86]: pd.merge(left, right, on='key')
Out[86]:
   key  lval  rval
0  foo     1     4
1  bar     2     5
Append
Append rows to a dataframe. See the Appending section.
In [87]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])

In [88]: df
Out[88]:
          A         B         C         D
0  1.346061  1.511763  1.627081 -0.990582
1 -0.441652  1.211526  0.268520  0.024580
2 -1.577585  0.396823 -0.105381 -0.532532
3  1.453749  1.208843 -0.080952 -0.264610
4 -0.727965 -0.589346  0.339969 -0.693205
5 -0.339355  0.593616  0.884345  1.591431
6  0.141809  0.220390  0.435589  0.192451
7 -0.096701  0.803351  1.715071 -0.708758

In [89]: s = df.iloc[3]

In [90]: df.append(s, ignore_index=True)
Out[90]:
          A         B         C         D
0  1.346061  1.511763  1.627081 -0.990582
1 -0.441652  1.211526  0.268520  0.024580
2 -1.577585  0.396823 -0.105381 -0.532532
3  1.453749  1.208843 -0.080952 -0.264610
4 -0.727965 -0.589346  0.339969 -0.693205
5 -0.339355  0.593616  0.884345  1.591431
6  0.141809  0.220390  0.435589  0.192451
7 -0.096701  0.803351  1.715071 -0.708758
8  1.453749  1.208843 -0.080952 -0.264610
Grouping
By "group by" we are referring to a process involving one or more of the following steps:
- Splitting: splitting the data into groups based on some criteria
- Applying: applying a function to each group independently
- Combining: combining the results into a data structure
See the Grouping section (http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby).
In [91]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ....:                           'foo', 'bar', 'foo', 'foo'],
   ....:                    'B' : ['one', 'one', 'two', 'three',
   ....:                           'two', 'two', 'one', 'three'],
   ....:                    'C' : np.random.randn(8),
   ....:                    'D' : np.random.randn(8)})
   ....:

In [92]: df
Out[92]:
     A      B         C         D
0  foo    one -1.202872 -0.055224
1  bar    one -1.814470  2.395985
2  foo    two  1.018601  1.552825
3  bar  three -0.595447  0.166599
4  foo    two  1.395433  0.047609
5  bar    two -0.392670 -0.136473
6  foo    one  0.007207 -0.561757
7  foo  three  1.928123 -1.623033
Grouping and then applying the sum function to the resulting groups:

In [93]: df.groupby('A').sum()
Out[93]:
            C        D
A
bar -2.802588  2.42611
foo  3.146492 -0.63958
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function:

In [94]: df.groupby(['A','B']).sum()
Out[94]:
                  C         D
A   B
bar one   -1.814470  2.395985
    three -0.595447  0.166599
    two   -0.392670 -0.136473
foo one   -1.195665 -0.616981
    three  1.928123 -1.623033
    two    2.414034  1.600434
Reshaping
See the sections on Hierarchical Indexing (http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-hierarchical) and Reshaping (http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-stacking).
Stack
In [95]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
   ....:                      'foo', 'foo', 'qux', 'qux'],
   ....:                     ['one', 'two', 'one', 'two',
   ....:                      'one', 'two', 'one', 'two']]))
   ....:

In [96]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [97]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

In [98]: df2 = df[:4]

In [99]: df2
Out[99]:
                     A         B
first second
bar   one     0.029399 -0.542108
      two     0.282696 -0.087302
baz   one    -1.575170  1.771208
      two     0.816482  1.100230
The stack() method "compresses" a level in the DataFrame's columns.

In [100]: stacked = df2.stack()

In [101]: stacked
Out[101]:
first  second
bar    one     A    0.029399
               B   -0.542108
       two     A    0.282696
               B   -0.087302
baz    one     A   -1.575170
               B    1.771208
       two     A    0.816482
               B    1.100230
dtype: float64
With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:

In [102]: stacked.unstack()
Out[102]:
                     A         B
first second
bar   one     0.029399 -0.542108
      two     0.282696 -0.087302
baz   one    -1.575170  1.771208
      two     0.816482  1.100230

In [103]: stacked.unstack(1)
Out[103]:
second        one       two
first
bar   A  0.029399  0.282696
      B -0.542108 -0.087302
baz   A -1.575170  0.816482
      B  1.771208  1.100230

In [104]: stacked.unstack(0)
Out[104]:
first          bar       baz
second
one    A  0.029399 -1.575170
       B -0.542108  1.771208
two    A  0.282696  0.816482
       B -0.087302  1.100230
Pivot Tables
See the section on Pivot Tables (http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-pivot).
In [105]: df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
   .....:                    'B' : ['A', 'B', 'C'] * 4,
   .....:                    'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
   .....:                    'D' : np.random.randn(12),
   .....:                    'E' : np.random.randn(12)})
   .....:

In [106]: df
Out[106]:
        A  B    C         D         E
0     one  A  foo  1.418757 -0.179666
1     one  B  foo -1.879024  1.291836
2     two  C  foo  0.536826 -0.009614
3   three  A  bar  1.006160  0.392149
4     one  B  bar -0.029716  0.264599
5     one  C  bar -1.146178 -0.057409
6     two  A  foo  0.100900 -1.425638
7   three  B  foo -1.035018  1.024098
8     one  C  foo  0.314665 -0.106062
9     one  A  bar -0.773723  1.824375
10    two  B  bar -1.170653  0.595974
11  three  C  bar  0.648740  1.167115
We can produce pivot tables from this data very easily:

In [107]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[107]:
C             bar       foo
A     B
one   A -0.773723  1.418757
      B -0.029716 -1.879024
      C -1.146178  0.314665
three A  1.006160       NaN
      B       NaN -1.035018
      C  0.648740       NaN
two   A       NaN  0.100900
      B -1.170653       NaN
      C       NaN  0.536826
Time Series
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries).
In [108]: rng = pd.date_range('1/1/2012', periods=100, freq='S')

In [109]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [110]: ts.resample('5Min').sum()
Out[110]:
2012-01-01    25083
Freq: 5T, dtype: int64
Time zone representation:

In [111]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')

In [112]: ts = pd.Series(np.random.randn(len(rng)), rng)

In [113]: ts
Out[113]:
2012-03-06    0.464000
2012-03-07    0.227371
2012-03-08   -0.496922
2012-03-09    0.306389
2012-03-10   -2.290613
Freq: D, dtype: float64

In [114]: ts_utc = ts.tz_localize('UTC')

In [115]: ts_utc
Out[115]:
2012-03-06 00:00:00+00:00    0.464000
2012-03-07 00:00:00+00:00    0.227371
2012-03-08 00:00:00+00:00   -0.496922
2012-03-09 00:00:00+00:00    0.306389
2012-03-10 00:00:00+00:00   -2.290613
Freq: D, dtype: float64
Converting to another time zone:

In [116]: ts_utc.tz_convert('US/Eastern')
Out[116]:
2012-03-05 19:00:00-05:00    0.464000
2012-03-06 19:00:00-05:00    0.227371
2012-03-07 19:00:00-05:00   -0.496922
2012-03-08 19:00:00-05:00    0.306389
2012-03-09 19:00:00-05:00   -2.290613
Freq: D, dtype: float64
Converting between time span representations:

In [117]: rng = pd.date_range('1/1/2012', periods=5, freq='M')

In [118]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [119]: ts
Out[119]:
2012-01-31   -1.134623
2012-02-29   -1.561819
2012-03-31   -0.260838
2012-04-30    0.281957
2012-05-31    1.523962
Freq: M, dtype: float64

In [120]: ps = ts.to_period()

In [121]: ps
Out[121]:
2012-01   -1.134623
2012-02   -1.561819
2012-03   -0.260838
2012-04    0.281957
2012-05    1.523962
Freq: M, dtype: float64

In [122]: ps.to_timestamp()
Out[122]:
2012-01-01   -1.134623
2012-02-01   -1.561819
2012-03-01   -0.260838
2012-04-01    0.281957
2012-05-01    1.523962
Freq: MS, dtype: float64
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with the year ending in November to 9am at the end of the month following the quarter end:

In [123]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')

In [124]: ts = pd.Series(np.random.randn(len(prng)), prng)

In [125]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

In [126]: ts.head()
Out[126]:
1990-03-01 09:00   -0.902937
1990-06-01 09:00    0.068159
1990-09-01 09:00   -0.057873
1990-12-01 09:00   -0.368204
1991-03-01 09:00   -1.144073
Freq: H, dtype: float64
Categoricals
pandas can include categorical data in a DataFrame. For full docs, see the Categorical introduction (http://pandas.pydata.org/pandas-docs/stable/categorical.html#categorical) and the API documentation (http://pandas.pydata.org/pandas-docs/stable/api.html#api-categorical).
In [127]: df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
Convert the raw grades to a categorical data type:

In [128]: df["grade"] = df["raw_grade"].astype("category")

In [129]: df["grade"]
Out[129]:
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
Rename the categories to more meaningful names (assigning to Series.cat.categories is in place!):

In [130]: df["grade"].cat.categories = ["very good", "good", "very bad"]
Reorder the categories and simultaneously add the missing categories (methods under Series.cat return a new Series by default):

In [131]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])

In [132]: df["grade"]
Out[132]:
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
Sorting is per order in the categories, not lexical order:

In [133]: df.sort_values(by="grade")
Out[133]:
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
Grouping by a categorical column also shows empty categories:

In [134]: df.groupby("grade").size()
Out[134]:
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
Plotting
See the Plotting docs (http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization).
In [135]: ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))

In [136]: ts = ts.cumsum()

In [137]: ts.plot()
Out[137]: <matplotlib.axes._subplots.AxesSubplot at 0x1122ad630>
On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:

In [138]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
   .....:                   columns=['A', 'B', 'C', 'D'])
   .....:

In [139]: df = df.cumsum()

In [140]: plt.figure(); df.plot(); plt.legend(loc='best')
Out[140]: <matplotlib.legend.Legend at 0x115033cf8>
Getting Data In/Out
CSV
Writing to a csv file (http://pandas.pydata.org/pandas-docs/stable/io.html#io-store-in-csv):
In [141]: df.to_csv('foo.csv')
Reading from a csv file (http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table):

In [142]: pd.read_csv('foo.csv')
Out[142]:
     Unnamed: 0          A          B         C          D
0    2000-01-01   0.266457  -0.399641 -0.219582   1.186860
1    2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
2    2000-01-03  -1.734933   0.530468  2.060811  -0.515536
3    2000-01-04  -1.555121   1.452620  0.239859  -1.156896
4    2000-01-05   0.578117   0.511371  0.103552  -2.428202
5    2000-01-06   0.478344   0.449933 -0.741620  -1.962409
6    2000-01-07   1.235339  -0.091757 -1.543861  -1.084753
..          ...        ...        ...       ...        ...
993  2002-09-20 -10.628548  -9.153563 -7.883146  28.313940
994  2002-09-21 -10.390377  -8.727491 -6.399645  30.914107
995  2002-09-22  -8.985362  -8.485624 -4.669462  31.367740
996  2002-09-23  -9.558560  -8.781216 -4.499815  30.518439
997  2002-09-24  -9.902058  -9.340490 -4.386639  30.105593
998  2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
999  2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 5 columns]
HDF5
Reading and writing to HDFStores: http://pandas.pydata.org/pandas-docs/stable/io.html#io-hdf5
Writing to an HDF5 store:
In [143]: df.to_hdf('foo.h5', 'df')
Reading from an HDF5 store:

In [144]: pd.read_hdf('foo.h5', 'df')
Out[144]:
                    A          B         C          D
2000-01-01   0.266457  -0.399641 -0.219582   1.186860
2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
2000-01-03  -1.734933   0.530468  2.060811  -0.515536
2000-01-04  -1.555121   1.452620  0.239859  -1.156896
2000-01-05   0.578117   0.511371  0.103552  -2.428202
2000-01-06   0.478344   0.449933 -0.741620  -1.962409
2000-01-07   1.235339  -0.091757 -1.543861  -1.084753
...               ...        ...       ...        ...
2002-09-20 -10.628548  -9.153563 -7.883146  28.313940
2002-09-21 -10.390377  -8.727491 -6.399645  30.914107
2002-09-22  -8.985362  -8.485624 -4.669462  31.367740
2002-09-23  -9.558560  -8.781216 -4.499815  30.518439
2002-09-24  -9.902058  -9.340490 -4.386639  30.105593
2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 4 columns]
Excel
Reading and writing to MS Excel: http://pandas.pydata.org/pandas-docs/stable/io.html#io-excel
Writing to an excel file:
In [145]: df.to_excel('foo.xlsx', sheet_name='Sheet1')
Reading from an excel file:

In [146]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Out[146]:
                    A          B         C          D
2000-01-01   0.266457  -0.399641 -0.219582   1.186860
2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
2000-01-03  -1.734933   0.530468  2.060811  -0.515536
2000-01-04  -1.555121   1.452620  0.239859  -1.156896
2000-01-05   0.578117   0.511371  0.103552  -2.428202
2000-01-06   0.478344   0.449933 -0.741620  -1.962409
2000-01-07   1.235339  -0.091757 -1.543861  -1.084753
...               ...        ...       ...        ...
2002-09-20 -10.628548  -9.153563 -7.883146  28.313940
2002-09-21 -10.390377  -8.727491 -6.399645  30.914107
2002-09-22  -8.985362  -8.485624 -4.669462  31.367740
2002-09-23  -9.558560  -8.781216 -4.499815  30.518439
2002-09-24  -9.902058  -9.340490 -4.386639  30.105593
2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 4 columns]
Gotchas
If you are trying an operation and you see an exception like:

>>> if pd.Series([False, True, False]):
...     print("I was true")
Traceback
    ...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
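The error message itself points at the fix: be explicit about what truthiness should mean for the whole Series. A minimal sketch (not from the original article):

s = pd.Series([False, True, False])

if s.any():        # True if at least one element is True
    print("at least one element was true")

if s.all():        # True only if every element is True
    print("every element was true")

if not s.empty:    # decide based on whether the Series has any elements at all
    print("the series is not empty")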