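Python Data Science Libraries (Part 2)
I. Imports
import pandas as pd # data analysis; built on top of NumPy
import numpy as np # numerical computing; built around the ndarray
II. Creating objects
(I) Series objects
A Series is a one-dimensional labeled data structure, similar to a one-dimensional array.
1. Creating from a sequence
(1) Using the default row labels
# By default the index is integers starting at 0; np.nan marks a missing value, which calculations skip
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
# Output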
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
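To make the remark about np.nan concrete, here is a minimal sketch of how aggregations treat the missing value. The variable names are illustrative only; `skipna` is the standard pandas parameter that controls this behavior.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Aggregations skip NaN by default (skipna=True)
total = s.sum()    # 1 + 3 + 5 + 6 + 8
count = s.count()  # number of non-missing values
mean = s.mean()    # total / count

# With skipna=False the missing value propagates into the result
mean_with_nan = s.mean(skipna=False)

print(total, count, mean, mean_with_nan)
```

Note that the NaN also forces the whole Series to float64, which is why the integers above print as 1.0, 3.0, and so on.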
(2) Using custom row labels
# The index parameter sets the labels (the name of each row)
s = pd.Series([1, 2, 3, 4], index=list("abcd"))
print(s)
# Output
a 1
b 2
c 3
d 4
dtype: int64
2. Creating from a dictionary
# The dictionary keys automatically become the row labels
d = {"name": "Tom", "age": 18}
s = pd.Series(d)
print(s)
# Output
name Tom
age 18
dtype: object
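(II) DataFrame objects
A DataFrame is a table-like object and can be thought of as a two-dimensional array with labeled rows and columns.
1. Creating from a sequence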
dates = pd.date_range("20180112", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df)
# Output
A B C D
2018-01-12 -1.273980 -0.611487 0.694974 0.912239
2018-01-13 -1.161407 -0.054078 -0.085019 0.372066
2018-01-14 -1.639126 0.405671 -0.842639 0.273583
2018-01-15 -0.209314 0.569439 1.809941 0.245962
2018-01-16 0.643041 -0.783864 0.149103 -0.803913
2018-01-17 1.118176 -0.526984 -1.349010 0.155028
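2. Creating from a dictionary
(1) Using the default row labels
# Column values can be of several kinds, but list-like values must all have the same length
d = {
    "A": 1,  # a scalar, broadcast to the whole column
    "B": pd.Timestamp("20180413"),  # a timestamp, broadcast to the whole column
    "C": pd.Series(1, index=list(range(4)), dtype='float32'),  # a Series
    "D": np.array([3] * 4, dtype='int32'),  # an ndarray
    "F": "foo",  # a string, broadcast to the whole column
    "G": [1, 2, 3, 4]  # a list
}
df2 = pd.DataFrame(d)
print(df2)
# dtypes shows the type of each column
print(df2.dtypes)
# Output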
A B C D F G
0 1 2018-04-13 1.0 3 foo 1
1 1 2018-04-13 1.0 3 foo 2
2 1 2018-04-13 1.0 3 foo 3
3 1 2018-04-13 1.0 3 foo 4
A int64
B datetime64[ns]
C float32
D int32
F object
G int64
dtype: object
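The rule that column lengths must agree can be checked directly. The following is a small sketch with made-up column names ("A", "G", "H"): a scalar is broadcast, while two lists of different lengths are rejected.

```python
import pandas as pd

# A scalar is broadcast to the length set by the list-like column
ok = pd.DataFrame({"A": 1, "G": [1, 2, 3, 4]})
print(ok.shape)

# List-like columns of different lengths cannot be combined
try:
    pd.DataFrame({"G": [1, 2, 3, 4], "H": [1, 2, 3]})
    mismatch_raised = False
except ValueError as exc:
    mismatch_raised = True
    print("ValueError:", exc)
```

A Series column behaves differently from a list: it is aligned on its index rather than matched by length, so missing labels are filled with NaN instead of raising.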
(2) Using custom row labels
d = {"name": ["Tom", "Bob", "Lili"], "age": [10, 20, 30]}
df = pd.DataFrame(d, index=["1", "2", "3"])
print(df)
# Output
name age
1 Tom 10
2 Bob 20
3 Lili 30
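III. Viewing data
(I) Viewing the first and last rows
d = {
    "name": ["a", "b", "c", "d", "e", "f", "g"],
    "age": [10, 20, 30, 40, 50, 60, 70]
}
df = pd.DataFrame(d)
# Show the first n rows (n defaults to 5)
print(df.head())
# Show the last n rows (n defaults to 5; here n=3)
print(df.tail(3))
# Output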
name age
0 a 10
1 b 20
2 c 30
3 d 40
4 e 50
name age
4 e 50
5 f 60
6 g 70
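(II) Converting to an ndarray
DataFrame.to_numpy() can be an expensive operation when the columns have different data types.
A NumPy array has a single dtype for the whole array, whereas a DataFrame has one dtype per column.
When a DataFrame is converted, pandas picks a NumPy dtype that can hold the values of every column; for mixed columns this usually ends up being object.
print(df2.to_numpy())
# Output (the resulting array has dtype object)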
[[1 Timestamp('2018-04-13 00:00:00') 1.0 3 'foo' 1]
[1 Timestamp('2018-04-13 00:00:00') 1.0 3 'foo' 2]
[1 Timestamp('2018-04-13 00:00:00') 1.0 3 'foo' 3]
[1 Timestamp('2018-04-13 00:00:00') 1.0 3 'foo' 4]]
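The choice of a common dtype is easy to observe. This sketch uses two small illustrative frames (the names `numeric` and `mixed` are not from the text above): integer and float columns can share float64, while adding a string column forces object.

```python
import pandas as pd

numeric = pd.DataFrame({"x": [1, 2], "y": [1.5, 2.5]})    # int64 + float64 columns
mixed = pd.DataFrame({"x": [1, 2], "label": ["a", "b"]})  # int64 + string columns

numeric_arr = numeric.to_numpy()
mixed_arr = mixed.to_numpy()

print(numeric_arr.dtype)  # one dtype that can hold both numeric columns
print(mixed_arr.dtype)    # falls back to object for mixed content
```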
(II) Viewing the row labels, column labels, and data
d = {
    "name": ["a", "b", "c", "d", "e", "f", "g"],
    "age": [10, 20, 30, 40, 50, 60, 70]
}
df = pd.DataFrame(d, index=['1', '2', '3', '4', '5', '6', '7'])
# Row labels
print(df.index)
# Column labels
print(df.columns)
# Underlying data
print(df.values)
# Output
Index(['1', '2', '3', '4', '5', '6', '7'], dtype='object')
Index(['name', 'age'], dtype='object')
[['a' 10]
['b' 20]
['c' 30]
['d' 40]
['e' 50]
['f' 60]
['g' 70]]
(III) Summary statistics
d = {
    "name": ["a", "b", "c", "d", "e", "f", "g"],
    "age": [10, 20, 30, 40, 50, 60, 70]
}
df = pd.DataFrame(d, index=['1', '2', '3', '4', '5', '6', '7'])
# Show summary statistics; by default only numeric columns are included
print(df.describe())
# Output
age
count 7.000000
mean 40.000000
std 21.602469
min 10.000000
25% 25.000000
50% 40.000000
75% 55.000000
max 70.000000
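The output above contains only the age column because describe() summarizes numeric columns by default. As a hedged sketch, passing `include="all"` asks pandas to summarize every column, which adds statistics such as the number of unique values for the name column.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f", "g"],
    "age": [10, 20, 30, 40, 50, 60, 70],
})

summary = df.describe()            # numeric columns only
full = df.describe(include="all")  # every column, with NaN where a statistic does not apply

print(summary.columns.tolist())
print(full.loc["unique", "name"])  # number of distinct names
```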
(IV) Transposing
d = {
    "name": ["a", "b", "c", "d", "e", "f", "g"],
    "age": [10, 20, 30, 40, 50, 60, 70]
}
df = pd.DataFrame(d, index=['1', '2', '3', '4', '5', '6', '7'])
# Transpose: rows become columns and columns become rows
print(df.T)
# Output
1 2 3 4 5 6 7
name a b c d e f g
age 10 20 30 40 50 60 70
(V) Sorting by labels or by values
d = {
    "name": ["a", "b", "c", "d", "e", "f", "g"],
    "age": [40, 30, 20, 10, 50, 70, 60]
}
df = pd.DataFrame(d, index=['1', '2', '3', '4', '5', '6', '7'])
# Sort by labels: axis=0 sorts by the row labels, ascending=False gives descending order
print(df.sort_index(axis=0, ascending=False))
# Sort by the values of a column
print(df.sort_values(by="age"))
# Output
name age
7 g 60
6 f 70
5 e 50
4 d 10
3 c 20
2 b 30
1 a 40
name age
4 d 10
3 c 20
2 b 30
1 a 40
5 e 50
7 g 60
6 f 70
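The two remaining combinations, sorting the column labels and sorting values in descending order, can be sketched in the same way. The data below is a shortened, illustrative version of the table above.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["a", "b", "c", "d"], "age": [40, 10, 30, 20]},
    index=["1", "2", "3", "4"],
)

# axis=1 sorts the column labels instead of the row labels
by_columns = df.sort_index(axis=1)

# ascending=False sorts the values from largest to smallest
by_age_desc = df.sort_values(by="age", ascending=False)

print(by_columns.columns.tolist())
print(by_age_desc.index.tolist())
```

Both methods return a new DataFrame and leave df unchanged.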
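IV. Selecting data
Note: although the standard Python/NumPy expressions for selecting and setting data are intuitive and convenient for interactive work, for production code we recommend the optimized pandas data access methods .at, .iat, .loc, and .iloc, because they are more efficient.
(I) Selecting a single column
d = {
    "name": ["a", "b", "c", "d", "e", "f", "g"],
    "age": [40, 30, 20, 10, 50, 70, 60]
}
df = pd.DataFrame(d, index=['1', '2', '3', '4', '5', '6', '7'])
# The two forms are equivalent here; attribute access only works when the column name
# is a valid Python identifier and does not clash with a DataFrame attribute or method
print(df["age"])
print(df.age)
# Output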
1 40
2 30
3 20
4 10
5 50
6 70
7 60
Name: age, dtype: int64
1 40
2 30
3 20
4 10
5 50
6 70
7 60
Name: age, dtype: int64
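The equivalence of the two forms has limits. The sketch below uses hypothetical column names ("count" and "first name") chosen to show the two common failure cases of attribute access; bracket access works in all of them.

```python
import pandas as pd

df = pd.DataFrame({"age": [40, 30], "count": [1, 2], "first name": ["a", "b"]})

# For an ordinary column name the two forms return the same Series
same = df["age"].equals(df.age)

# "count" collides with the DataFrame.count method, so df.count is the method
count_is_method = callable(df.count)
count_column = df["count"].tolist()

# A name with a space is not a valid identifier; only brackets can reach it
first_names = df["first name"].tolist()

print(same, count_is_method, count_column, first_names)
```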
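(II) Selecting by slicing
date = pd.date_range("20190810", periods=12)
df = pd.DataFrame(np.random.random((12, 4)), index=date, columns=list("abcd"))
print(df)
# Integer slice: rows at positions 1 to 5 (the end position is excluded)
print(df[1:6])
# Label slice: both endpoints are included
print(df['20190811':'20190815'])
# Output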
a b c d
2019-08-10 0.174172 0.460687 0.617525 0.953005
2019-08-11 0.192005 0.621483 0.486393 0.449644
2019-08-12 0.939978 0.639913 0.006101 0.667578
2019-08-13 0.769165 0.785516 0.640104 0.822523
2019-08-14 0.806900 0.007822 0.650755 0.972287
2019-08-15 0.086788 0.231559 0.083912 0.788760
2019-08-16 0.297233 0.366279 0.770901 0.063383
2019-08-17 0.712211 0.169660 0.766693 0.310412
2019-08-18 0.602567 0.045968 0.388886 0.671598
2019-08-19 0.248061 0.721344 0.982080 0.818999
2019-08-20 0.669051 0.891325 0.384047 0.157094
2019-08-21 0.966698 0.962379 0.543812 0.763704
a b c d
2019-08-11 0.192005 0.621483 0.486393 0.449644
2019-08-12 0.939978 0.639913 0.006101 0.667578
2019-08-13 0.769165 0.785516 0.640104 0.822523
2019-08-14 0.806900 0.007822 0.650755 0.972287
2019-08-15 0.086788 0.231559 0.083912 0.788760
a b c d
2019-08-11 0.192005 0.621483 0.486393 0.449644
2019-08-12 0.939978 0.639913 0.006101 0.667578
2019-08-13 0.769165 0.785516 0.640104 0.822523
2019-08-14 0.806900 0.007822 0.650755 0.972287
2019-08-15 0.086788 0.231559 0.083912 0.788760
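The two slices above return the same rows only because their bounds were chosen to match. A minimal sketch with deterministic data (np.arange instead of random numbers, an assumption made so the result is reproducible) shows the difference in endpoint handling.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190810", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("abcd"))

by_position = df[1:3]                 # positions 1 and 2; position 3 is excluded
by_label = df["20190811":"20190813"]  # labels 08-11 through 08-13, end included

print(len(by_position), len(by_label))
```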
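(III) Selecting by label
dates = pd.date_range("20190810", periods=12)
df = pd.DataFrame(np.random.random((12, 4)), index=dates, columns=list("abcd"))
print(df)
# Select a single row
# loc takes index labels (keys), not positions
print(df.loc[dates[2]])
# Select multiple columns for all rows
print(df.loc[:, ['a', 'b']])
# dates[2:4] is a positional slice of the index, so it yields two labels (the end is excluded);
# a label slice such as '20190812':'20190813' would include both endpoints
print(df.loc[dates[2:4], ['c', 'd']])
# Get a scalar value (method 1)
print(df.loc[dates[0], 'b'])
# Get a scalar value (method 2: faster for a single value)
print(df.at[dates[0], 'b'])
# Output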
a b c d
2019-08-10 0.817112 0.327034 0.979125 0.803224
2019-08-11 0.672825 0.070317 0.466559 0.396858
2019-08-12 0.104463 0.462727 0.526324 0.673956
2019-08-13 0.145612 0.319792 0.043166 0.682091
2019-08-14 0.545719 0.612606 0.781398 0.454193
2019-08-15 0.747837 0.268693 0.692505 0.369949
2019-08-16 0.504633 0.829390 0.126283 0.673119
2019-08-17 0.494251 0.255753 0.991743 0.779379
2019-08-18 0.604067 0.530440 0.997617 0.181184
2019-08-19 0.590527 0.662686 0.392652 0.012199
2019-08-20 0.840013 0.521499 0.703242 0.546945
2019-08-21 0.663295 0.567195 0.890904 0.899999
a 0.104463
b 0.462727
c 0.526324
d 0.673956
Name: 2019-08-12 00:00:00, dtype: float64
a b
2019-08-10 0.817112 0.327034
2019-08-11 0.672825 0.070317
2019-08-12 0.104463 0.462727
2019-08-13 0.145612 0.319792
2019-08-14 0.545719 0.612606
2019-08-15 0.747837 0.268693
2019-08-16 0.504633 0.829390
2019-08-17 0.494251 0.255753
2019-08-18 0.604067 0.530440
2019-08-19 0.590527 0.662686
2019-08-20 0.840013 0.521499
2019-08-21 0.663295 0.567195
c d
2019-08-12 0.526324 0.673956
2019-08-13 0.043166 0.682091
0.32703397959149083
0.32703397959149083
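Label slices can be used on both axes of .loc at once. The following sketch again uses deterministic np.arange data (an assumption for reproducibility) and shows that both the row slice and the column slice include their end labels, and that .loc and .at return the same scalar.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190810", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("abcd"))

# Rows 08-12 to 08-13 and columns b to c, all endpoints included
block = df.loc["20190812":"20190813", "b":"c"]

# Two ways to read one cell; .at is specialised for scalar access
value_loc = df.loc[dates[0], "b"]
value_at = df.at[dates[0], "b"]

print(block)
print(value_loc, value_at)
```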
(IV) Selecting by position
# Data
a b c d
2019-08-10 0.764402 0.166910 0.704202 0.147261
2019-08-11 0.284636 0.396147 0.388614 0.049924
2019-08-12 0.350064 0.678498 0.945081 0.070063
2019-08-13 0.568517 0.158599 0.729990 0.065292
2019-08-14 0.859671 0.392206 0.297203 0.259528
2019-08-15 0.429258 0.739037 0.716649 0.649104
2019-08-16 0.424900 0.487110 0.312030 0.260901
2019-08-17 0.422411 0.309472 0.239190 0.479773
2019-08-18 0.034542 0.077123 0.316601 0.483878
2019-08-19 0.141800 0.321642 0.459081 0.790553
2019-08-20 0.778352 0.957171 0.067329 0.591832
2019-08-21 0.795865 0.232758 0.772976 0.598431
- Select a row by passing its integer position
print(df.iloc[3])
# Output
a 0.568517
b 0.158599
c 0.729990
d 0.065292
Name: 2019-08-13 00:00:00, dtype: float64
- Integer slices, as in Python and NumPy (the end position is excluded)
print(df.iloc[3:5, 0:2])
# Output
a b
2019-08-13 0.568517 0.158599
2019-08-14 0.859671 0.392206
- Lists of integer positions, as in NumPy (fancy indexing)
print(df.iloc[[1, 2, 3], [1, 2, 3]])
# Output
b c d
2019-08-11 0.396147 0.388614 0.049924
2019-08-12 0.678498 0.945081 0.070063
2019-08-13 0.158599 0.729990 0.065292
- Explicitly slicing rows
print(df.iloc[1:3, :])
# Output
a b c d
2019-08-11 0.284636 0.396147 0.388614 0.049924
2019-08-12 0.350064 0.678498 0.945081 0.070063
- Explicitly slicing columns
print(df.iloc[:, 1:3])
# Output
b c
2019-08-10 0.166910 0.704202
2019-08-11 0.396147 0.388614
2019-08-12 0.678498 0.945081
2019-08-13 0.158599 0.729990
2019-08-14 0.392206 0.297203
2019-08-15 0.739037 0.716649
2019-08-16 0.487110 0.312030
2019-08-17 0.309472 0.239190
2019-08-18 0.077123 0.316601
2019-08-19 0.321642 0.459081
2019-08-20 0.957171 0.067329
2019-08-21 0.232758 0.772976
- Explicitly getting a scalar value
print(df.iloc[1, 3])
print(df.iat[1, 3]) # faster for a single value
# Output
0.04992388306712903
0.04992388306712903
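The difference between position-based and label-based access is easiest to see when the index itself consists of integers. This is a small sketch with an illustrative index of 10, 20, 30: .iloc counts positions from 0, while .loc looks up the label.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=[10, 20, 30])

by_position = df.iloc[0, 0]  # first row, first column
by_label = df.loc[10, "x"]   # row labelled 10, column labelled "x"

# Label 0 does not exist, so .loc raises KeyError
try:
    df.loc[0, "x"]
    loc_failed = False
except KeyError:
    loc_failed = True

# Position 10 is out of range, so .iloc raises IndexError
try:
    df.iloc[10, 0]
    iloc_failed = False
except IndexError:
    iloc_failed = True

print(by_position, by_label, loc_failed, iloc_failed)
```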
(V) Boolean indexing
# Data
a b c d
2019-08-10 0.557081 0.323802 0.928420 0.621318
2019-08-11 0.443987 0.200434 0.215998 0.148197
2019-08-12 0.545752 0.484326 0.330282 0.158217
2019-08-13 0.881359 0.532641 0.050309 0.806077
2019-08-14 0.510212 0.149624 0.257880 0.722024
2019-08-15 0.874834 0.055305 0.395142 0.237985
2019-08-16 0.389190 0.398724 0.889894 0.356322
2019-08-17 0.157651 0.942913 0.390689 0.589924
2019-08-18 0.581344 0.688228 0.687062 0.590249
2019-08-19 0.405766 0.260050 0.328965 0.685247
2019-08-20 0.691915 0.785985 0.480453 0.030770
2019-08-21 0.703382 0.749799 0.831952 0.602684
- Use the values of a single column to select rows
print(df[df.a > 0.5])
# Output
a b c d
2019-08-10 0.557081 0.323802 0.928420 0.621318
2019-08-12 0.545752 0.484326 0.330282 0.158217
2019-08-13 0.881359 0.532641 0.050309 0.806077
2019-08-14 0.510212 0.149624 0.257880 0.722024
2019-08-15 0.874834 0.055305 0.395142 0.237985
2019-08-18 0.581344 0.688228 0.687062 0.590249
2019-08-20 0.691915 0.785985 0.480453 0.030770
2019-08-21 0.703382 0.749799 0.831952 0.602684
- Select the values of a DataFrame that satisfy a boolean condition (the others become NaN)
print(df[df > 0.5])
# Output
a b c d
2019-08-10 0.557081 NaN 0.928420 0.621318
2019-08-11 NaN NaN NaN NaN
2019-08-12 0.545752 NaN NaN NaN
2019-08-13 0.881359 0.532641 NaN 0.806077
2019-08-14 0.510212 NaN NaN 0.722024
2019-08-15 0.874834 NaN NaN NaN
2019-08-16 NaN NaN 0.889894 NaN
2019-08-17 NaN 0.942913 NaN 0.589924
2019-08-18 0.581344 0.688228 0.687062 0.590249
2019-08-19 NaN NaN NaN 0.685247
2019-08-20 0.691915 0.785985 NaN NaN
2019-08-21 0.703382 0.749799 0.831952 0.602684
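- Filter with the isin() method
df2 = df.copy()
df2['e'] = list(range(12))
print(df2)
# Column 'e' holds integers, so the values passed to isin() must be integers too;
# the strings '1' and '2' would match nothing
print(df2[df2['e'].isin([1, 2])])
# Output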
a b c d e
2019-08-10 0.557081 0.323802 0.928420 0.621318 0
2019-08-11 0.443987 0.200434 0.215998 0.148197 1
2019-08-12 0.545752 0.484326 0.330282 0.158217 2
2019-08-13 0.881359 0.532641 0.050309 0.806077 3
2019-08-14 0.510212 0.149624 0.257880 0.722024 4
2019-08-15 0.874834 0.055305 0.395142 0.237985 5
2019-08-16 0.389190 0.398724 0.889894 0.356322 6
2019-08-17 0.157651 0.942913 0.390689 0.589924 7
2019-08-18 0.581344 0.688228 0.687062 0.590249 8
2019-08-19 0.405766 0.260050 0.328965 0.685247 9
2019-08-20 0.691915 0.785985 0.480453 0.030770 10
2019-08-21 0.703382 0.749799 0.831952 0.602684 11
a b c d e
2019-08-11 0.443987 0.200434 0.215998 0.148197 1
2019-08-12 0.545752 0.484326 0.330282 0.158217 2
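Several conditions can be combined in one boolean index. In this sketch (small illustrative data, not the random table above) each condition is wrapped in parentheses and joined with `&` (and) or `|` (or); the Python keywords `and` and `or` do not work element-wise on a Series.

```python
import pandas as pd

df = pd.DataFrame({"a": [0.2, 0.6, 0.9, 0.4], "e": [0, 1, 2, 3]})

# Rows where a > 0.5 AND e is 1 or 3
both = df[(df["a"] > 0.5) & (df["e"].isin([1, 3]))]

# Rows where a > 0.5 OR e equals 0
either = df[(df["a"] > 0.5) | (df["e"] == 0)]

print(both.index.tolist())
print(either.index.tolist())
```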
V. Setting data
(I) Setting a column from a Series
dates = pd.date_range("20190810", periods=12)
df = pd.DataFrame(np.random.random((12, 4)), index=dates, columns=list("abcd"))
print(df)
print()
# The Series is aligned with the DataFrame by index label
s1 = pd.Series(list(range(12)), index=dates)
print(s1)
df['e'] = s1
print(df)
# Output
a b c d
2019-08-10 0.739508 0.655913 0.643200 0.022093
2019-08-11 0.558993 0.379173 0.884394 0.880572
2019-08-12 0.233417 0.361764 0.694392 0.473259
2019-08-13 0.490941 0.583280 0.129630 0.158678
2019-08-14 0.730997 0.692483 0.385840 0.777237
2019-08-15 0.277806 0.209064 0.951801 0.440764
2019-08-16 0.315110 0.836244 0.964863 0.463869
2019-08-17 0.044120 0.353705 0.995395 0.827520
2019-08-18 0.520424 0.749487 0.385006 0.204673
2019-08-19 0.161438 0.670618 0.886535 0.077410
2019-08-20 0.389793 0.924471 0.757761 0.876182
2019-08-21 0.138309 0.338697 0.787739 0.393892
2019-08-10 0
2019-08-11 1
2019-08-12 2
2019-08-13 3
2019-08-14 4
2019-08-15 5
2019-08-16 6
2019-08-17 7
2019-08-18 8
2019-08-19 9
2019-08-20 10
2019-08-21 11
Freq: D, dtype: int64
a b c d e
2019-08-10 0.739508 0.655913 0.643200 0.022093 0
2019-08-11 0.558993 0.379173 0.884394 0.880572 1
2019-08-12 0.233417 0.361764 0.694392 0.473259 2
2019-08-13 0.490941 0.583280 0.129630 0.158678 3
2019-08-14 0.730997 0.692483 0.385840 0.777237 4
2019-08-15 0.277806 0.209064 0.951801 0.440764 5
2019-08-16 0.315110 0.836244 0.964863 0.463869 6
2019-08-17 0.044120 0.353705 0.995395 0.827520 7
2019-08-18 0.520424 0.749487 0.385006 0.204673 8
2019-08-19 0.161438 0.670618 0.886535 0.077410 9
2019-08-20 0.389793 0.924471 0.757761 0.876182 10
2019-08-21 0.138309 0.338697 0.787739 0.393892 11
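Assigning a Series matches values to rows by index label, not by order. The sketch below, using a shortened illustrative frame, assigns a Series that covers only two of the four dates and lists them out of order; the uncovered rows receive NaN.

```python
import pandas as pd

dates = pd.date_range("20190810", periods=4)
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0]}, index=dates)

# A Series covering only part of the index, in a different order
s = pd.Series([30, 20], index=[dates[2], dates[1]])
df["e"] = s

print(df)
```

Because of the NaN values, the new column is stored as float64 even though the Series held integers.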
(II) Setting a value by label
df.at[dates[0], 'a'] = 10000
print(df)
# Output
a b c d
2019-08-10 10000.000000 0.635320 0.010086 0.005882
2019-08-11 0.240779 0.971029 0.001552 0.432084
2019-08-12 0.777127 0.753467 0.011913 0.285377
2019-08-13 0.075236 0.735161 0.537506 0.388902
2019-08-14 0.496185 0.028398 0.030284 0.871697
2019-08-15 0.047100 0.422052 0.692298 0.477713
2019-08-16 0.900423 0.201646 0.804558 0.094761
2019-08-17 0.229455 0.678642 0.789862 0.287174
2019-08-18 0.722047 0.966331 0.934613 0.362185
2019-08-19 0.414051 0.987032 0.168833 0.955266
2019-08-20 0.925989 0.409541 0.575051 0.284847
2019-08-21 0.098360 0.261987 0.577213 0.978246
(III) Setting a value by position
df.iat[0, 0] = 20000
print(df)
# Output
a b c d
2019-08-10 20000.000000 0.115457 0.685444 0.705060
2019-08-11 0.498748 0.166391 0.571270 0.996905
2019-08-12 0.587192 0.084088 0.151117 0.896670
2019-08-13 0.289092 0.294163 0.449005 0.196895
2019-08-14 0.872205 0.184757 0.033095 0.702049
2019-08-15 0.670715 0.518818 0.147044 0.539214
2019-08-16 0.980973 0.102361 0.667249 0.931526
2019-08-17 0.335769 0.047684 0.953198 0.676727
2019-08-18 0.218863 0.371770 0.468017 0.159810
2019-08-19 0.140440 0.281763 0.182614 0.539620
2019-08-20 0.357675 0.666259 0.235839 0.402222
2019-08-21 0.042178 0.764067 0.339020 0.954424
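(IV) Setting values with a NumPy array
# Assign an array with one element per row to column 'd'
# (depending on the pandas version, the column may keep its float dtype and display 5.0)
df.loc[:, 'd'] = np.array([5] * len(df))
print(df)
# Output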
a b c d
2019-08-10 0.303649 0.099096 0.494954 5
2019-08-11 0.968202 0.704928 0.036409 5
2019-08-12 0.488337 0.940673 0.955685 5
2019-08-13 0.796000 0.387385 0.180210 5
2019-08-14 0.670997 0.732318 0.228759 5
2019-08-15 0.041587 0.599849 0.531155 5
2019-08-16 0.283556 0.341993 0.999993 5
2019-08-17 0.463613 0.873346 0.590491 5
2019-08-18 0.274485 0.164301 0.448453 5
2019-08-19 0.124367 0.434915 0.361775 5
2019-08-20 0.296049 0.282478 0.780274 5
2019-08-21 0.889193 0.611455 0.699893 5
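Setting can also be combined with the boolean indexing shown earlier. As a minimal sketch on illustrative data, a boolean mask in the row position of .loc restricts the assignment to the matching rows, and a full-column assignment with an array replaces every value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0.2, 0.6, 0.9], "d": [1.0, 2.0, 3.0]})

# Set d to 5.0 only where a > 0.5
df.loc[df["a"] > 0.5, "d"] = 5.0

# Overwrite the whole column a with an array of matching length
df.loc[:, "a"] = np.array([7.0] * len(df))

print(df)
```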