2. Basic Data Preparation
The raw data is not ready to use. We must prepare (process) it first.
Below are the first few rows of the raw dataset.
No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
1,2010,1,1,0,NA,-21,-11,1021,NW,1.79,0,0
2,2010,1,1,1,NA,-21,-12,1020,NW,4.92,0,0
3,2010,1,1,2,NA,-21,-11,1019,NW,6.71,0,0
4,2010,1,1,3,NA,-21,-14,1019,NW,9.84,0,0
5,2010,1,1,4,NA,-20,-12,1018,NW,12.97,0,0
The first step is to consolidate the date-time information (the year, month, day and hour columns) into a single date-time column so that we can use it as the index in Pandas.
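As a minimal sketch of this consolidation step, the snippet below builds a date-time index from the four columns with pd.to_datetime. It reads a few sample rows from an in-memory string instead of the real file, and it is an alternative to the date_parser approach used in the tutorial script, not a copy of it.

```python
import io
import pandas as pd

# A few rows copied from the raw sample above, held in memory
# so that the sketch does not depend on raw.csv being present.
raw = io.StringIO(
    "No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir\n"
    "1,2010,1,1,0,NA,-21,-11,1021,NW,1.79,0,0\n"
    "2,2010,1,1,1,NA,-21,-12,1020,NW,4.92,0,0\n"
    "3,2010,1,1,2,NA,-21,-11,1019,NW,6.71,0,0\n"
)
df = pd.read_csv(raw)

# to_datetime accepts a DataFrame whose columns are named
# year / month / day / hour and returns one timestamp per row.
stamps = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
df.index = pd.DatetimeIndex(stamps, name='date')

# The separate date parts and the row counter are no longer needed.
df = df.drop(columns=['No', 'year', 'month', 'day', 'hour'])
print(df.head())
```

Building the index after loading keeps the parsing step explicit and avoids relying on parser options that differ between Pandas versions.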
A quick check reveals NA values for pm2.5 for the first 24 hours. We will, therefore, need to remove the first 24 rows of data. There are also a few scattered “NA” values later in the dataset; we can mark them with 0 values for now.
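The quick check and the two clean-up steps can be sketched on a tiny, made-up series. The values below are hypothetical and only imitate the pattern described above: a leading run of missing readings followed by an isolated gap. In the real dataset the leading run is 24 rows long.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature series: three leading gaps, then one isolated gap.
pollution = pd.Series(
    [np.nan, np.nan, np.nan, 129.0, np.nan, 159.0], name='pollution'
)

# Quick check: how many readings are missing, and where the first real one is.
n_missing = int(pollution.isna().sum())
first_valid = pollution.first_valid_index()

# Drop the leading run of missing values, then mark the remaining gaps with 0.
trimmed = pollution.loc[first_valid:]
filled = trimmed.fillna(0)
print(n_missing, first_valid)
print(filled)
```

Note that 0 here is only a marker for a missing reading, not a measured pollution level, so a later step may want to treat those rows differently.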
The script below loads the raw dataset and parses the date-time information as the Pandas DataFrame index. The “No” column is dropped and then clearer names are specified for each column. Finally, the NA values are replaced with “0” values and the first 24 hours are removed.
from pandas import read_csv
from datetime import datetime

# load data
def parse(x):
    return datetime.strptime(x, '%Y %m %d %H')

dataset = read_csv('raw.csv', parse_dates=[['year', 'month', 'day', 'hour']], index_col=0, date_parser=parse)
dataset.drop('No', axis=1, inplace=True)
# manually specify column names
dataset.columns = ['pollution', 'dew', 'temp', 'press', 'wnd_dir', 'wnd_spd', 'snow', 'rain']
dataset.index.name = 'date'
# mark all NA values with 0
dataset['pollution'].fillna(0, inplace=True)
# drop the first 24 hours
dataset = dataset[24:]
# summarize first 5 rows
print(dataset.head(5))
# save to file
dataset.to_csv('pollution.csv')
Running the example prints the first 5 rows of the transformed dataset and saves the dataset to “pollution.csv”.
                     pollution  dew  temp   press wnd_dir  wnd_spd  snow  rain
date
2010-01-02 00:00:00      129.0  -16  -4.0  1020.0      SE     1.79     0     0
2010-01-02 01:00:00      148.0  -15  -4.0  1020.0      SE     2.68     0     0
2010-01-02 02:00:00      159.0  -11  -5.0  1021.0      SE     3.57     0     0
2010-01-02 03:00:00      181.0   -7  -5.0  1022.0      SE     5.36     1     0
2010-01-02 04:00:00      138.0   -7  -5.0  1022.0      SE     6.25     2     0
Now that we have the data in an easy-to-use form, we can create a quick plot of each series and see what we have.
The code below loads the new “pollution.csv” file and plots each series as a separate subplot, except wind direction, which is categorical.
from pandas import read_csv
from matplotlib import pyplot
# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# specify columns to plot
groups = [0, 1, 2, 3, 5, 6, 7]
i = 1
# plot each column
pyplot.figure()
for group in groups:
    pyplot.subplot(len(groups), 1, i)
    pyplot.plot(values[:, group])
    pyplot.title(dataset.columns[group], y=0.5, loc='right')
    i += 1
pyplot.show()
Running the example creates a plot with 7 subplots showing the 5 years of data for each variable.

Line Plots of Air Pollution Time Series
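The plotting script hard-codes the column positions and leaves out index 4, the categorical wind direction. As a hedged alternative, the sketch below derives the same list of positions from the column types. The two-row frame is a hypothetical stand-in built from the sample output shown earlier, not the full dataset.

```python
import pandas as pd

# Hypothetical two-row stand-in with the same columns as pollution.csv.
frame = pd.DataFrame({
    'pollution': [129.0, 148.0],
    'dew': [-16, -15],
    'temp': [-4.0, -4.0],
    'press': [1020.0, 1020.0],
    'wnd_dir': ['SE', 'SE'],   # categorical, so it cannot be line-plotted
    'wnd_spd': [1.79, 2.68],
    'snow': [0, 0],
    'rain': [0, 0],
})

# Keep only numeric columns, then look up each one's position in the frame.
numeric_cols = frame.select_dtypes(include='number').columns
groups = [frame.columns.get_loc(name) for name in numeric_cols]
print(groups)
```

Deriving the list this way means the plot keeps working if columns are reordered or another text column is added.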
The example below illustrates what the main matplotlib functions do and how they are used:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rcParams['axes.titlesize'] = 20
mpl.rcParams['xtick.labelsize'] = 16
mpl.rcParams['ytick.labelsize'] = 16
mpl.rcParams['axes.labelsize'] = 16
mpl.rcParams['xtick.major.size'] = 0
mpl.rcParams['ytick.major.size'] = 0

# Top running speeds of a dog, a cat and a cheetah, with the color used to draw each
speed_map = {
    'dog': (48, '#7199cf'),
    'cat': (45, '#4fc4aa'),
    'cheetah': (120, '#e1a7a2')
}

# Name of the figure as a whole (shown as the window title)
fig = plt.figure('Bar chart & Pie chart')

# Add a subplot to the figure; 121 means the first plot in a grid of 1 row and 2 columns
ax = fig.add_subplot(121)
ax.set_title('Running speed - bar chart')

# Position of each element on the x axis
xticks = np.arange(3)

# Width of each bar
bar_width = 0.5

# Animal names (converted to a list so they can be reused as tick labels)
animals = list(speed_map.keys())

# Running speeds
speeds = [x[0] for x in speed_map.values()]

# Matching colors
colors = [x[1] for x in speed_map.values()]

# Draw the bar chart: x is the position of each animal, y is its speed.
# Set the bar width and make the bar edges transparent.
# align='edge' puts the left edge of each bar at its x position, so the tick
# offset used below lands on the center of the bar.
bars = ax.bar(xticks, speeds, width=bar_width, edgecolor='none', align='edge')

# Label of the y axis
ax.set_ylabel('Speed(km/h)')

# Position of each x tick label, set to the center of each bar
ax.set_xticks(xticks + bar_width / 2)

# Text of each tick label
ax.set_xticklabels(animals)

# Range of the x axis
ax.set_xlim([bar_width / 2 - 0.5, 3 - bar_width / 2])

# Range of the y axis
ax.set_ylim([0, 125])

# Give each bar its assigned color
for bar, color in zip(bars, colors):
    bar.set_color(color)

# Add a new plot at position 122
ax = fig.add_subplot(122)
ax.set_title('Running speed - pie chart')

# Build labels that contain both the name and the speed
labels = ['{}\n{} km/h'.format(animal, speed) for animal, speed in zip(animals, speeds)]

# Draw the pie chart with the labels and matching colors
ax.pie(speeds, labels=labels, colors=colors)

plt.show()
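This code introduces something new: an object named ax. Matplotlib has two concepts that come up constantly when plotting. The first is the window that pops up when you draw a plot, which is called a figure. A figure acts as a large canvas, and each figure can hold several subplots, which are called axes. As the name suggests, once you have a horizontal and a vertical axis you have a simple chart. In the code above, the figure is treated as a canvas of one row and two columns, and two subplots are added to it with fig.add_subplot(). The format of the subplot argument is interesting: the first two digits give the number of rows and columns, and the last digit gives the position of the new subplot in that grid. If you prefer to be more explicit, you can pass the three numbers separated by commas instead. The code above produces a figure with the bar chart on the left and the pie chart on the right.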
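The two ways of writing the subplot position can be compared in a short sketch. It uses the non-interactive Agg backend, which is an assumption made here so that the snippet runs without opening a window; the variable names left and right are illustrative only.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: render off-screen so no window is needed
import matplotlib.pyplot as plt

fig = plt.figure()

# Compact form: rows, columns and position packed into one three-digit number.
left = fig.add_subplot(121)

# Explicit form: the same 1 x 2 grid, with the three values separated by commas.
right = fig.add_subplot(1, 2, 2)

# Both calls return an axes object that belongs to the same figure.
print(len(fig.axes))
print(left.get_position().x0 < right.get_position().x0)
```

The three-digit form only works while each of the three values is a single digit, so the comma-separated form is the safer choice for larger grids.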

Original post: https://mp.youkuaiyun.com/postedit/80699358