COVID-19 (2019-nCoV) Epidemic Analysis
祈LHL
Important Notes
Weighting — analysis write-up : completeness : code quality = 3 : 5 : 2
Here, "analysis write-up" means your reasoning on each question during the analysis, plus the interpretation and explanation of the results (be concise; don't write filler).
P.S. Code you wrote yourself beats any ghostwritten code, pretty or not; the only question is whether today's you has improved on yesterday's. Keep it up!
Because the dataset is large, inspect it with head() or tail() so the program doesn't hang for long stretches.
=======================
The data for this project comes from DXY (丁香园). The main goal is to analyze the historical epidemic data in order to better understand the epidemic and how it is developing, and to provide data support for decision-making in the fight against it.
The dataset used in this chapter can be obtained from the comment section of my Bilibili video.
I. Posing the Questions
From three perspectives — the whole of China, your own province/city, and countries abroad — we study the following questions:
(1) How have nationwide cumulative confirmed/suspected/cured/dead counts changed over time?
(2) How have nationwide daily new confirmed/suspected/cured/dead counts changed over time?
(3) How have nationwide new imported cases changed over time?
(4) What is the situation in your own province/city?
(5) What is the epidemic situation abroad?
(6) Based on your findings, what advice would you give individuals and society for fighting the epidemic?
II. Understanding the Data
Original dataset: AreaInfo.csv. Import the relevant packages and read the data:
r_hex = '#dc2624' # red, RGB = 220,38,36
dt_hex = '#2b4750' # dark teal, RGB = 43,71,80
tl_hex = '#45a0a2' # teal, RGB = 69,160,162
r1_hex = '#e87a59' # red, RGB = 232,122,89
tl1_hex = '#7dcaa9' # teal, RGB = 125,202,169
g_hex = '#649E7D' # green, RGB = 100,158,125
o_hex = '#dc8018' # orange, RGB = 220,128,24
tn_hex = '#C89F91' # tan, RGB = 200,159,145
g50_hex = '#6c6d6c' # grey-50, RGB = 108,109,108
bg_hex = '#4f6268' # blue grey, RGB = 79,98,104
g25_hex = '#c7cccf' # grey-25, RGB = 199,204,207
import numpy as np
import pandas as pd
import matplotlib
import re
import matplotlib.pyplot as plt
from matplotlib.pyplot import MultipleLocator
data = pd.read_csv(r'data/AreaInfo.csv')
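The palette above is only defined at this point; if desired, it can be installed as matplotlib's default color cycle so later plots pick it up automatically (an optional sketch, not part of the original analysis):
from cycler import cycler   # cycler ships with matplotlib
# Make the custom palette the default color cycle for all subsequent plots.
plt.rc('axes', prop_cycle=cycler(color=[r_hex, dt_hex, tl_hex, r1_hex, tl1_hex]))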
Inspect and summarize the data to get a general picture of it.
data.head()
(index) | continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 北美洲 | North America | 美国 | United States of America | 美国 | United States of America | 971002 | 2306247 | 0.0 | 640198 | 120351 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 南美洲 | South America | 巴西 | Brazil | 巴西 | Brazil | 973003 | 1106470 | 0.0 | 549386 | 51271 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 欧洲 | Europe | 英国 | United Kingdom | 英国 | United Kingdom | 961007 | 305289 | 0.0 | 539 | 42647 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 欧洲 | Europe | 俄罗斯 | Russia | 俄罗斯 | Russia | 964006 | 592280 | 0.0 | 344416 | 8206 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 南美洲 | South America | 智利 | Chile | 智利 | Chile | 973004 | 246963 | 0.0 | 44946 | 4502 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
III. Data Cleaning
(1) Basic data processing
Data cleaning mainly covers: subset selection, missing-value handling, data format conversion, and outlier handling.
Selecting the domestic epidemic data (the final selection is named china)
- Select the domestic (China) epidemic data.
- For the updateTime column, convert it to a date type and extract year-month-day, then inspect the result. (Hint: dt.date)
- Because the data is updated hourly, each day contains many duplicate rows; deduplicate so that only the latest row of each day is kept (a compact end-to-end sketch follows this list).
Hint: df.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first', inplace=False), where df is the DataFrame holding your selected domestic data.
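The cells below carry these steps out one by one; for reference, the same pipeline can also be done in a single vectorized pass — a minimal sketch assuming only the column names shown in the preview above (china_sketch is an illustrative name):
# Sketch: select the China rows, normalize dates, keep the newest row per city per day.
china_sketch = data.loc[data['countryName'] == '中国'].copy()
china_sketch = china_sketch.dropna(subset=['cityName'])
china_sketch['updateTime'] = pd.to_datetime(china_sketch['updateTime'], errors='coerce').dt.date
china_sketch = (china_sketch
                .sort_values('updateTime', ascending=False)   # newest first
                .drop_duplicates(subset=['cityName', 'updateTime'], keep='first'))
Sorting newest-first makes keep='first' retain the last update of each day.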
Analysis: take the rows whose countryName equals 中国 and call the result CHINA.
CHINA = data.loc[data['countryName'] == '中国'].copy()   # .copy() avoids SettingWithCopyWarning on later writes
CHINA.dropna(subset=['cityName'], how='any', inplace=True)
#CHINA
Analysis: build the list of all Chinese cities.
cities = list(set(CHINA['cityName']))
Analysis: each city's records need to be in time order (newest first) so that the later dedup keeps the latest row of each day. The original per-city loop discarded the result of sort_values, so the whole frame is sorted once instead.
CHINA = CHINA.sort_values(by='updateTime', ascending=False)   # newest first, within every later per-city slice too
#CHINA.loc[CHINA['cityName'] == '秦皇岛'].tail(20)
Analysis: normalize the updateTime column of CHINA to plain dates.
CHINA.updateTime = pd.to_datetime(CHINA.updateTime,format="%Y-%m-%d",errors='coerce').dt.date
#CHINA.loc[data['cityName'] == '秦皇岛'].tail(15)
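A quick sanity check on the conversion (illustrative only, not part of the original run):
# Each entry should now be a plain datetime.date, not a full timestamp.
print(CHINA['updateTime'].head())
print(type(CHINA['updateTime'].iloc[0]))   # expect <class 'datetime.date'>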
CHINA.head()
(index) | continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
136 | 亚洲 | Asia | 中国 | China | 陕西省 | Shaanxi | 610000 | 317 | 1.0 | 307 | 3 | 2020-06-23 | 境外输入 | NaN | 0.0 | 72.0 | 0.0 | 65.0 | 0.0 |
137 | 亚洲 | Asia | 中国 | China | 陕西省 | Shaanxi | 610000 | 317 | 1.0 | 307 | 3 | 2020-06-23 | 西安 | Xi'an | 610100.0 | 120.0 | 0.0 | 117.0 | 3.0 |
138 | 亚洲 | Asia | 中国 | China | 陕西省 | Shaanxi | 610000 | 317 | 1.0 | 307 | 3 | 2020-06-23 | 安康 | Ankang | 610900.0 | 26.0 | 0.0 | 26.0 | 0.0 |
139 | 亚洲 | Asia | 中国 | China | 陕西省 | Shaanxi | 610000 | 317 | 1.0 | 307 | 3 | 2020-06-23 | 汉中 | Hanzhong | 610700.0 | 26.0 | 0.0 | 26.0 | 0.0 |
140 | 亚洲 | Asia | 中国 | China | 陕西省 | Shaanxi | 610000 | 317 | 1.0 | 307 | 3 | 2020-06-23 | 咸阳 | Xianyang | 610400.0 | 17.0 | 0.0 | 17.0 | 0.0 |
Analysis: the daily dedup keeps only the first record; the frame was sorted newest-first above, so the first record of each day is its latest update.
Analysis: concat will be used to merge the per-city frames, so an initial china is created first.
real = CHINA.loc[CHINA['cityName'] == cities[0]].copy()   # seed china with the first city
real.drop_duplicates(subset='updateTime', keep='first', inplace=True)
china = real
Analysis: dedup each remaining city's frame separately; deduplicating on date across the whole frame would keep just one city per date. (Caveat: keying on cityName alone merges identically named entries from different provinces, e.g. 境外输入; adding provinceName to the key would avoid that.)
for city in cities[1:]:
    real_data = CHINA.loc[CHINA['cityName'] == city].copy()
    real_data.drop_duplicates(subset='updateTime', keep='first', inplace=True)
    china = pd.concat([real_data, china], sort=False)
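A quick way to confirm the dedup behaved as intended (an added check, not in the original run):
# Every (cityName, updateTime) pair should now appear exactly once.
assert not china.duplicated(subset=['cityName', 'updateTime']).any()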
Check the frame's info: is anything missing, and are the dtypes correct?
Hint: if you are unsure how to handle missing values, you may simply drop them.
Analysis: not every city reports every day. If a given date counted only the cities that happened to report, cities that still have patients but did not report would be silently ignored and the picture would be distorted. So every city needs a record for every day, including non-reporting days, which is why the data is later completed by interpolation; the method is described under Data Pivoting and Analysis.
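One way to see the fill-forward idea at province level is a pivot (a sketch, under the assumption that cumulative counters never decrease; the notebook itself fills per province further below):
# Sketch: one column per province, one row per calendar day, forward-filled.
confirmed = china.pivot_table(index='updateTime', columns='provinceName',
                              values='province_confirmedCount', aggfunc='max')
confirmed.index = pd.to_datetime(confirmed.index)
confirmed = confirmed.resample('D').max().ffill()   # insert missing days, carry last value forward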
china.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32812 entries, 96106 to 208267
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 continentName 32812 non-null object
1 continentEnglishName 32812 non-null object
2 countryName 32812 non-null object
3 countryEnglishName 32812 non-null object
4 provinceName 32812 non-null object
5 provinceEnglishName 32812 non-null object
6 province_zipCode 32812 non-null int64
7 province_confirmedCount 32812 non-null int64
8 province_suspectedCount 32812 non-null float64
9 province_curedCount 32812 non-null int64
10 province_deadCount 32812 non-null int64
11 updateTime 32812 non-null object
12 cityName 32812 non-null object
13 cityEnglishName 31968 non-null object
14 city_zipCode 32502 non-null float64
15 city_confirmedCount 32812 non-null float64
16 city_suspectedCount 32812 non-null float64
17 city_curedCount 32812 non-null float64
18 city_deadCount 32812 non-null float64
dtypes: float64(6), int64(4), object(9)
memory usage: 5.0+ MB
china.head()
(index) | continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
96106 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 254 | 0.0 | 252 | 2 | 2020-04-02 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
125120 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 254 | 0.0 | 250 | 2 | 2020-03-20 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
128762 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 253 | 0.0 | 250 | 2 | 2020-03-18 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
130607 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 253 | 0.0 | 248 | 2 | 2020-03-17 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
131428 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 252 | 0.0 | 248 | 2 | 2020-03-16 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
Selecting your own province/city data (the final selection is named myhome)
This step can also be deferred until the data is needed.
myhome = china.loc[china['provinceName'] == '广东省']   # mask must come from china, whose index differs from data's
myhome.head()
(index) | continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
205259 | 亚洲 | Asia | 中国 | China | 广东省 | Guangdong | 440000 | 277 | 0.0 | 5 | 0 | 2020-01-29 | 外地来粤人员 | NaN | NaN | 5.0 | 0.0 | 0.0 | 0.0 |
206335 | 亚洲 | Asia | 中国 | China | 广东省 | Guangdong | 440000 | 207 | 0.0 | 4 | 0 | 2020-01-28 | 河源市 | NaN | NaN | 1.0 | 0.0 | 0.0 | 0.0 |
205239 | 亚洲 | Asia | 中国 | China | 广东省 | Guangdong | 440000 | 277 | 0.0 | 5 | 0 | 2020-01-29 | 外地来穗人员 | NaN | NaN | 5.0 | 0.0 | 0.0 | 0.0 |
252 | 亚洲 | Asia | 中国 | China | 广东省 | Guangdong | 440000 | 1634 | 11.0 | 1619 | 8 | 2020-06-23 | 潮州 | Chaozhou | 445100.0 | 6.0 | 0.0 | 6.0 | 0.0 |
2655 | 亚洲 | Asia | 中国 | China | 广东省 | Guangdong | 440000 | 1634 | 11.0 | 1614 | 8 | 2020-06-21 | 潮州 | Chaozhou | 445100.0 | 6.0 | 0.0 | 6.0 | 0.0 |
Selecting the overseas data (the final selection is named world)
This step can also be deferred until the data is needed.
world = data.loc[data['countryName'] != '中国'].copy()
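Note that the overseas slice still carries full timestamps in updateTime; if it is later analyzed by day, the same normalization used for CHINA applies (a sketch added here for completeness):
# Normalize overseas update times to plain dates, mirroring the CHINA step.
world['updateTime'] = pd.to_datetime(world['updateTime'], errors='coerce').dt.date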
world.head()
(index) | continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 北美洲 | North America | 美国 | United States of America | 美国 | United States of America | 971002 | 2306247 | 0.0 | 640198 | 120351 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 南美洲 | South America | 巴西 | Brazil | 巴西 | Brazil | 973003 | 1106470 | 0.0 | 549386 | 51271 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 欧洲 | Europe | 英国 | United Kingdom | 英国 | United Kingdom | 961007 | 305289 | 0.0 | 539 | 42647 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 欧洲 | Europe | 俄罗斯 | Russia | 俄罗斯 | Russia | 964006 | 592280 | 0.0 | 344416 | 8206 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 南美洲 | South America | 智利 | Chile | 智利 | Chile | 973004 | 246963 | 0.0 | 44946 | 4502 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Data Pivoting and Analysis
Analysis: fill the gaps in china by interpolation.
china.head()
(index) | continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
96106 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 254 | 0.0 | 252 | 2 | 2020-04-02 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
125120 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 254 | 0.0 | 250 | 2 | 2020-03-20 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
128762 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 253 | 0.0 | 250 | 2 | 2020-03-18 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
130607 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 253 | 0.0 | 248 | 2 | 2020-03-17 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
131428 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 252 | 0.0 | 248 | 2 | 2020-03-16 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
Analysis: first build the province list and the date list, and initialize a draft frame.
province = list(set(china['provinceName']))   # every province
#p_city = list(set(china[china['provinceName'] == province[0]]['cityName']))   # every city of one province
date_0 = []
for dt in china.loc[china['provinceName'] == province[0]]['updateTime']:
    date_0.append(str(dt))
date_0 = list(set(date_0))   # distinct report dates of the first province
date_0.sort()
start = china.loc[china['provinceName'] == province[0]]['updateTime'].min()
end = china.loc[china['provinceName'] == province[0]]['updateTime'].max()
dates = pd.date_range(start=str(start), end=str(end))   # the full daily calendar
aid_frame = pd.DataFrame({
    'updateTime': dates, 'provinceName': [province[0]]*len(dates)})
aid_frame.updateTime = pd.to_datetime(aid_frame.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
#draft = pd.merge(china.loc[china['provinceName'] == province[1]], aid_frame, on='updateTime', how='outer').sort_values('updateTime')
draft = pd.concat([china.loc[china['provinceName'] == province[0]], aid_frame], join='outer').sort_values('updateTime')
# Cumulative counters never decrease, so forward-fill carries each last report ahead.
draft.province_confirmedCount.fillna(method="ffill", inplace=True)
draft.province_suspectedCount.fillna(method="ffill", inplace=True)
draft.province_curedCount.fillna(method="ffill", inplace=True)
draft.province_deadCount.fillna(method="ffill", inplace=True)
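The same reindex-and-forward-fill is repeated for every remaining province below; for readability, here is the same logic wrapped in a small helper (an illustrative sketch; the name fill_province_dates is made up, and it mirrors the loop rather than replacing it):
def fill_province_dates(frame, prov):
    # Slice one province, extend it onto a complete daily calendar,
    # and forward-fill its cumulative province-level counters.
    sub = frame.loc[frame['provinceName'] == prov]
    days = pd.date_range(sub['updateTime'].min(), sub['updateTime'].max())
    calendar = pd.DataFrame({'updateTime': days.date, 'provinceName': prov})
    out = pd.concat([sub, calendar], join='outer').sort_values('updateTime')
    cols = ['province_confirmedCount', 'province_suspectedCount',
            'province_curedCount', 'province_deadCount']
    out[cols] = out[cols].ffill()
    return out

# Applied to every province, it reproduces the loop's result:
# china_daily = pd.concat(fill_province_dates(china, p) for p in province)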
Analysis: fill the missing dates with the previous day's values. Some provinces gradually stopped reporting once they had no new cases (for many of them, from late April onward), so their series can only be completed up to their last report; anything filled beyond that point becomes less and less faithful.
Analysis: the dates are formatted in the same pass.
for p in range(1, len(province)):
    date_d = []
    for dt in china.loc[china['provinceName'] == province[p]]['updateTime']:
        date_d.append(dt)
    date_d = list(set(date_d))
    date_d.sort()
    start = china.loc[china['provinceName'] == province[p]]['updateTime'].min()
    end = china.loc[china['provinceName'] == province[p]]['updateTime'].max()
    dates = pd.date_range(start=start, end=end)
    aid_frame = pd.DataFrame({
        'updateTime': dates, 'provinceName': [province[p]]*len(dates)})
    aid_frame.updateTime = pd.to_datetime(aid_frame.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
X =