COVID-19 (2019-nCoV) Epidemic Analysis
祈LHL
Important Notes
Weighting — analysis write-up : completeness : code quality = 3 : 5 : 2
Here, "analysis write-up" means your reasoning on each question during the analysis, plus the interpretation and explanation of the results (be concise; don't write filler).
P.S. Code you wrote yourself beats any ghostwritten code, pretty or not; the only question is whether today's you has improved on yesterday's. Keep it up!
Because the dataset is large, inspect it with head() or tail() so the program doesn't hang for long stretches.
=======================
The data for this project comes from DXY (丁香园). The main goal is to analyze the historical epidemic data in order to better understand the epidemic and how it is developing, and to provide data support for decision-making in the fight against it.
The dataset used in this chapter can be obtained from the comment section of my Bilibili video.
I. Posing the Questions
From three perspectives — the whole of China, your own province/city, and countries abroad — we study the following questions:
(1) How have nationwide cumulative confirmed/suspected/cured/dead counts changed over time?
(2) How have nationwide daily new confirmed/suspected/cured/dead counts changed over time?
(3) How have nationwide new imported cases changed over time?
(4) What is the situation in your own province/city?
(5) What is the epidemic situation abroad?
(6) Based on your findings, what advice would you give individuals and society for fighting the epidemic?
II. Understanding the Data
Original dataset: AreaInfo.csv. Import the relevant packages and read the data:
r_hex = '#dc2624' # red, RGB = 220,38,36
dt_hex = '#2b4750' # dark teal, RGB = 43,71,80
tl_hex = '#45a0a2' # teal, RGB = 69,160,162
r1_hex = '#e87a59' # red, RGB = 232,122,89
tl1_hex = '#7dcaa9' # teal, RGB = 125,202,169
g_hex = '#649E7D' # green, RGB = 100,158,125
o_hex = '#dc8018' # orange, RGB = 220,128,24
tn_hex = '#C89F91' # tan, RGB = 200,159,145
g50_hex = '#6c6d6c' # grey-50, RGB = 108,109,108
bg_hex = '#4f6268' # blue grey, RGB = 79,98,104
g25_hex = '#c7cccf' # grey-25, RGB = 199,204,207
import numpy as np
import pandas as pd
import matplotlib
import re
import matplotlib.pyplot as plt
from matplotlib.pyplot import MultipleLocator
data = pd.read_csv(r'data/AreaInfo.csv')
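The palette above is only defined at this point; if desired, it can be installed as matplotlib's default color cycle so later plots pick it up automatically (an optional sketch, not part of the original analysis):
from cycler import cycler   # cycler ships with matplotlib
# Make the custom palette the default color cycle for all subsequent plots.
plt.rc('axes', prop_cycle=cycler(color=[r_hex, dt_hex, tl_hex, r1_hex, tl1_hex]))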
Inspect and summarize the data to get a general picture of it.
data.head()
(index) | continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 北美洲 | North America | 美国 | United States of America | 美国 | United States of America | 971002 | 2306247 | 0.0 | 640198 | 120351 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 南美洲 | South America | 巴西 | Brazil | 巴西 | Brazil | 973003 | 1106470 | 0.0 | 549386 | 51271 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 欧洲 | Europe | 英国 | United Kingdom | 英国 | United Kingdom | 961007 | 305289 | 0.0 | 539 | 42647 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 欧洲 | Europe | 俄罗斯 | Russia | 俄罗斯 | Russia | 964006 | 592280 | 0.0 | 344416 | 8206 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 南美洲 | South America | 智利 | Chile | 智利 | Chile | 973004 | 246963 | 0.0 | 44946 | 4502 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
III. Data Cleaning
(1) Basic data processing
Data cleaning mainly covers: subset selection, missing-value handling, data format conversion, and outlier handling.
Selecting the domestic epidemic data (the final selection is named china)
- Select the domestic (China) epidemic data.
- For the updateTime column, convert it to a date type and extract year-month-day, then inspect the result. (Hint: dt.date)
- Because the data is updated hourly, each day contains many duplicate rows; deduplicate so that only the latest row of each day is kept (a compact end-to-end sketch follows this list).
Hint: df.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first', inplace=False), where df is the DataFrame holding your selected domestic data.
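The cells below carry these steps out one by one; for reference, the same pipeline can also be done in a single vectorized pass — a minimal sketch assuming only the column names shown in the preview above (china_sketch is an illustrative name):
# Sketch: select the China rows, normalize dates, keep the newest row per city per day.
china_sketch = data.loc[data['countryName'] == '中国'].copy()
china_sketch = china_sketch.dropna(subset=['cityName'])
china_sketch['updateTime'] = pd.to_datetime(china_sketch['updateTime'], errors='coerce').dt.date
china_sketch = (china_sketch
                .sort_values('updateTime', ascending=False)   # newest first
                .drop_duplicates(subset=['cityName', 'updateTime'], keep='first'))
Sorting newest-first makes keep='first' retain the last update of each day.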
Analysis: take the rows whose countryName equals 中国 and call the result CHINA.
CHINA = data.loc[data['countryName'] == '中国'].copy()   # .copy() avoids SettingWithCopyWarning on later writes
CHINA.dropna(subset=['cityName'], how='any', inplace=True)
#CHINA
Analysis: build the list of all Chinese cities.
cities = list(set(CHINA['cityName']))
Analysis: each city's records need to be in time order (newest first) so that the later dedup keeps the latest row of each day. The original per-city loop discarded the result of sort_values, so the whole frame is sorted once instead.
CHINA = CHINA.sort_values(by='updateTime', ascending=False)   # newest first, within every later per-city slice too
#CHINA.loc[CHINA['cityName'] == '秦皇岛'].tail(20)
Analysis: normalize the updateTime column of CHINA to plain dates.
CHINA.updateTime = pd.to_datetime(CHINA.updateTime,format="%Y-%m-%d",errors='coerce').dt.date
#CHINA.loc[data['cityName'] == '秦皇岛'].tail(15)
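A quick sanity check on the conversion (illustrative only, not part of the original run):
# Each entry should now be a plain datetime.date, not a full timestamp.
print(CHINA['updateTime'].head())
print(type(CHINA['updateTime'].iloc[0]))   # expect <class 'datetime.date'>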
CHINA.head()
(index) | continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
136 | 亚洲 | Asia | 中国 | China | 陕西省 | Shaanxi | 610000 | 317 | 1.0 | 307 | 3 | 2020-06-23 | 境外输入 | NaN | 0.0 | 72.0 | 0.0 | 65.0 | 0.0 |
137 | 亚洲 | Asia | 中国 | China | 陕西省 | Shaanxi | 610000 | 317 | 1.0 | 307 | 3 | 2020-06-23 | 西安 | Xi'an | 610100.0 | 120.0 | 0.0 | 117.0 | 3.0 |
138 | 亚洲 | Asia | 中国 | China | 陕西省 | Shaanxi | 610000 | 317 | 1.0 | 307 | 3 | 2020-06-23 | 安康 | Ankang | 610900.0 | 26.0 | 0.0 | 26.0 | 0.0 |
139 | 亚洲 | Asia | 中国 | China | 陕西省 | Shaanxi | 610000 | 317 | 1.0 | 307 | 3 | 2020-06-23 | 汉中 | Hanzhong | 610700.0 | 26.0 | 0.0 | 26.0 | 0.0 |
140 | 亚洲 | Asia | 中国 | China | 陕西省 | Shaanxi | 610000 | 317 | 1.0 | 307 | 3 | 2020-06-23 | 咸阳 | Xianyang | 610400.0 | 17.0 | 0.0 | 17.0 | 0.0 |
Analysis: the daily dedup keeps only the first record; the frame was sorted newest-first above, so the first record of each day is its latest update.
Analysis: concat will be used to merge the per-city frames, so an initial china is created first.
real = CHINA.loc[CHINA['cityName'] == cities[0]].copy()   # seed china with the first city
real.drop_duplicates(subset='updateTime', keep='first', inplace=True)
china = real
Analysis: dedup each remaining city's frame separately; deduplicating on date across the whole frame would keep just one city per date. (Caveat: keying on cityName alone merges identically named entries from different provinces, e.g. 境外输入; adding provinceName to the key would avoid that.)
for city in cities[1:]:
    real_data = CHINA.loc[CHINA['cityName'] == city].copy()
    real_data.drop_duplicates(subset='updateTime', keep='first', inplace=True)
    china = pd.concat([real_data, china], sort=False)
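A quick way to confirm the dedup behaved as intended (an added check, not in the original run):
# Every (cityName, updateTime) pair should now appear exactly once.
assert not china.duplicated(subset=['cityName', 'updateTime']).any()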
Check the frame's info: is anything missing, and are the dtypes correct?
Hint: if you are unsure how to handle missing values, you may simply drop them.
Analysis: not every city reports every day. If a given date counted only the cities that happened to report, cities that still have patients but did not report would be silently ignored and the picture would be distorted. So every city needs a record for every day, including non-reporting days, which is why the data is later completed by interpolation; the method is described under Data Pivoting and Analysis.
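One way to see the fill-forward idea at province level is a pivot (a sketch, under the assumption that cumulative counters never decrease; the notebook itself fills per province further below):
# Sketch: one column per province, one row per calendar day, forward-filled.
confirmed = china.pivot_table(index='updateTime', columns='provinceName',
                              values='province_confirmedCount', aggfunc='max')
confirmed.index = pd.to_datetime(confirmed.index)
confirmed = confirmed.resample('D').max().ffill()   # insert missing days, carry last value forward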
china.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32812 entries, 96106 to 208267
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 continentName 32812 non-null object
1 continentEnglishName 32812 non-null object
2 countryName 32812 non-null object
3 countryEnglishName 32812 non-null object
4 provinceName 32812 non-null object
5 provinceEnglishName 32812 non-null object
6 province_zipCode 32812 non-null int64
7 province_confirmedCount 32812 non-null int64
8 province_suspectedCount 32812 non-null float64
9 province_curedCount 32812 non-null int64
10 province_deadCount 32812 non-null int64
11 updateTime 32812 non-null object
12 cityName 32812 non-null object
13 cityEnglishName 31968 non-null object
14 city_zipCode 32502 non-null float64
15 city_confirmedCount 32812 non-null float64
16 city_suspectedCount 32812 non-null float64
17 city_curedCount 32812 non-null float64
18 city_deadCount 32812 non-null float64
dtypes: float64(6), int64(4), object(9)
memory usage: 5.0+ MB
china.head()
(index) | continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
96106 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 254 | 0.0 | 252 | 2 | 2020-04-02 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
125120 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 254 | 0.0 | 250 | 2 | 2020-03-20 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
128762 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 253 | 0.0 | 250 | 2 | 2020-03-18 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
130607 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 253 | 0.0 | 248 | 2 | 2020-03-17 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
131428 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 252 | 0.0 | 248 | 2 | 2020-03-16 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
Selecting your own province/city data (the final selection is named myhome)
This step can also be deferred until the data is needed.
myhome = china.loc[china['provinceName'] == '广东省']   # mask must come from china, whose index differs from data's
myhome.head()
(index) | continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
205259 | 亚洲 | Asia | 中国 | China | 广东省 | Guangdong | 440000 | 277 | 0.0 | 5 | 0 | 2020-01-29 | 外地来粤人员 | NaN | NaN | 5.0 | 0.0 | 0.0 | 0.0 |
206335 | 亚洲 | Asia | 中国 | China | 广东省 | Guangdong | 440000 | 207 | 0.0 | 4 | 0 | 2020-01-28 | 河源市 | NaN | NaN | 1.0 | 0.0 | 0.0 | 0.0 |
205239 | 亚洲 | Asia | 中国 | China | 广东省 | Guangdong | 440000 | 277 | 0.0 | 5 | 0 | 2020-01-29 | 外地来穗人员 | NaN | NaN | 5.0 | 0.0 | 0.0 | 0.0 |
252 | 亚洲 | Asia | 中国 | China | 广东省 | Guangdong | 440000 | 1634 | 11.0 | 1619 | 8 | 2020-06-23 | 潮州 | Chaozhou | 445100.0 | 6.0 | 0.0 | 6.0 | 0.0 |
2655 | 亚洲 | Asia | 中国 | China | 广东省 | Guangdong | 440000 | 1634 | 11.0 | 1614 | 8 | 2020-06-21 | 潮州 | Chaozhou | 445100.0 | 6.0 | 0.0 | 6.0 | 0.0 |
Selecting the overseas data (the final selection is named world)
This step can also be deferred until the data is needed.
world = data.loc[data['countryName'] != '中国'].copy()
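Note that the overseas slice still carries full timestamps in updateTime; if it is later analyzed by day, the same normalization used for CHINA applies (a sketch added here for completeness):
# Normalize overseas update times to plain dates, mirroring the CHINA step.
world['updateTime'] = pd.to_datetime(world['updateTime'], errors='coerce').dt.date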
world.head()
(index) | continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 北美洲 | North America | 美国 | United States of America | 美国 | United States of America | 971002 | 2306247 | 0.0 | 640198 | 120351 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 南美洲 | South America | 巴西 | Brazil | 巴西 | Brazil | 973003 | 1106470 | 0.0 | 549386 | 51271 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 欧洲 | Europe | 英国 | United Kingdom | 英国 | United Kingdom | 961007 | 305289 | 0.0 | 539 | 42647 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 欧洲 | Europe | 俄罗斯 | Russia | 俄罗斯 | Russia | 964006 | 592280 | 0.0 | 344416 | 8206 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 南美洲 | South America | 智利 | Chile | 智利 | Chile | 973004 | 246963 | 0.0 | 44946 | 4502 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Data Pivoting and Analysis
Analysis: fill the gaps in china by interpolation.
china.head()
(index) | continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
96106 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 254 | 0.0 | 252 | 2 | 2020-04-02 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
125120 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 254 | 0.0 | 250 | 2 | 2020-03-20 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
128762 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 253 | 0.0 | 250 | 2 | 2020-03-18 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
130607 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 253 | 0.0 | 248 | 2 | 2020-03-17 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
131428 | 亚洲 | Asia | 中国 | China | 广西壮族自治区 | Guangxi | 450000 | 252 | 0.0 | 248 | 2 | 2020-03-16 | 贵港 | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
Analysis: first build the province list and the date list, and initialize a draft frame.
province = list(set(china['provinceName']))   # every province
#p_city = list(set(china[china['provinceName'] == province[0]]['cityName']))   # every city of one province
date_0 = []
for dt in china.loc[china['provinceName'] == province[0]]['updateTime']:
    date_0.append(str(dt))
date_0 = list(set(date_0))   # distinct report dates of the first province
date_0.sort()
start = china.loc[china['provinceName'] == province[0]]['updateTime'].min()
end = china.loc[china['provinceName'] == province[0]]['updateTime'].max()
dates = pd.date_range(start=str(start), end=str(end))   # the full daily calendar
aid_frame = pd.DataFrame({
    'updateTime': dates, 'provinceName': [province[0]]*len(dates)})
aid_frame.updateTime = pd.to_datetime(aid_frame.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
#draft = pd.merge(china.loc[china['provinceName'] == province[1]], aid_frame, on='updateTime', how='outer').sort_values('updateTime')
draft = pd.concat([china.loc[china['provinceName'] == province[0]], aid_frame], join='outer').sort_values('updateTime')
# Cumulative counters never decrease, so forward-fill carries each last report ahead.
draft.province_confirmedCount.fillna(method="ffill", inplace=True)
draft.province_suspectedCount.fillna(method="ffill", inplace=True)
draft.province_curedCount.fillna(method="ffill", inplace=True)
draft.province_deadCount.fillna(method="ffill", inplace=True)
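The same reindex-and-forward-fill is repeated for every remaining province below; for readability, here is the same logic wrapped in a small helper (an illustrative sketch; the name fill_province_dates is made up, and it mirrors the loop rather than replacing it):
def fill_province_dates(frame, prov):
    # Slice one province, extend it onto a complete daily calendar,
    # and forward-fill its cumulative province-level counters.
    sub = frame.loc[frame['provinceName'] == prov]
    days = pd.date_range(sub['updateTime'].min(), sub['updateTime'].max())
    calendar = pd.DataFrame({'updateTime': days.date, 'provinceName': prov})
    out = pd.concat([sub, calendar], join='outer').sort_values('updateTime')
    cols = ['province_confirmedCount', 'province_suspectedCount',
            'province_curedCount', 'province_deadCount']
    out[cols] = out[cols].ffill()
    return out

# Applied to every province, it reproduces the loop's result:
# china_daily = pd.concat(fill_province_dates(china, p) for p in province)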
Analysis: fill the missing dates with the previous day's values. Some provinces gradually stopped reporting once they had no new cases (for many of them, from late April onward), so their series can only be completed up to their last report; anything filled beyond that point becomes less and less faithful.
Analysis: the dates are formatted in the same pass.
for p in range(1, len(province)):
    date_d = []
    for dt in china.loc[china['provinceName'] == province[p]]['updateTime']:
        date_d.append(dt)
    date_d = list(set(date_d))
    date_d.sort()
    start = china.loc[china['provinceName'] == province[p]]['updateTime'].min()
    end = china.loc[china['provinceName'] == province[p]]['updateTime'].max()
    dates = pd.date_range(start=start, end=end)
    aid_frame = pd.DataFrame({
        'updateTime': dates, 'provinceName': [province[p]]*len(dates)})
    aid_frame.updateTime = pd.to_datetime(aid_frame.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
X =