Analyzing the Novel Coronavirus (COVID-19/2019-nCoV) Outbreak with Python


祈LHL

Important Notes

Grading weights: analysis write-up : completeness : code quality = 3 : 5 : 2

Here "analysis write-up" means the reasoning you applied to each question during the analysis, and the explanation and interpretation of the results (be concise; do not write for the sake of writing).

P.S.: Code you wrote yourself beats anything ghostwritten, elegant or not; the only question is whether today's you has improved on yesterday's. Keep at it!

Because the dataset is large, inspect it with head() or tail() to avoid leaving the program unresponsive for a long time.

=======================

The data for this project come from DXY (丁香园). The main goal is to analyze the historical epidemic data in order to better understand the outbreak and its trajectory, and to provide data support for decisions in the fight against it.

The dataset used in this chapter is available in the comment section of my Bilibili video.

1. Framing the Questions

We study the following questions on three fronts: nationwide, your own province/city, and abroad.

(1) How do the nationwide cumulative confirmed/suspected/cured/death counts trend over time?

(2) How do the nationwide daily new confirmed/suspected/cured/death counts trend over time?

(3) How do the nationwide daily new imported cases trend over time?

(4) What is the situation in your own province/city?

(5) What is the state of the epidemic abroad?

(6) Given your analysis results, what would you recommend to individuals and society for fighting the epidemic?

2. Understanding the Data

Raw dataset: AreaInfo.csv. Import the relevant packages and read the data:

r_hex = '#dc2624'     # red,       RGB = 220,38,36
dt_hex = '#2b4750'    # dark teal, RGB = 43,71,80
tl_hex = '#45a0a2'    # teal,      RGB = 69,160,162
r1_hex = '#e87a59'    # red,       RGB = 232,122,89
tl1_hex = '#7dcaa9'   # teal,      RGB = 125,202,169
g_hex = '#649E7D'     # green,     RGB = 100,158,125
o_hex = '#dc8018'     # orange,    RGB = 220,128,24
tn_hex = '#C89F91'    # tan,       RGB = 200,159,145
g50_hex = '#6c6d6c'   # grey-50,   RGB = 108,109,108
bg_hex = '#4f6268'    # blue grey, RGB = 79,98,104
g25_hex = '#c7cccf'   # grey-25,   RGB = 199,204,207
import numpy as np
import pandas as pd
import re
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.pyplot import MultipleLocator


data = pd.read_csv(r'data/AreaInfo.csv')

Inspect and summarize the data to get a general sense of it:

data.head()
continentName continentEnglishName countryName countryEnglishName provinceName provinceEnglishName province_zipCode province_confirmedCount province_suspectedCount province_curedCount province_deadCount updateTime cityName cityEnglishName city_zipCode city_confirmedCount city_suspectedCount city_curedCount city_deadCount
0 北美洲 North America 美国 United States of America 美国 United States of America 971002 2306247 0.0 640198 120351 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN
1 南美洲 South America 巴西 Brazil 巴西 Brazil 973003 1106470 0.0 549386 51271 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN
2 欧洲 Europe 英国 United Kingdom 英国 United Kingdom 961007 305289 0.0 539 42647 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN
3 欧洲 Europe 俄罗斯 Russia 俄罗斯 Russia 964006 592280 0.0 344416 8206 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN
4 南美洲 South America 智利 Chile 智利 Chile 973004 246963 0.0 44946 4502 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN

3. Data Cleaning

(1) Basic processing

Data cleaning mainly involves: selecting subsets, handling missing data, converting data formats, and handling outliers.

Selecting the domestic data (name the final result china)
  1. Select the records for China.

  2. Convert the updateTime column to a date type, extract the year-month-day, and check the result. (Hint: dt.date)

  3. The data are updated hourly, so each day contains many duplicate records; deduplicate and keep only the latest record of each day.

Hint: df.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first', inplace=False)

where df is the DataFrame of domestic data you selected.

Analysis: select the rows whose countryName is 中国 to form CHINA.

CHINA = data.loc[data['countryName'] == '中国'].copy()  # .copy() so later edits do not trigger SettingWithCopyWarning
CHINA.dropna(subset=['cityName'], how='any', inplace=True)
#CHINA

Analysis: extract the list of all Chinese cities.

cities = list(set(CHINA['cityName']))

Analysis: sort the records by updateTime, newest first, so that within each day the first record is the most recent one. (The sorted result must be assigned back, otherwise the sort is lost.)

CHINA = CHINA.sort_values(by='updateTime', ascending=False)

Analysis: drop the rows whose cityName is missing.

CHINA.dropna(subset=['cityName'], inplace=True)
#CHINA.loc[CHINA['cityName'] == '秦皇岛'].tail(20)

Analysis: normalize the updateTime column of CHINA to plain dates. The raw values carry a time component (e.g. "2020-06-23 10:01:45"), so we let to_datetime infer the format rather than forcing "%Y-%m-%d", which would not match.

CHINA.updateTime = pd.to_datetime(CHINA.updateTime, errors='coerce').dt.date
#CHINA.loc[CHINA['cityName'] == '秦皇岛'].tail(15)
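As a minimal, self-contained illustration of the timestamp-to-date conversion above (the sample strings are made up, not taken from the dataset):

```python
import datetime

import pandas as pd

# toy timestamps shaped like the raw updateTime column
ts = pd.Series(["2020-06-23 10:01:45", "2020-06-23 18:30:02", "not a date"])

# parse to datetime (unparseable entries become NaT), then keep only the date part
dates = pd.to_datetime(ts, errors="coerce").dt.date

print(dates[0])           # 2020-06-23
print(pd.isna(dates[2]))  # True
```

With errors="coerce", bad rows become NaT instead of raising, which is why the real column should be checked for NaT afterwards.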
CHINA.head()
continentName continentEnglishName countryName countryEnglishName provinceName provinceEnglishName province_zipCode province_confirmedCount province_suspectedCount province_curedCount province_deadCount updateTime cityName cityEnglishName city_zipCode city_confirmedCount city_suspectedCount city_curedCount city_deadCount
136 亚洲 Asia 中国 China 陕西省 Shaanxi 610000 317 1.0 307 3 2020-06-23 境外输入 NaN 0.0 72.0 0.0 65.0 0.0
137 亚洲 Asia 中国 China 陕西省 Shaanxi 610000 317 1.0 307 3 2020-06-23 西安 Xi'an 610100.0 120.0 0.0 117.0 3.0
138 亚洲 Asia 中国 China 陕西省 Shaanxi 610000 317 1.0 307 3 2020-06-23 安康 Ankang 610900.0 26.0 0.0 26.0 0.0
139 亚洲 Asia 中国 China 陕西省 Shaanxi 610000 317 1.0 307 3 2020-06-23 汉中 Hanzhong 610700.0 26.0 0.0 26.0 0.0
140 亚洲 Asia 中国 China 陕西省 Shaanxi 610000 317 1.0 307 3 2020-06-23 咸阳 Xianyang 610400.0 17.0 0.0 17.0 0.0

Analysis: when deduplicating each day's records, keep only the first one; the data are already sorted by time, so the first record is the latest of that day.
Analysis: merging the DataFrames later requires concat, so initialize china with the first city.

real = CHINA.loc[CHINA['cityName'] == cities[0]].copy()
real.drop_duplicates(subset='updateTime', keep='first', inplace=True)
china = real

Analysis: deduplicate each city's DataFrame separately; deduplicating the whole frame at once would keep only one city's record for any given date.

for city in cities[1:]:
    real_data = CHINA.loc[CHINA['cityName'] == city].copy()
    real_data.drop_duplicates(subset='updateTime', keep='first', inplace=True)
    china = pd.concat([real_data, china], sort=False)
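For reference, the per-city loop above can also be collapsed into one deduplication over (cityName, updateTime). A sketch on made-up toy rows (only the column names match the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "cityName": ["西安", "西安", "西安", "安康"],
    "updateTime": pd.to_datetime([
        "2020-06-23 10:01", "2020-06-23 08:00",
        "2020-06-22 09:00", "2020-06-23 07:00"]),
    "city_confirmedCount": [120, 119, 118, 26],
})

# newest first, reduce timestamps to dates, then keep the first
# (i.e. latest) record per city per day
dedup = (toy.sort_values("updateTime", ascending=False)
            .assign(updateTime=lambda d: d["updateTime"].dt.date)
            .drop_duplicates(subset=["cityName", "updateTime"], keep="first"))

print(len(dedup))  # 3: one row per city per day
```

Whether the loop or the single call reads better is a matter of taste; both keep the latest record per city per day, but the single call avoids concatenating per-city pieces.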

Check the data: is anything missing, and are the dtypes correct?

Hint: if you do not know how to handle the missing values, you may simply drop them.

Analysis: not every city reports every day. If a day's statistics covered only the cities that reported, cities that have patients but did not report would be ignored and the data would be distorted. We therefore complete the daily records for every city, so that even non-reporting cities have a row each day; the gaps are filled by carrying values forward. The method is detailed under "Pivoting and Analysis" below.
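The completion described above, putting each region on a full daily calendar and carrying the last reported numbers forward, can be sketched on toy data as follows (values are invented):

```python
import pandas as pd

# a city that reported only on June 1 and June 4
reported = pd.DataFrame(
    {"confirmed": [10, 15]},
    index=pd.to_datetime(["2020-06-01", "2020-06-04"]),
)

# reindex onto every day in the range, then forward-fill the gaps
full_days = pd.date_range("2020-06-01", "2020-06-04")
completed = reported.reindex(full_days).ffill()

print(completed["confirmed"].tolist())  # [10.0, 10.0, 10.0, 15.0]
```

Forward-filling assumes the counts are cumulative, so a silent day means "no change"; it would be the wrong fill for daily-new counts.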

china.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32812 entries, 96106 to 208267
Data columns (total 19 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   continentName            32812 non-null  object 
 1   continentEnglishName     32812 non-null  object 
 2   countryName              32812 non-null  object 
 3   countryEnglishName       32812 non-null  object 
 4   provinceName             32812 non-null  object 
 5   provinceEnglishName      32812 non-null  object 
 6   province_zipCode         32812 non-null  int64  
 7   province_confirmedCount  32812 non-null  int64  
 8   province_suspectedCount  32812 non-null  float64
 9   province_curedCount      32812 non-null  int64  
 10  province_deadCount       32812 non-null  int64  
 11  updateTime               32812 non-null  object 
 12  cityName                 32812 non-null  object 
 13  cityEnglishName          31968 non-null  object 
 14  city_zipCode             32502 non-null  float64
 15  city_confirmedCount      32812 non-null  float64
 16  city_suspectedCount      32812 non-null  float64
 17  city_curedCount          32812 non-null  float64
 18  city_deadCount           32812 non-null  float64
dtypes: float64(6), int64(4), object(9)
memory usage: 5.0+ MB
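Besides info(), a per-column count of missing values makes the gaps explicit. A minimal sketch on a toy frame (not the real china):

```python
import pandas as pd

toy = pd.DataFrame({
    "cityEnglishName": ["Xi'an", None, "Ankang"],
    "city_zipCode": [610100.0, None, 610900.0],
    "city_confirmedCount": [120.0, 72.0, 26.0],
})

missing = toy.isna().sum()  # NaN count per column
print(missing["cityEnglishName"])      # 1
print(missing["city_confirmedCount"])  # 0
```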
china.head()
continentName continentEnglishName countryName countryEnglishName provinceName provinceEnglishName province_zipCode province_confirmedCount province_suspectedCount province_curedCount province_deadCount updateTime cityName cityEnglishName city_zipCode city_confirmedCount city_suspectedCount city_curedCount city_deadCount
96106 亚洲 Asia 中国 China 广西壮族自治区 Guangxi 450000 254 0.0 252 2 2020-04-02 贵港 Guigang 450800.0 8.0 0.0 8.0 0.0
125120 亚洲 Asia 中国 China 广西壮族自治区 Guangxi 450000 254 0.0 250 2 2020-03-20 贵港 Guigang 450800.0 8.0 0.0 8.0 0.0
128762 亚洲 Asia 中国 China 广西壮族自治区 Guangxi 450000 253 0.0 250 2 2020-03-18 贵港 Guigang 450800.0 8.0 0.0 8.0 0.0
130607 亚洲 Asia 中国 China 广西壮族自治区 Guangxi 450000 253 0.0 248 2 2020-03-17 贵港 Guigang 450800.0 8.0 0.0 8.0 0.0
131428 亚洲 Asia 中国 China 广西壮族自治区 Guangxi 450000 252 0.0 248 2 2020-03-16 贵港 Guigang 450800.0 8.0 0.0 8.0 0.0
Selecting your own province's data (final result named myhome)

This step can also be deferred until the data are needed.

myhome = china.loc[china['provinceName'] == '广东省']
myhome.head()
continentName continentEnglishName countryName countryEnglishName provinceName provinceEnglishName province_zipCode province_confirmedCount province_suspectedCount province_curedCount province_deadCount updateTime cityName cityEnglishName city_zipCode city_confirmedCount city_suspectedCount city_curedCount city_deadCount
205259 亚洲 Asia 中国 China 广东省 Guangdong 440000 277 0.0 5 0 2020-01-29 外地来粤人员 NaN NaN 5.0 0.0 0.0 0.0
206335 亚洲 Asia 中国 China 广东省 Guangdong 440000 207 0.0 4 0 2020-01-28 河源市 NaN NaN 1.0 0.0 0.0 0.0
205239 亚洲 Asia 中国 China 广东省 Guangdong 440000 277 0.0 5 0 2020-01-29 外地来穗人员 NaN NaN 5.0 0.0 0.0 0.0
252 亚洲 Asia 中国 China 广东省 Guangdong 440000 1634 11.0 1619 8 2020-06-23 潮州 Chaozhou 445100.0 6.0 0.0 6.0 0.0
2655 亚洲 Asia 中国 China 广东省 Guangdong 440000 1634 11.0 1614 8 2020-06-21 潮州 Chaozhou 445100.0 6.0 0.0 6.0 0.0
Selecting the foreign data (final result named world)

This step can also be deferred until the data are needed.

world = data.loc[data['countryName'] != '中国']
world.head()
continentName continentEnglishName countryName countryEnglishName provinceName provinceEnglishName province_zipCode province_confirmedCount province_suspectedCount province_curedCount province_deadCount updateTime cityName cityEnglishName city_zipCode city_confirmedCount city_suspectedCount city_curedCount city_deadCount
0 北美洲 North America 美国 United States of America 美国 United States of America 971002 2306247 0.0 640198 120351 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN
1 南美洲 South America 巴西 Brazil 巴西 Brazil 973003 1106470 0.0 549386 51271 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN
2 欧洲 Europe 英国 United Kingdom 英国 United Kingdom 961007 305289 0.0 539 42647 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN
3 欧洲 Europe 俄罗斯 Russia 俄罗斯 Russia 964006 592280 0.0 344416 8206 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN
4 南美洲 South America 智利 Chile 智利 Chile 973004 246963 0.0 44946 4502 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN

Pivoting and Analysis

Analysis: complete china by filling in the missing daily records.

china.head()
continentName continentEnglishName countryName countryEnglishName provinceName provinceEnglishName province_zipCode province_confirmedCount province_suspectedCount province_curedCount province_deadCount updateTime cityName cityEnglishName city_zipCode city_confirmedCount city_suspectedCount city_curedCount city_deadCount
96106 亚洲 Asia 中国 China 广西壮族自治区 Guangxi 450000 254 0.0 252 2 2020-04-02 贵港 Guigang 450800.0 8.0 0.0 8.0 0.0
125120 亚洲 Asia 中国 China 广西壮族自治区 Guangxi 450000 254 0.0 250 2 2020-03-20 贵港 Guigang 450800.0 8.0 0.0 8.0 0.0
128762 亚洲 Asia 中国 China 广西壮族自治区 Guangxi 450000 253 0.0 250 2 2020-03-18 贵港 Guigang 450800.0 8.0 0.0 8.0 0.0
130607 亚洲 Asia 中国 China 广西壮族自治区 Guangxi 450000 253 0.0 248 2 2020-03-17 贵港 Guigang 450800.0 8.0 0.0 8.0 0.0
131428 亚洲 Asia 中国 China 广西壮族自治区 Guangxi 450000 252 0.0 248 2 2020-03-16 贵港 Guigang 450800.0 8.0 0.0 8.0 0.0

Analysis: first build the province list and the full date range, and initialize a draft with the first province.

province = list(set(china['provinceName']))  # all provinces
#p_city = list(set(china[china['provinceName'] == province[0]]['cityName']))  # cities of one province
start = china.loc[china['provinceName'] == province[0]]['updateTime'].min()
end = china.loc[china['provinceName'] == province[0]]['updateTime'].max()
dates = pd.date_range(start=str(start), end=str(end))
aid_frame = pd.DataFrame({
    'updateTime': dates, 'provinceName': [province[0]]*len(dates)})
aid_frame.updateTime = pd.to_datetime(aid_frame.updateTime, errors='coerce').dt.date
#draft = pd.merge(china.loc[china['provinceName'] == province[0]], aid_frame, on='updateTime', how='outer').sort_values('updateTime')
draft = pd.concat([china.loc[china['provinceName'] == province[0]], aid_frame], join='outer').sort_values('updateTime')
draft.province_confirmedCount.fillna(method="ffill", inplace=True)
draft.province_suspectedCount.fillna(method="ffill", inplace=True)
draft.province_curedCount.fillna(method="ffill", inplace=True)
draft.province_deadCount.fillna(method="ffill", inplace=True)

Analysis: fill in the missing days by carrying the previous day's values forward. Some provinces stopped reporting (no new patients) from late April onward, so their data can only be completed up to late April; beyond that the filled values gradually lose fidelity.

Analysis: normalize the date format at the same time.

for p in range(1, len(province)):
    start = china.loc[china['provinceName'] == province[p]]['updateTime'].min()
    end = china.loc[china['provinceName'] == province[p]]['updateTime'].max()
    dates = pd.date_range(start=str(start), end=str(end))
    aid_frame = pd.DataFrame({
        'updateTime': dates, 'provinceName': [province[p]]*len(dates)})
    aid_frame.updateTime = pd.to_datetime(aid_frame.updateTime, errors='coerce').dt.date
    # same completion as for the first province: align on the full date range,
    # then carry the previous day's counts forward
    X = pd.concat([china.loc[china['provinceName'] == province[p]], aid_frame], join='outer').sort_values('updateTime')
    X.province_confirmedCount.fillna(method="ffill", inplace=True)
    X.province_suspectedCount.fillna(method="ffill", inplace=True)
    X.province_curedCount.fillna(method="ffill", inplace=True)
    X.province_deadCount.fillna(method="ffill", inplace=True)
    draft = pd.concat([draft, X], sort=False)
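For reference, the per-province completion above can be written more compactly with groupby. A sketch on toy data; fill_daily is a hypothetical helper, not part of the original code, and it assumes updateTime is already datetime-typed:

```python
import pandas as pd

def fill_daily(g):
    # put one province on its full daily calendar and carry counts forward
    g = g.set_index("updateTime").sort_index()
    return g.reindex(pd.date_range(g.index.min(), g.index.max())).ffill()

toy = pd.DataFrame({
    "provinceName": ["A", "A", "B"],
    "updateTime": pd.to_datetime(["2020-06-01", "2020-06-03", "2020-06-02"]),
    "province_confirmedCount": [5, 9, 2],
})

completed = toy.groupby("provinceName", group_keys=False).apply(fill_daily)
print(len(completed))  # province A covers 3 days, B covers 1, so 4 rows
```

The explicit loop in the text makes each step visible; the groupby form avoids repeating the reindex-and-fill boilerplate per province.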