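Foreword: "advanced" here just means a few more lines of code and flashier charts. I wrote this post anyway, hoping it saves a trip to the help documentation.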
Dataset overview:
'VIN (1-10)': The 1st 10 characters of each vehicle's Vehicle Identification Number (VIN).
'County': The county in which the registered owner resides.
'City': The city in which the registered owner resides.
'State': The state in which the registered owner resides.
'Postal Code': The 5 digit zip code in which the registered owner resides.
'Model Year': The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).
'Make': The manufacturer of the vehicle, determined by decoding the Vehicle Identification Number (VIN).
'Model': The model of the vehicle, determined by decoding the Vehicle Identification Number (VIN).
'Electric Vehicle Type': This distinguishes the vehicle as all electric or a plug-in hybrid.
'Clean Alternative Fuel Vehicle (CAFV) Eligibility': This categorizes vehicles as Clean Alternative Fuel Vehicles (CAFVs) based on the fuel requirement and electric-only range requirement in House Bill 2042 as passed in the 2019 legislative session.
'Electric Range': Describes how far a vehicle can travel purely on its electric charge.
'Base MSRP': This is the lowest Manufacturer's Suggested Retail Price (MSRP) for any trim level of the model in question.
'Legislative District': The specific section of Washington State that the vehicle's owner resides in, as represented in the state legislature.
'DOL Vehicle ID': Unique number assigned to each vehicle by Department of Licensing for identification purposes.
'Vehicle Location': The center of the ZIP Code for the registered vehicle.
'Electric Utility': This is the electric power retail service territories serving the address of the registered vehicle.
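Preparation:
Define two small helper functions for common counting and aggregation:
(Once they are defined, autocompletion makes them quicker to type.)
import pandas as pd  # used throughout the post
# Shorthand for value_counts on a Series
def vcounts(a):
    return a.value_counts()
# Mean of column c within each group of column b
def group_mean(a, b, c):
    return a.groupby(b)[c].mean()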
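To show what the two helpers return, here is a minimal sketch on a toy DataFrame. The rows are invented for illustration and are not taken from the dataset; only the column names match it.

```python
import pandas as pd

def vcounts(a):
    return a.value_counts()

def group_mean(a, b, c):
    return a.groupby(b)[c].mean()

# Toy data, invented purely for illustration
toy = pd.DataFrame({
    "Make": ["TESLA", "TESLA", "NISSAN", "TESLA", "NISSAN"],
    "Electric Range": [200, 300, 80, 250, 120],
})

make_counts = vcounts(toy["Make"])                      # Series indexed by make
mean_range = group_mean(toy, "Make", "Electric Range")  # Series indexed by make
print(make_counts)
print(mean_range)
```

Both helpers return a Series indexed by the category, so the results can be sliced, sorted, or passed to a plotting call directly.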
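I. Data cleaning
1. Simplify the information
(1) Drop unneeded information
# df.State.value_counts() gives WA 181060; vehicles registered in other states are set aside for now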
df=df[df['State']=='WA']
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 181060 entries, 0 to 181457
Data columns (total 17 columns):
 #   Column                                             Non-Null Count   Dtype
---  ------                                             --------------   -----
 0   VIN (1-10)                                         181060 non-null  object
 1   County                                             181060 non-null  object
 2   City                                               181060 non-null  object
 3   State                                              181060 non-null  object
 4   Postal Code                                        181060 non-null  float64
 5   Model Year                                         181060 non-null  int64
 6   Make                                               181060 non-null  object
 7   Model                                              181060 non-null  object
 8   Electric Vehicle Type                              181060 non-null  object
 9   Clean Alternative Fuel Vehicle (CAFV) Eligibility  181060 non-null  object
10   Electric Range                                     181060 non-null  int64
11   Base MSRP                                          181060 non-null  int64
12   Legislative District                               181060 non-null  float64
13   DOL Vehicle ID                                     181060 non-null  int64
14   Vehicle Location                                   181055 non-null  object
15   Electric Utility                                   181060 non-null  object
16   2020 Census Tract                                  181060 non-null  float64
dtypes: float64(3), int64(4), object(10)
memory usage: 24.9+ MB
nouse=['VIN (1-10)','Postal Code','2020 Census Tract','Legislative District','Base MSRP']
dta=df.drop(columns=nouse)  # pass the column names themselves, not a sub-DataFrame
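A minimal sketch of dropping columns by name, on a toy frame with invented values. It illustrates that `drop` returns a new DataFrame and leaves the original untouched, which is why the result is assigned to a new variable above.

```python
import pandas as pd

# Toy frame; the values are made up for illustration
toy = pd.DataFrame({
    "County": ["King", "Pierce"],
    "Base MSRP": [0, 0],
    "Postal Code": [98101.0, 98402.0],
})
unused = ["Base MSRP", "Postal Code"]

# columns= names the axis explicitly, so axis=1 is not needed
slim = toy.drop(columns=unused)
remaining = list(slim.columns)
print(remaining)
```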
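(2) Drop missing values
# drop from dta, the trimmed frame used from here on; dropping from df would leave dta unchanged
dta=dta.dropna()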
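In the `df.info()` output above, 'Vehicle Location' is the only column with missing entries (181055 non-null out of 181060). The sketch below shows one way to check that before dropping, and to restrict the drop to a chosen column with `subset`. The toy rows and coordinates are invented for illustration.

```python
import pandas as pd

# Toy frame with one missing location; values are made up
toy = pd.DataFrame({
    "County": ["King", "Pierce", "Clark"],
    "Vehicle Location": ["POINT (-122.3 47.6)", None, "POINT (-122.6 45.6)"],
})

# Count missing values per column before deciding what to drop
missing_per_column = toy.isna().sum()

# Drop only the rows where the named column is missing
cleaned = toy.dropna(subset=["Vehicle Location"])
rows_dropped = len(toy) - len(cleaned)
print(missing_per_column)
print(rows_dropped)
```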
2. Rename and simplify the contents
First, look at what the column with the long name contains:
dta['Clean Alternative Fuel Vehicle (CAFV) Eligibility'].value_counts()
Output:
Clean Alternative Fuel Vehicle (CAFV) Eligibility
Eligibility unknown as battery range has not been researched 94566
Clean Alternative Fuel Vehicle Eligible 66646
Not eligible due to low battery range 19843
dta.rename(columns={
    'Clean Alternative Fuel Vehicle (CAFV) Eligibility': 'CAFV'}, inplace=True)  # rename via a dict mapping
dta['isCAFV'] = dta['CAFV'].apply(
    lambda x: 'unknown' if x == 'Eligibility unknown as battery range has not been researched'
    else 'CAFV' if x == 'Clean Alternative Fuel Vehicle Eligible'
    else 'NOT')
dta['isCAFV'].value_counts()  # verify
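As an alternative to the chained conditional, the same relabeling can be sketched with `Series.map` and a dict. This is not the approach used above, just an option. One difference worth noting: the lambda sends every unmatched value to 'NOT', while `map` leaves NaN for values missing from the dict, so an unexpected category stays visible. The toy rows are invented; the three category strings are the ones shown in the output above.

```python
import pandas as pd

labels = {
    "Eligibility unknown as battery range has not been researched": "unknown",
    "Clean Alternative Fuel Vehicle Eligible": "CAFV",
    "Not eligible due to low battery range": "NOT",
}

# Toy column using the three category strings from the dataset
toy = pd.DataFrame({"CAFV": [
    "Clean Alternative Fuel Vehicle Eligible",
    "Not eligible due to low battery range",
    "Eligibility unknown as battery range has not been researched",
    "Clean Alternative Fuel Vehicle Eligible",
]})

# Look each value up in the dict; unmatched values would become NaN
toy["isCAFV"] = toy["CAFV"].map(labels)
result = toy["isCAFV"].tolist()
print(result)
```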
The other column, dta['Electric Vehicle Type'], can be simplified the same way.
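A possible sketch for that column. It assumes the values look like "Battery Electric Vehicle (BEV)" and "Plug-in Hybrid Electric Vehicle (PHEV)"; the post does not show them, so check with `value_counts()` first. The new column name `EVtype` is hypothetical. Instead of a conditional, this version pulls the abbreviation out of the parentheses.

```python
import pandas as pd

# Assumed category strings; verify against the real column before relying on them
toy = pd.DataFrame({"Electric Vehicle Type": [
    "Battery Electric Vehicle (BEV)",
    "Plug-in Hybrid Electric Vehicle (PHEV)",
    "Battery Electric Vehicle (BEV)",
]})

# Capture the word inside the parentheses; expand=False returns a Series
toy["EVtype"] = toy["Electric Vehicle Type"].str.extract(r"\((\w+)\)", expand=False)
ev_types = toy["EVtype"].tolist()
print(ev_types)
```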
datetime
dta['year']=pd.to_datetime(dta['Model Year'],format='%Y').dt.year  # datetime practice, optional; the format must be stated because the values hold only a year
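A small sketch of why the format matters here, using three invented model years. With `format="%Y"` each value is parsed as a calendar year; without it, pandas reads integers as offsets from the Unix epoch (nanoseconds by default), so every row lands in 1970. Casting to `str` first is a cautious choice in this sketch, not something the line above requires.

```python
import pandas as pd

# Invented model years for illustration
years = pd.Series([2013, 2020, 2023], name="Model Year")

# Parse the digits as a year; each value becomes January 1 of that year
parsed = pd.to_datetime(years.astype(str), format="%Y")
back = parsed.dt.year.tolist()
first = parsed.iloc[0]

# Without a format, integers are treated as nanoseconds since 1970-01-01
naive = pd.to_datetime(years)
naive_years = naive.dt.year.tolist()
print(back, naive_years)
```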
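II. Visualization
1. Top ten counties by number of electric vehicles
The plan is to nest a pie chart inside the bar chart (at a later stage).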
county_top=vcounts(dta['County'])[0:10];county_top
county_top_pair=[(k,v) for k,v in county_top.items()];county_top_pair
# pack the (county, count) pairs with a list comprehension
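A minimal sketch of the same top-N step on invented data, showing the two shapes a chart usually needs: a list of (label, value) pairs, and separate x and y lists. Converting counts to native Python `int` is a precaution assumed here, since some charting libraries serialize to JSON and may not accept NumPy integer types.

```python
import pandas as pd

# Invented county column for illustration
counties = pd.Series(["King", "King", "Pierce", "King", "Clark", "Pierce"])

# value_counts sorts in descending order, so head(2) is the top two
top = counties.value_counts().head(2)

# (label, value) pairs, with counts as plain Python ints
pairs = [(k, int(v)) for k, v in top.items()]

# Separate axis lists; tolist() also yields plain Python values
x_labels = top.index.tolist()
y_values = top.tolist()
print(pairs)
```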
from pyecharts.charts import Bar
county_bar=