机器学习-数据科学库 12 美国人口数据分析案例

本文通过数据准备、预处理及分析展示了美国各州的人口统计数据。通过对人口、面积等数据的整合与清洗,最终计算出了各州的人口密度,并确定了人口密度最高的州。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

数据准备

参考:https://github.com/jakevdp/data-USstates

import pandas as pd

# 导入文件,查看原始数据
# state 州的全称,abbreviation 州的简称
abb = pd.read_csv("./state-abbrevs.csv")

# state 州的全称,area 占地面积
area = pd.read_csv("./state-areas.csv")

# state/region 州的简称,ages 年龄,year 时间,population 人数
pop = pd.read_csv("./state-population.csv")

预处理

数据外联合并

# 将人口数据和各州简称数据进行合并(为了保证数据的完整性,采用外连接的方式)
abb_pop = pd.merge(abb, pop, left_on="abbreviation", right_on="state/region", how="outer")
print(abb_pop.head(5))

# 对合并后的数据删除重复列:abbreviation 和 state/region 等效
abb_pop.drop(labels="abbreviation", axis=1, inplace=True)
print(abb_pop.head(5))

输出:

     state abbreviation state/region     ages  year  population
0  Alabama           AL           AL  under18  2012   1117489.0
1  Alabama           AL           AL    total  2012   4817528.0
2  Alabama           AL           AL  under18  2010   1130966.0
3  Alabama           AL           AL    total  2010   4785570.0
4  Alabama           AL           AL  under18  2011   1125763.0
     state state/region     ages  year  population
0  Alabama           AL  under18  2012   1117489.0
1  Alabama           AL    total  2012   4817528.0
2  Alabama           AL  under18  2010   1130966.0
3  Alabama           AL    total  2010   4785570.0
4  Alabama           AL  under18  2011   1125763.0

空值处理

# 查看空值
print(abb_pop.isnull().any(axis=0))
print(abb_pop.info())

# 定位 state 为空的行
state_null_series = abb_pop.loc[abb_pop['state'].isnull()]['state/region']

# 查看去重后的 state
print(state_null_series.unique())

# USA 的全称对应的空值进行人工批量填充
usa_indexes = abb_pop.loc[abb_pop['state/region'] == 'USA'].index
abb_pop.loc[usa_indexes, 'state'] = 'United States'
print(abb_pop.loc[abb_pop['state/region'] == 'USA'].head(5))

# PR 的全称对应的空值进行人工批量填充
pr_indexes = abb_pop.loc[abb_pop['state/region'] == 'PR'].index
abb_pop.loc[pr_indexes, 'state'] = 'Puerto Rico'
print(abb_pop.loc[abb_pop['state/region'] == 'PR'].head(5))

输出:

state            True
state/region    False
ages            False
year            False
population       True
dtype: bool
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2544 entries, 0 to 2543
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   state         2448 non-null   object 
 1   state/region  2544 non-null   object 
 2   ages          2544 non-null   object 
 3   year          2544 non-null   int64  
 4   population    2524 non-null   float64
dtypes: float64(1), int64(1), object(3)
memory usage: 119.2+ KB
None
['PR' 'USA']
              state state/region     ages  year   population
2496  United States          USA  under18  1990   64218512.0
2497  United States          USA    total  1990  249622814.0
2498  United States          USA    total  1991  252980942.0
2499  United States          USA  under18  1991   65313018.0
2500  United States          USA  under18  1992   66509177.0

            state state/region     ages  year  population
2448  Puerto Rico           PR  under18  1990         NaN
2449  Puerto Rico           PR    total  1990         NaN
2450  Puerto Rico           PR    total  1991         NaN
2451  Puerto Rico           PR  under18  1991         NaN
2452  Puerto Rico           PR    total  1993         NaN

数据分析

# 合并各州面积数据areas
abb_pop_area = pd.merge(abb_pop, area, left_on='state', right_on='state', how='outer')

# 去掉 area 缺失的行
abb_pop_area_null_indexes = abb_pop_area[abb_pop_area['area (sq. mi)'].isnull()].index
print(abb_pop_area.iloc[abb_pop_area_null_indexes].head(5))
abb_pop_area.drop(labels=abb_pop_area_null_indexes, axis=0, inplace=True)

# 找出2010年的全民人口数据
print(abb_pop_area.query('ages == "total" & year == 2010').head(5))

# 计算各州的人口密度(人口除以面积)
abb_pop_area['density'] = abb_pop_area['population'] / abb_pop_area['area (sq. mi)']
print(abb_pop_area.head(5))

# 排序,并找出人口密度最高的州
most_density_state_str = abb_pop_area.sort_values(by='density', axis=0, ascending=False).iloc[0]['state']
print(most_density_state_str)

输出:

              state state/region     ages  year   population  area (sq. mi)
2496  United States          USA  under18  1990   64218512.0            NaN
2497  United States          USA    total  1990  249622814.0            NaN
2498  United States          USA    total  1991  252980942.0            NaN
2499  United States          USA  under18  1991   65313018.0            NaN
2500  United States          USA  under18  1992   66509177.0            NaN
          state state/region   ages  year  population  area (sq. mi)
3       Alabama           AL  total  2010   4785570.0        52423.0
91       Alaska           AK  total  2010    713868.0       656425.0
101     Arizona           AZ  total  2010   6408790.0       114006.0
189    Arkansas           AR  total  2010   2922280.0        53182.0
197  California           CA  total  2010  37333601.0       163707.0
     state state/region     ages  year  population  area (sq. mi)    density
0  Alabama           AL  under18  2012   1117489.0        52423.0  21.316769
1  Alabama           AL    total  2012   4817528.0        52423.0  91.897221
2  Alabama           AL  under18  2010   1130966.0        52423.0  21.573851
3  Alabama           AL    total  2010   4785570.0        52423.0  91.287603
4  Alabama           AL  under18  2011   1125763.0        52423.0  21.474601
District of Columbia
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值