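Contents
3. Get an overview of the new data with df.info() and check for missing data
4. Quickly summarize the numeric attributes with statistical measures using df.describe()
5. Handle null values. Some fields may be empty because the donor forgot to fill them in, or for confidentiality or other reasons; fill them with NOT PROVIDE
10. View the total political donations (contb_receipt_amt) received by each party
11. View the total political donations (contb_receipt_amt) received by each party on each day
13. Find out whom veterans (donor occupation DISABLED VETERAN) mainly support
14. For each candidate, find the occupation and donation amount of the donor who gave the most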
Requirements:
1. Load the data and view its basic information
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
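# 1. Load the data and view its basic information
# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False,
# which was deprecated in pandas 1.3 and removed in pandas 2.0
df=pd.read_csv('./data/usa_election.txt',on_bad_lines='skip')
print(df.head())
# df.info() prints its report itself and returns None, so print() also shows "None"
print(df.info())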
Output
cmte_id cand_id cand_nm ... memo_text form_tp file_num
0 C00410118 P20002978 Bachmann, Michelle ... NaN SA17A 736166
1 C00410118 P20002978 Bachmann, Michelle ... NaN SA17A 736166
2 C00410118 P20002978 Bachmann, Michelle ... NaN SA17A 749073
3 C00410118 P20002978 Bachmann, Michelle ... NaN SA17A 749073
4 C00410118 P20002978 Bachmann, Michelle ... NaN SA17A 736166
[5 rows x 16 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cmte_id 536041 non-null object
1 cand_id 536041 non-null object
2 cand_nm 536041 non-null object
3 contbr_nm 536041 non-null object
4 contbr_city 536026 non-null object
5 contbr_st 536040 non-null object
6 contbr_zip 535973 non-null object
7 contbr_employer 525088 non-null object
8 contbr_occupation 530520 non-null object
9 contb_receipt_amt 536041 non-null float64
10 contb_receipt_dt 536041 non-null object
11 receipt_desc 8479 non-null object
12 memo_cd 49718 non-null object
13 memo_text 52740 non-null object
14 form_tp 536041 non-null object
15 file_num 536041 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 65.4+ MB
None
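To illustrate what skipping malformed lines means when loading, here is a minimal sketch on a small in-memory CSV. The rows and the names `raw` and `sample` are made up for illustration and are not taken from the election file; it assumes pandas 1.3 or later, where `on_bad_lines` is available.

```python
import io
import pandas as pd

# Hypothetical sample data: the third line has one field too many
raw = io.StringIO(
    "cand_nm,contbr_st,contb_receipt_amt\n"
    "Candidate A,AL,250.0\n"
    "Candidate B,AL,50.0,unexpected extra field\n"
    "Candidate A,FL,100.0\n"
)

# Lines with too many fields are dropped instead of raising a ParserError
sample = pd.read_csv(raw, on_bad_lines="skip")
print(sample.shape)
```

Skipped lines are dropped silently, so it can be worth comparing the row count with the line count of the raw file when that matters.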
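2. Select a subset of the data: extract the following fields and discard the rest
cand_nm: candidate name
contbr_nm: donor name
contbr_st: donor's state
contbr_employer: donor's employer
contbr_occupation: donor's occupation
contb_receipt_amt: donation amount (USD)
contb_receipt_dt: donation date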
# 2. Select a subset of the data: extract the following fields and discard the rest
df=df[['cand_nm','contbr_nm','contbr_st','contbr_employer','contbr_occupation','contb_receipt_amt','contb_receipt_dt']]
print(df.head())
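As an alternative to selecting columns after loading, the same subset can be requested at read time with the `usecols` parameter of `pd.read_csv`, which avoids holding the unused columns in memory. The following is only a sketch on made-up in-memory data; `keep`, `raw`, and `subset` are hypothetical names.

```python
import io
import pandas as pd

# Columns to keep (a shortened, illustrative version of the field list)
keep = ["cand_nm", "contbr_st", "contb_receipt_amt"]

# Hypothetical sample data with two extra columns that should be discarded
raw = io.StringIO(
    "cmte_id,cand_nm,contbr_st,contb_receipt_amt,file_num\n"
    "C001,Candidate A,AL,250.0,1\n"
    "C001,Candidate B,FL,50.0,2\n"
)

# Only the listed columns are parsed; they come back in file order
subset = pd.read_csv(raw, usecols=keep)
print(subset.columns.tolist())
```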
3. Get an overview of the new data with df.info() and check for missing data
# 3. Get an overview of the new data with df.info() and check for missing data
print(df.info())
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cand_nm 536041 non-null object
1 contbr_nm 536041 non-null object
2 contbr_st 536040 non-null object
3 contbr_employer 525088 non-null object
4 contbr_occupation 530520 non-null object
5 contb_receipt_amt 536041 non-null float64
6 contb_receipt_dt 536041 non-null object
dtypes: float64(1), object(6)
memory usage: 28.6+ MB
None
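Comparing the non-null counts with the 536041 total rows shows that contbr_st is missing 1 value, contbr_employer 10953, and contbr_occupation 5521. Instead of subtracting by hand, the missing values can be counted directly with `isnull().sum()`. The sketch below uses a small made-up DataFrame named `sample` rather than the real data.

```python
import pandas as pd

# Hypothetical sample with some missing entries
sample = pd.DataFrame({
    "contbr_st": ["AL", None, "FL"],
    "contbr_employer": [None, None, "RETIRED"],
    "contb_receipt_amt": [250.0, 50.0, 100.0],
})

# Count missing values per column, then keep only columns that have any
missing = sample.isnull().sum()
print(missing[missing > 0])
```

On the real data the same call would be `df.isnull().sum()`.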