Flowchart:

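1. Read the data table
The Adult income dataset was extracted by Ronny Kohavi and Barry Becker from the 1994 US Census Bureau database. It contains the annual income of 32,561 adults together with 14 related attributes. The dataset can be used for income prediction: the task is to determine whether a person's annual income exceeds 50,000 US dollars. First, read the dataset and look at its first five rows.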
| age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
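The dataset has 15 variables. Nine are categorical: workclass (work class), education (education level), marital-status (marital status), occupation, relationship (role in the family), race, sex, native-country (country of origin) and salary (income). Six are numeric: age, fnlwgt (the census sampling weight, not a serial number), education-num (years of education), capital-gain, capital-loss and hours-per-week (hours worked per week).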
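Reading the table and printing its head can be sketched as follows. This is a minimal sketch, not the platform's own implementation: it assumes a headerless file with fields separated by a comma and a space, and it uses an inline string (`SAMPLE`, a made-up name) holding the first two rows of the table above in place of the real file.

```python
import csv
import io

# Column names as they appear in the table above.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "salary",
]

# Inline stand-in for the data file (assumption: no header line,
# fields separated by ", "). Replace with open(<your file>) in practice.
SAMPLE = """\
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
"""

def read_rows(handle):
    """Parse comma-separated lines into dicts keyed by column name."""
    reader = csv.reader(handle, skipinitialspace=True)
    # Blank lines come back as empty lists and are skipped.
    return [dict(zip(COLUMNS, fields)) for fields in reader if fields]

rows = read_rows(io.StringIO(SAMPLE))
for row in rows[:5]:  # "head": at most the first five rows
    print(row)
```

All values are read as strings here; the six numeric columns would still need to be converted with `int()` before any arithmetic.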
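2. Missing value detection
Next, look at the variables in more detail and check whether the data contain missing values.
Missing values per column:
| Column | Missing values |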
|---|---|
| age | 0 |
| workclass | 0 |
| fnlwgt | 0 |
| education | 0 |
| education-num | 0 |
| marital-status | 0 |
| occupation | 0 |
| relationship | 0 |
| race | 0 |
| sex | 0 |
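Rows filtered out for missing values: 0
Missing value detection finds no missing values. Inspecting the dataset, however, shows that three variables contain an abnormal value that the check does not catch: the categorical columns workclass, occupation and native-country use ? for unknown entries. These are handled next by replacing every ? in the three columns with unknown.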
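The difference between a truly empty cell and a ? placeholder, and the replacement itself, can be sketched as follows. The records and the helper names (`count_cells`, `replace_placeholder`) are hypothetical and only illustrate the idea; they are not taken from the original workflow.

```python
# Hypothetical mini-sample; only the three affected columns are shown.
rows = [
    {"workclass": "State-gov", "occupation": "Adm-clerical", "native-country": "United-States"},
    {"workclass": "?", "occupation": "?", "native-country": "United-States"},
    {"workclass": "Private", "occupation": "Prof-specialty", "native-country": "?"},
]
TARGET_COLUMNS = ["workclass", "occupation", "native-country"]

def count_cells(rows, predicate):
    """Count, per column, the cells for which predicate(value) is true."""
    return {
        column: sum(1 for row in rows if predicate(row[column]))
        for column in rows[0]
    }

def replace_placeholder(rows, columns, placeholder="?", replacement="unknown"):
    """Return new rows with the placeholder swapped out in the given columns."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        for column in columns:
            if new_row[column].strip() == placeholder:
                new_row[column] = replacement
        cleaned.append(new_row)
    return cleaned

empty = count_cells(rows, lambda v: v.strip() == "")    # truly empty cells
marked = count_cells(rows, lambda v: v.strip() == "?")  # "?" placeholders
cleaned = replace_placeholder(rows, TARGET_COLUMNS)
remaining = count_cells(cleaned, lambda v: v.strip() == "?")

print(empty)      # no empty cells in this sample
print(marked)     # one "?" in each of the three columns
print(remaining)  # no "?" left after the replacement
```

Keeping unknown as its own category, rather than dropping the rows, preserves all 32,561 records for the later steps.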
3. Replace abnormal values in workclass
Replace the abnormal values in workclass (work class).
| age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
The abnormal values in workclass have been replaced.
4. Replace abnormal values in occupation
Replace the abnormal values in occupation.
| age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
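The abnormal values in occupation have been replaced.
5. Replace abnormal values in native-country
Replace the abnormal values in native-country (country of origin).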
| age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
The abnormal values in native-country have been replaced.
6. Basic statistics per field
Look at the basic statistics of the data in the dataset.
| Statistic | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Count | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 |
| Distinct values |  | 9 |  | 16 |  | 7 | 15 | 6 | 5 | 2 |  |  |  | 42 | 2 |
| Mode |  | Private |  | HS-grad |  | Married-civ-spouse | Prof-specialty | Husband | White | Male |  |  |  | United-States | <=50K |
| Mode frequency |  | 22696 |  | 10501 |  | 14976 | 4140 | 13193 | 27816 | 21790 |  |  |  | 29170 | 24720 |
| Mean | 38.5816467553 |  | 189778.366512085 |  | 10.0806793403 |  |  |  |  |  | 1077.6488437087 | 87.303829735 | 40.4374558521 |  |  |
| Standard deviation | 13.6404325536 |  | 105549.9776970222 |  | 2.5727203321 |  |  |  |  |  | 7385.2920848403 | 402.960218649 | 12.3474286817 |  |  |
| Minimum | 17 |  | 12285 |  | 1 |  |  |  |  |  | 0 | 0 | 1 |  |  |
| Lower quartile | 28 |  | 117827 |  | 9 |  |  |  |  |  | 0 | 0 | 40 |  |  |
| Median | 37 |  | 178356 |  | 10 |  |  |  |  |  | 0 | 0 | 40 |  |  |
| Upper quartile | 48 |  | 237051 |  | 12 |  |  |  |  |  | 0 | 0 | 45 |  |  |
| Maximum | 90 |  | 1484705 |  | 16 |  |  |  |  |  | 99999 | 4356 | 99 |  |  |
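The statistics confirm that age, fnlwgt (sampling weight), education-num (years of education), capital-gain, capital-loss and hours-per-week are numeric variables, and that all the others are categorical. Among the numeric variables, fnlwgt, capital-gain and capital-loss are widely spread. For capital-gain and capital-loss the maximum is roughly 93 and 50 times the mean, and three quarters of the records have a value of 0. For fnlwgt the maximum is about 8 times the mean.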
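The spread of these three columns can be checked with a little arithmetic on the table above. The sketch below only divides each maximum by the corresponding mean; the numbers are copied from the statistics table.

```python
# (mean, max) pairs copied from the statistics table above.
stats = {
    "fnlwgt": (189778.366512085, 1484705),
    "capital-gain": (1077.6488437087, 99999),
    "capital-loss": (87.303829735, 4356),
}

# How many times larger than the mean is the maximum?
ratios = {name: maximum / mean for name, (mean, maximum) in stats.items()}

for name, ratio in ratios.items():
    print(f"{name}: max is about {ratio:.1f} times the mean")
```

A large ratio combined with a median of 0 is a sign of a heavily right-skewed column, which is worth keeping in mind when the features are scaled later.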
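First, visualize the label column salary (individual annual income). A pie chart and a bar chart of its distribution show directly how the data are distributed, which helps with the later modeling.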

Plot a bar chart of salary to inspect the frequency distribution.
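The bar heights and pie shares for salary can be recovered from the statistics table in step 6, which gives the total number of records and the frequency of the mode <=50K. A small sketch of that arithmetic:

```python
TOTAL = 32561       # number of records, from the statistics table
LOW_INCOME = 24720  # frequency of the mode "<=50K", same table

# salary has only two distinct values, so the rest must be ">50K".
counts = {"<=50K": LOW_INCOME, ">50K": TOTAL - LOW_INCOME}
shares = {label: count / TOTAL for label, count in counts.items()}

for label in counts:
    print(f"{label}: {counts[label]} records, {shares[label]:.1%}")
```

The two classes are clearly imbalanced, roughly three to one, which matters when reading the precision and recall figures later on.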

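Because work class is likely to be related to individual income, visualize workclass as well: count the records in each work class, plot a bar chart, and compare the income shares across work classes.
Plot a bar chart of education-num (years of education) to inspect its distribution.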

Plot a bar chart of salary against sex to analyze the relationship between the two.
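The numbers behind such a grouped bar chart are a cross-tabulation of the two columns. The sketch below shows one way to compute it. The records are invented purely to make the example runnable and say nothing about the real data; `share_high_income` is a hypothetical helper name.

```python
from collections import Counter

# Invented (sex, salary) pairs, for illustration only.
records = [
    ("Male", "<=50K"), ("Male", "<=50K"), ("Male", ">50K"), ("Male", ">50K"),
    ("Female", "<=50K"), ("Female", "<=50K"), ("Female", "<=50K"), ("Female", ">50K"),
]

pair_counts = Counter(records)                   # one count per (sex, salary) cell
sex_totals = Counter(sex for sex, _ in records)  # row totals per sex

def share_high_income(sex):
    """Fraction of records of the given sex whose salary is '>50K'."""
    return pair_counts[(sex, ">50K")] / sex_totals[sex]

for sex in sorted(sex_totals):
    cells = {label: pair_counts[(sex, label)] for label in ("<=50K", ">50K")}
    print(sex, cells, f"share >50K = {share_high_income(sex):.2f}")
```

The same pattern applies to workclass or any other categorical column: replace the first element of each pair with that column's value.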



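17. Logistic regression
Train a logistic regression model on the training set. The coefficients of the individual features are shown in the table below:
Coefficients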
| age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.564702 | -0.196987 | 0.032008 | 0.050004 | 0.854733 | -0.431347 | 0.012287 | -0.09928 | 0.102477 | 0.467217 | 2.338441 | 0.265758 | 0.423848 | -0.001297 |
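Capital gain has the largest coefficient (2.338), followed by years of education (0.855) and age (0.565). This matches everyday intuition: people who have capital gains and more years of education generally have higher incomes. The next step is model prediction.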
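To read the coefficient table more systematically, the features can be ranked by the absolute size of their coefficient and each coefficient converted into an odds ratio. The sketch below does this with the values copied from the table. It assumes the features were put on a common scale before training, so that one unit means the same thing for every feature; the source does not state this explicitly.

```python
import math

# Coefficients copied from the table above.
coefficients = {
    "age": 0.564702, "workclass": -0.196987, "fnlwgt": 0.032008,
    "education": 0.050004, "education-num": 0.854733,
    "marital-status": -0.431347, "occupation": 0.012287,
    "relationship": -0.09928, "race": 0.102477, "sex": 0.467217,
    "capital-gain": 2.338441, "capital-loss": 0.265758,
    "hours-per-week": 0.423848, "native-country": -0.001297,
}

# Rank features by the absolute size of their coefficient.
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)

for name, coef in ranked[:5]:
    # exp(coef) is the factor by which the odds of ">50K" change when the
    # feature increases by one unit and all other features stay fixed.
    print(f"{name:15s} coef={coef:+.3f} odds ratio={math.exp(coef):.2f}")
```

Coefficients of encoded categorical columns such as workclass or occupation depend on how the categories were mapped to numbers, so their sign and size should be interpreted with more caution than those of the numeric columns.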
18. Model prediction
Use the trained logistic regression model to predict on the test set. The results are as follows:
| age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | salary_predict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -0.8490804496 | -0.2379060115 | -0.1199390202 | 1.2148687394 | -0.0313600271 | -1.7340583484 | -1.4835817968 | 1.5893223617 | 0.3936675268 | -1.4223307593 | -0.1459204836 | -0.216659527 | 0.2885296159 | 0.2513776468 | 0 | 0 |
| -0.8490804496 | -0.2379060115 | 0.2529895706 | -0.335436928 | 1.1347387638 | 0.9216339465 | 0.5956350429 | -0.2778050392 | 0.3936675268 | -1.4223307593 | -0.1459204836 | -0.216659527 | -0.035429447 | 0.2513776468 | 0 | 0 |
| -0.9957056174 | -0.2379060115 | 0.6298973802 | -0.8522054837 | 0.7460391668 | -0.4062122009 | 1.0576832296 | -0.9001808395 | 0.3936675268 | 0.703071345 | -0.1459204836 | -0.216659527 | -0.035429447 | 0.2513776468 | 0 | 0 |
| 0.5438586447 | -0.2379060115 | -0.3992328043 | -1.6273583174 | -2.7522572057 | -0.4062122009 | 1.5197314162 | -0.9001808395 | -4.3189090683 | 0.703071345 | -0.1459204836 | 4.503481865 | -0.035429447 | 0.2513776468 | 0 | 0 |
| 0.4705460608 | -0.2379060115 | -0.1606502177 | -2.4025111511 | -1.1974588179 | -1.7340583484 | 1.5197314162 | -0.2778050392 | 0.3936675268 | 0.703071345 | -0.1459204836 | 6.791584054 | 2.8802021189 | 0.2513776468 | 1 | 1 |
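The feature values in this table are no longer raw values such as ages in years. They look like z-scores, that is, each value minus the column mean, divided by the column standard deviation. The sketch below tests that reading for the age column. It assumes the mean and standard deviation from the statistics table in step 6 were used for scaling, which is an inference from the numbers and not something the source states.

```python
AGE_MEAN = 38.5816467553  # mean of age, from the statistics table
AGE_STD = 13.6404325536   # standard deviation of age, same table

def standardize(value, mean, std):
    """z-score: how many standard deviations a value lies from the mean."""
    return (value - mean) / std

def unstandardize(z, mean, std):
    """Inverse of standardize."""
    return z * std + mean

# Standardized age values from the five rows shown above.
z_ages = [-0.8490804496, -0.8490804496, -0.9957056174, 0.5438586447, 0.4705460608]
ages = [unstandardize(z, AGE_MEAN, AGE_STD) for z in z_ages]
print([round(age, 2) for age in ages])
```

Under this assumption the five values map back to numbers very close to whole years, which is consistent with the z-score reading. The encoded categorical columns appear to have been scaled in the same way.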
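19. Classification model evaluation
Compare the predictions with the true values to evaluate the logistic regression model. The resulting classification report and confusion matrix are as follows:
Classification report
| Label | Precision | Recall | F1-score |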
|---|---|---|---|
| 0.0 | 0.91 | 0.76 | 0.83 |
| 1.0 | 0.51 | 0.77 | 0.62 |
| accuracy | 0.76 | 0.76 | 0.76 |
| macro avg | 0.71 | 0.77 | 0.72 |
| weighted avg | 0.81 | 0.76 | 0.78 |
Confusion matrix


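The classification report shows that precision for class 0 (salary <=50K) is as high as 0.91, while precision for class 1 (salary >50K) is only 0.51. A likely reason is that people with salary >50K make up less than 25% of the data, so a moderate number of false positives from the much larger class 0 is enough to pull the precision of class 1 down. The model's AUC is 0.85, which indicates reasonably good predictive performance overall.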
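How precision, recall and F1 follow from the confusion matrix, and why a smaller positive class tends to have lower precision, can be sketched with a toy example. The labels below are hypothetical (8 negatives, 4 positives) and are not the model's actual predictions; the helper names are likewise made up for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the positive class (0.0 if undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical labels: 8 negatives (0) and 4 positives (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case three of the four positives are found, yet three false positives from the larger negative class are enough to halve the precision. This is the same pattern as in the report above, where class 1 has a recall of 0.77 but a precision of 0.51.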
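Summary
In this case study we first ran missing value detection, then found abnormal values by inspecting the raw data and replaced them. Next we explored how annual income relates to sex, work class and other variables, and described these relationships with visualizations. Finally, after feature encoding, we used logistic regression to predict individual annual income, with fairly good classification results.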