[Exercise Solutions] Chen Qiang - Machine Learning - Python - Ch9 Penalized Regression (student-mat.csv)

赛博机器喵

Last modified 2024-08-19 21:10:38

Category: Chen Qiang - Machine Learning - Python. Tags: machine learning, python, study notes

First published 2024-08-15 20:46:09

Article link: https://blog.youkuaiyun.com/2201_76026029/article/details/141215818


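Articles in this series

[Study Notes] Chen Qiang - Machine Learning - Python - Ch4 Linear Regression
[Study Notes] Chen Qiang - Machine Learning - Python - Ch5 Logistic Regression
[Exercise Solutions] Chen Qiang - Machine Learning - Python - Ch5 Logistic Regression (SAheart.csv)
[Study Notes] Chen Qiang - Machine Learning - Python - Ch6 Multinomial Logistic Regression
[Study Notes and Exercise Solutions] Chen Qiang - Machine Learning - Python - Ch7 Discriminant Analysis
[Study Notes] Chen Qiang - Machine Learning - Python - Ch8 Naive Bayes
[Study Notes] Chen Qiang - Machine Learning - Python - Ch9 Penalized Regression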

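Preface

I keep these study notes mainly so that I don't forget the material, and I share them for fellow learners to consult. If you have a different view or a suggestion, friendly discussion is welcome.

All the code and data used in these notes can be downloaded from Professor Chen Qiang's personal homepage.

Reference book: 陈强. 机器学习及Python应用. 北京: 高等教育出版社, 2021. (Chen Qiang, Machine Learning and Python Applications, Beijing: Higher Education Press, 2021.)

For the mathematical background, see Professor Chen Qiang's slides (PPT).

Also consulted:
the post "Machine Learning: Penalized Regression, Implemented in Python (with complete code)" by fellow blogger 带我去滑雪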

I. The student-mat.csv dataset

student-mat.csv contains math grades of Portuguese secondary-school students, from the UCI Machine Learning Repository.
Variables:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)

Response variables:
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
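
The list above mixes numeric columns with text-coded binary and nominal ones. As a minimal sketch of how to tell them apart after loading, the snippet below uses a hypothetical three-row sample with only a few of the columns (the values are made up for illustration, not taken from the real file) and separates the grade columns from the predictors.

```python
import io
import pandas as pd

# Hypothetical sample: a handful of columns in the file's semicolon-delimited
# layout. The rows are illustrative and are NOT real records from the dataset.
sample = (
    "school;sex;age;Mjob;studytime;G1;G2;G3\n"
    "GP;F;18;at_home;2;5;6;6\n"
    "GP;M;17;teacher;3;12;11;10\n"
    "MS;F;16;other;1;14;15;15\n"
)
df = pd.read_csv(io.StringIO(sample), sep=';')

# Columns parsed as numbers versus columns kept as text (binary/nominal codes)
numeric_cols = df.select_dtypes(include='number').columns.tolist()
text_cols = df.select_dtypes(exclude='number').columns.tolist()

# Separate the three grade columns from everything else
responses = ['G1', 'G2', 'G3']
predictors = [c for c in df.columns if c not in responses]

print(numeric_cols)
print(text_cols)
print(predictors)
```

On the full file the same calls should apply unchanged; only the column lists get longer.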

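II. Exercises

1. Loading the data

(1) Because this CSV file is delimited by semicolons (";"), load the data with the command pd.read_table('student-mat.csv', sep=';'), then examine the shape of the data frame and its first 5 observations.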

# Import modules
import numpy as np
import pandas as pd
# Load the data (the file is semicolon-delimited)
student_mat = pd.read_table(r'D:\桌面文件\Python\【陈强-机器学习】MLPython-PPT-PDF\MLPython_Data\student-mat.csv', sep=';')
# Basic information about the dataset
print(student_mat.shape)
student_mat.head()

Output: (395, 33), i.e. 395 observations and 33 variables.

`Note: pd.read_table()`

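pd.read_table() reads a delimited text file into a DataFrame, much like pd.read_csv().
The two differ in their default delimiter: pd.read_table() uses a tab (\t), while pd.read_csv() uses a comma (,).
Although pd.read_table() is meant for tab-delimited files, it can read files with any other delimiter once the sep parameter is set.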
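
To make the comparison concrete, here is a minimal sketch that reads the same semicolon-delimited text with both functions. The three-row sample is hypothetical (made-up values in the file's layout) and is read from an in-memory buffer, so the snippet runs without the actual file.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited sample; the values are illustrative only
sample = "school;sex;age;G3\nGP;F;18;6\nGP;M;17;10\nMS;F;16;15\n"

# With sep=';' the two readers produce the same DataFrame
df_table = pd.read_table(io.StringIO(sample), sep=';')
df_csv = pd.read_csv(io.StringIO(sample), sep=';')

# Without sep, read_table splits on tabs; this text has none,
# so each whole line ends up in a single column
df_default = pd.read_table(io.StringIO(sample))

print(df_table.shape)
print(df_table.equals(df_csv))
print(df_default.shape)
```

If a freshly loaded frame has a single column whose name still contains the delimiters, a wrong sep is the likely cause.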

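# Basic syntax and parameters
import pandas as pd
# Read the file
df = pd.read_table(
    'file_path.txt',
    sep=',',  # Field delimiter. Defaults to \t (tab); set it explicitly when the file uses another delimiter, such as a comma (,) or a semicolon (;).
    header=None,  # Defaults to 'infer' (detected automatically). Set to None when the file has no header row.
    names=['col1', 'col2', 'col3'],  # Column names to use; supply them here when the file has no header row.
    index_col=0,  # Column to use as the index, given by name or by position.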
    usecols=['col1', 'col3'