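Articles in this series
[Study Notes] Chen Qiang, Machine Learning with Python, Ch4: Linear Regression
[Study Notes] Chen Qiang, Machine Learning with Python, Ch5: Logistic Regression
[Exercise Practice] Chen Qiang, Machine Learning with Python, Ch5: Logistic Regression (SAheart.csv)
[Study Notes] Chen Qiang, Machine Learning with Python, Ch6: Multinomial Logistic Regression
[Study Notes and Exercise Practice] Chen Qiang, Machine Learning with Python, Ch7: Discriminant Analysis
[Study Notes] Chen Qiang, Machine Learning with Python, Ch8: Naive Bayes
[Study Notes] Chen Qiang, Machine Learning with Python, Ch9: Penalized Regression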
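Preface
I keep these study notes mainly so that I do not forget the material, and I share them for fellow learners to consult. If you have a different opinion or a suggestion, friendly discussion is welcome.
All code and data used in these notes can be downloaded from Prof. Chen Qiang's personal homepage.
Reference book: Chen Qiang (陈强). 机器学习及Python应用 (Machine Learning and Python Applications). Beijing: Higher Education Press, 2021.
For the mathematical principles, see Prof. Chen Qiang's PPT slides.
Also consulted:
The post "Penalized Regression in Machine Learning: A Python Implementation (with complete code)" by user 带我去滑雪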
I. The student-mat.csv dataset
student-mat.csv is the Portuguese secondary-school mathematics grades dataset from the UCI Machine Learning Repository.
Variables:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
Response variables:
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
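II. Exercises
1. Loading the data
(1) Because this CSV file is delimited by semicolons (";"), load the data with pd.read_table('student-mat.csv', sep=';'), then examine the shape of the data frame and its first 5 observations.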
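# Import modules
import numpy as np
import pandas as pd
# Load the data (the file is semicolon-delimited, so sep=';' is required)
student_mat = pd.read_table(r'D:\桌面文件\Python\【陈强-机器学习】MLPython-PPT-PDF\MLPython_Data\student-mat.csv', sep=';')
# Basic information about the dataset: shape and first 5 rows
print(student_mat.shape)
student_mat.head()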
Output: (395, 33), i.e. 395 observations and 33 variables.
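Note: pd.read_table()
pd.read_table() reads a delimited text file into a DataFrame and works much like pd.read_csv().
The difference is the default delimiter: pd.read_table() uses a tab (\t), whereas pd.read_csv() uses a comma (,).
pd.read_table() is therefore aimed at tab-delimited text files, but it can read files with other delimiters once the sep parameter is set.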
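To make the role of the delimiter concrete, here is a minimal sketch that reads a tiny semicolon-delimited sample from memory instead of from disk. The three-column sample text and the variable names (sample, raw, df_table, df_csv) are made up for illustration and are not part of student-mat.csv or the book's code.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited sample (not the real student-mat.csv).
sample = "school;sex;age\nGP;F;18\nGP;M;17\n"

# With the default tab delimiter, each line is read as one single field.
raw = pd.read_table(io.StringIO(sample))
print(raw.shape)

# Passing sep=';' splits every line into the three intended columns.
df_table = pd.read_table(io.StringIO(sample), sep=';')
print(df_table.shape)

# pd.read_csv with the same sep parses the sample identically.
df_csv = pd.read_csv(io.StringIO(sample), sep=';')
print(df_table.equals(df_csv))
```

If a freshly loaded data frame shows only one column, a wrong delimiter is a likely cause and is worth checking first.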
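# Basic syntax and parameters
import pandas as pd
# Read the file
df = pd.read_table(
'file_path.txt',
sep=',', # Field delimiter. Defaults to \t (tab); set it when the file uses another delimiter, such as a comma (,) or a semicolon (;).
header=None, # Defaults to 'infer' (detected automatically). Set it to None when the file has no header row.
names=['col1', 'col2', 'col3'], # Column names. Supply them here when the file has no header row.
index_col=0, # Column to use as the index, given either by name or by position.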
usecols=['col1', 'col3'