Ames House Price Prediction

oleg

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/sinat_37554070/article/details/80229966

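This article walks through a Kaggle competition task, Ames house price prediction. After careful data exploration, feature engineering, model building, and evaluation on the raw dataset, predictions are made with ridge regression, ordinary least squares, and Lasso regression.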

oleg, AI Intensive Training Class, 5th cohort

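Ames house price prediction is a competition task on Kaggle. The raw training set has 81 columns (including Id and the target SalePrice): 38 are numeric and 43 are non-numeric, and some of them contain missing values.

The workflow in this post is: data exploration, feature engineering, model building, and model evaluation.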

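1 Data exploration

Exploring the data shows how each feature behaves and how the features relate to one another, which prepares the ground for the feature engineering in step 2.

1.1 First, import the packages and load the data

Import the packages: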
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
from scipy import sparse
import matplotlib.pyplot as plt
Load the data:
train=pd.read_csv('Ames_House_train.csv')#adjust the path to your own setup
test=pd.read_csv('Ames_House_test.csv')
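1.2 Exploring basic information about the data

Use pandas' DataFrame.head() and DataFrame.info() for a first look at the data
train.head()
head() returns the first 5 rows of the data
train.info()
info() reports the dtype of every column and how many non-null values it has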
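
To see at a glance which columns contain missing values, the per-column counts can be summarized directly. A minimal sketch, using a tiny made-up frame as a stand-in for `train` (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train`; the values are made up for illustration.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, "Pave", np.nan],
    "GrLivArea": [1710, 1262, 1786, 1717],
})

# Count missing values per column and keep only the columns that have any.
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

On the real data, replacing `df` with `train` would give the list of columns that section 2.2 has to deal with.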

1.3 Exploring a single feature

Here we look at the distribution of the target variable y
train_y=train['SalePrice']
fig=plt.figure()
plt.hist(train_y)
plt.show()
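The histogram shows that the target does not follow a normal distribution very well. Because many results in classical statistics are derived under the assumption that the random variable is normally distributed, the target will be transformed toward a normal distribution during feature engineering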
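
Besides looking at the histogram, the departure from normality can be quantified with the sample skewness (0 for a symmetric distribution). A small sketch on synthetic log-normal "prices", which are an assumption standing in for the real `SalePrice` column:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (log-normal); not the real SalePrice column.
prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)

skew_raw = skew(prices)           # clearly positive: long right tail
skew_log = skew(np.log(prices))   # close to 0 after taking logs
print(f"skew before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```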

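1.4 Exploring relationships between pairs of features

Here we look at the relationship between the feature GrLivArea and the house price SalePrice
#Relationship between a feature and the target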
plt.scatter(train['GrLivArea'],train_y)
plt.show()
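The plot shows a few outliers in the lower-right corner: a large living area but a low price, which does not fit the general trend. Such houses are presumably far from the city center, and these outliers should be removed during feature engineering.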
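
The post does not show code for dropping these points. One possible way is a boolean mask on both columns; the sketch below uses a toy frame and hypothetical thresholds (`4000` and `300000`) that would have to be read off the actual scatter plot:

```python
import pandas as pd

# Toy stand-in for `train`; values are illustrative only.
df = pd.DataFrame({
    "GrLivArea": [1500, 2000, 4700, 5600, 4300],
    "SalePrice": [180000, 250000, 185000, 160000, 750000],
})

# Hypothetical thresholds chosen by eye: very large living area but low price.
is_outlier = (df["GrLivArea"] > 4000) & (df["SalePrice"] < 300000)
cleaned = df.loc[~is_outlier].reset_index(drop=True)
print(cleaned)
```

The large but expensive house (4300, 750000) is kept, since it follows the overall trend.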

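Exploring the pairwise correlation between features
#Pairwise correlation between features
corr=train.corr(numeric_only=True)#only numeric columns have a correlation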
plt.subplots(figsize=(20,18))
sns.heatmap(corr,annot=True)
plt.show()
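If two features are strongly correlated, one of them can be dropped; to keep as much information as possible, it is also common not to drop either.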
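
With many features the heatmap gets hard to read, so it can help to list the strongly correlated pairs explicitly. A minimal sketch on synthetic columns; the 0.8 cutoff is an arbitrary assumption, not a value from the post:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "b": a * 2.0 + rng.normal(scale=0.05, size=n),  # nearly a copy of "a"
    "c": rng.normal(size=n),                          # independent noise
})

corr = df.corr(numeric_only=True).abs()
# Keep only the upper triangle so that each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs > 0.8]
print(high)
```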

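2 Feature engineering

Feature engineering is an important step in machine learning; here we process the data according to what the exploration showed.

2.1 Data cleaning

The data cleaning is done indirectly in the steps below, so there is no separate cleaning step here

2.2 Missing-value handling

Missing values in ordinary numeric features

#For these numeric features (2), fill missing values with 0, meaning "not present"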
num_misfeatures=['GarageYrBlt','MasVnrArea']
for num_misfeature in num_misfeatures:
    train[num_misfeature]=train[num_misfeature].fillna(0)

Missing values in ordinary categorical features

#For these categorical features (15), fill missing values with 'None', meaning "not present"
ca_misfeatures=['PoolQC','MiscFeature','Alley','Fence','FireplaceQu','MasVnrType','GarageCond',
                'BsmtExposure','BsmtCond','BsmtQual','BsmtFinType2','BsmtFinType1','GarageQual','GarageType','GarageFinish']
for ca_misfeature in ca_misfeatures:
    train[ca_misfeature]=train[ca_misfeature].fillna('None')

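Missing values handled with domain knowledge

#For this feature (1), fill missing values with the mode
lots_features=['Electrical']
for lots_feature in lots_features:
    train[lots_feature]=train[lots_feature].fillna(train[lots_feature].mode()[0])
#For this feature (1), fill missing values with the median of the same Neighborhood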
train['LotFrontage']=train.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))
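
To make the group-wise fill concrete, here is the same `groupby(...).transform(...)` pattern on a toy frame (the neighborhood names and values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 40.0, np.nan],
})

# Each missing value is replaced by the median of its own neighborhood.
df["LotFrontage"] = (
    df.groupby("Neighborhood")["LotFrontage"]
      .transform(lambda s: s.fillna(s.median()))
)
print(df["LotFrontage"].tolist())
```

Note that if every value in a group is missing, the group median is also missing and those rows stay unfilled.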

2.3 Numeric feature processing

#Take the log of the target to bring it closer to a normal distribution
train_y=train.SalePrice
train_y=np.log(train_y)
plt.hist(train_y)
plt.show()

After the log transform, the target roughly follows a normal distribution
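
Since the models are then trained on log prices, their predictions are also on the log scale and need `np.exp` to get back to prices. A short sketch with made-up prices:

```python
import numpy as np

prices = np.array([100000.0, 200000.0, 350000.0])  # illustrative values
log_prices = np.log(prices)      # the scale the model is trained on
recovered = np.exp(log_prices)   # back to the original price scale
print(recovered)
```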

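2.4 Categorical feature processing

#Categorical feature processing
#Convert numeric-coded categorical columns to strings so that the LabelEncoder below treats them as categories
num_ca_features=['MSSubClass','OverallCond','YrSold','MoSold']
for num_ca_feature in num_ca_features:
    train[num_ca_feature]=train[num_ca_feature].apply(str)
#LabelEncoder: encode the text categories as integers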
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
one_hot_features=['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
       'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional',
       'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
       'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType',
       'SaleCondition','MSSubClass','OverallCond','YrSold','MoSold']
for one_hot_feature in one_hot_features:
    lbl=LabelEncoder()
    train[one_hot_feature] = lbl.fit_transform(train[one_hot_feature].values)
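#OneHotEncoder: one-hot encode each label-encoded column
from scipy import sparse
#Assumption: the original never initializes train_x; here it starts from the remaining numeric columns
#(Id and the target SalePrice are excluded; any column that is still non-numeric is skipped)
num_features=[c for c in train.select_dtypes(include=[np.number]).columns if c not in one_hot_features+['Id','SalePrice']]
train_x=sparse.csr_matrix(train[num_features].values.astype(float))
for one_hot_feature in one_hot_features:
    enc=OneHotEncoder()
    train_a= enc.fit_transform(train[one_hot_feature].values.reshape(-1,1))
    train_x=sparse.hstack((train_x,train_a))
print('one-hot prepared!')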

The steps above clean the data indirectly, so no separate data-cleaning step is needed
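
The loop above fits a separate encoder per column on the training data only. If the same encoding later has to be applied to the competition test set, one option (not what this post does) is to fit a single `OneHotEncoder` on the training columns and reuse it, with `handle_unknown="ignore"` so that categories unseen in training become all-zero rows. A sketch on toy frames whose values are made up:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_df = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                         "CentralAir": ["Y", "N", "Y"]})
# "Dirt" never appears in the training data.
test_df = pd.DataFrame({"Street": ["Pave", "Dirt"],
                        "CentralAir": ["N", "Y"]})

enc = OneHotEncoder(handle_unknown="ignore")
x_tr = enc.fit_transform(train_df)   # learn the categories from train
x_te = enc.transform(test_df)        # same columns, same order, for test
print(x_tr.shape, x_te.shape)
print(x_te.toarray())
```

Both matrices end up with the same number of columns, which is what a fitted model needs at prediction time.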

3 Model building and evaluation

Here we use ridge regression, ordinary least squares, and Lasso regression

Import the packages:

#Import the packages
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeCV,LassoCV,LinearRegression
from sklearn.metrics import mean_squared_error

Split into training and test sets

#Train/test split
x_train,x_test,y_train,y_test=train_test_split(train_x,train_y,random_state=42,test_size=0.33)

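In the competition we would need to predict on the provided test set and submit the results. Here I only split the training set to obtain a held-out test set and evaluate the models on it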

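Ridge regression

#Ridge regression: RidgeCV picks alpha from the candidates by cross-validation
alp=[0.01,0.1,1,10,100]
ridge=RidgeCV(alphas=alp)
ridge.fit(x_train,y_train)
pre=ridge.predict(x_test)
#mean_squared_error returns the MSE (on log prices), not the RMSE
print('the mse of Ridge is :',mean_squared_error(y_test,pre))
print ('alpha is:', ridge.alpha_)

The result: the mse of Ridge is : 0.0159712118854 alpha is: 10.0 (an MSE on log prices, i.e. an RMSE of about 0.126)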

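Ordinary least squares

#Ordinary least squares
lr=LinearRegression()
lr.fit(x_train,y_train)
pre=lr.predict(x_test)
#mean_squared_error returns the MSE (on log prices), not the RMSE
print('the mse of LinearRegression is :',mean_squared_error(y_test,pre))

The result: the mse of LinearRegression is : 0.0198161704078 (an RMSE of about 0.141)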

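Lasso regression (LassoCV)

#LassoCV: alpha is chosen by cross-validation
lasso=LassoCV()
lasso.fit(x_train,y_train)
pre=lasso.predict(x_test)
#mean_squared_error returns the MSE (on log prices), not the RMSE
print('the mse of Lasso is :',mean_squared_error(y_test,pre))
print ('alpha is:', lasso.alpha_)

The result: the mse of Lasso is : 0.0315603608941 alpha is: 1.14726469554 (an RMSE of about 0.178)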
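
`mean_squared_error` returns the mean squared error; taking its square root gives the RMSE. Because the target was log-transformed, both are measured on log prices. A minimal sketch with made-up true and predicted prices:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only; both arrays are on the log-price scale.
y_true = np.log(np.array([100000.0, 200000.0, 300000.0]))
y_pred = np.log(np.array([110000.0, 190000.0, 300000.0]))

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)   # root of the MSE, same units as the log target
print(f"MSE: {mse:.6f}  RMSE: {rmse:.6f}")
```

Either metric ranks the three models the same way, since the square root is monotonic.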