Kaggle House Price Prediction

This article walks through a complete Kaggle house-price prediction workflow: data exploration, data cleaning, feature engineering, and model building. The target is log-transformed to handle its skewed distribution, unimportant features are dropped, missing values are filled or removed, categorical features are encoded, and numeric variables are standardized. An ElasticNet model is tuned via cross-validation and then blended with a random forest, reaching a final prediction score of 28%.
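The model stage described above (ElasticNet with cross-validated parameters, blended with a random forest) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the article's actual code; the variable names and the 50/50 blend weight are assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the engineered house features
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# ElasticNetCV selects alpha (and tries several l1_ratio values) by internal CV
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Simple average blend of the two models' predictions
blend = 0.5 * enet.predict(X_te) + 0.5 * rf.predict(X_te)
print(blend[:3])
```

Equal weights are the simplest choice; in practice the blend weight is itself worth tuning on a validation fold.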


I. Data Observation

This is another classic Kaggle problem. Let's start by looking at the data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import ensemble, tree, linear_model
from sklearn.ensemble import RandomForestRegressor
%matplotlib inline

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv') 
train_data.shape,test_data.shape

The training set has 1460 samples and 81 features, while the test set has 1459 samples and 80 features (SalePrice is the target to be predicted, so it is absent from the test set).

First, we extract the sale prices from the training set to serve as the dependent variable for model training:

train_y = train_data.pop('SalePrice')
y_plot = sns.distplot(train_y)

While we're at it, let's look at the distribution curve of y:


The data are clearly right-skewed. To bring the distribution closer to the assumptions our statistical inference relies on, we apply a log transform:

train_y = np.log(train_y)
y_plot = sns.distplot(train_y)

The dependent variable is now approximately normally distributed.
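Beyond eyeballing the distribution plot, the effect of the log transform can be quantified by comparing skewness before and after. A minimal sketch, using synthetic lognormal "prices" standing in for SalePrice (the real data would be used in place of `prices`):

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices" (lognormal), standing in for SalePrice
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12, sigma=0.4, size=1460)

# Skewness near 0 indicates an approximately symmetric distribution
print(f"skewness before log: {skew(prices):.3f}")
print(f"skewness after  log: {skew(np.log(prices)):.3f}")
```

The same check is often applied to every numeric feature, log-transforming those whose skewness exceeds some threshold.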


Since there are far too many features to examine one by one, we pick a few of them to plot against the sale price.

1. The relationship between YearBuilt and SalePrice

var = 'YearBuilt'
# train_y has already been log-transformed; undo the log so the plot shows prices
# in dollars (the Series keeps its original name 'SalePrice' after pop())
data = pd.concat([np.exp(train_y), train_data[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000)
plt.xticks(rotation=90);



We can see that houses built in the last thirty years tend to sell for noticeably higher prices.
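Instead of inspecting boxplots one feature at a time, numeric features can also be ranked by their correlation with the target. A sketch on a tiny synthetic frame (the column names mirror the Kaggle data, but the values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the Kaggle frame: price rises with YearBuilt
rng = np.random.default_rng(1)
year = rng.integers(1900, 2011, size=200)
price = 1000 * (year - 1890) + rng.normal(0, 20000, size=200)
df = pd.DataFrame({
    'YearBuilt': year,
    'GrLivArea': rng.normal(1500, 400, size=200),  # unrelated noise column
    'SalePrice': price,
})

# Rank numeric features by their correlation with the target
corr = df.corr(numeric_only=True)['SalePrice'].sort_values(ascending=False)
print(corr)
```

On the real data, the top of this ranking is a quick shortlist of candidate features worth plotting in detail.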
