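The dataset is available at the following URL:
Preface:
This article is split into two parts, since a single long post is inconvenient to read. It mainly covers the data analysis steps for this dataset: data preprocessing, data mining, and choosing a suitable model to solve the problem. My ability is limited, so treat this as a reference only. The feature descriptions are in English; for those who need them, they are listed below: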
SalePrice: The property's sale price in dollars. This is the target variable to predict.
- MSSubClass: Identifies the type of dwelling involved in the sale
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- BedroomAbvGr: Number of bedrooms above basement level
- KitchenAbvGr: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: Dollar value of miscellaneous feature
- MoSold: Month sold
- YrSold: Year sold
- SaleType: Type of sale
- SaleCondition: Condition of sale
First, import the packages we need:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam
Load the data:
# The rest of the article refers to the training frame as train_data
train_data = pd.read_csv('C:/Users/Desktop/大数据练习/房价预测/train.csv')
test_data = pd.read_csv('C:/Users/Desktop/大数据练习/房价预测/test.csv')
To start, let's take a look at an overview of the training set:
train_data.describe()
train_data.info()
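There are many null values, so they need to be handled below; otherwise the accuracy of the results will be poor.
Next, let's look at how the price is distributed:
sns.histplot(train_data['SalePrice'], kde=True)  # distplot is deprecated in recent seaborn releases
The price distribution looks reasonable, though it is slightly skewed.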
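To put a number on "slightly skewed", one option is to compute the sample skewness and see how a log transform changes it. The sketch below uses synthetic log-normal values as a stand-in for SalePrice (an assumption for illustration, not the real column), and the log transform is a common remedy rather than something this article applies.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal); NOT the real SalePrice column
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice")

skew_before = prices.skew()      # positive for a right-skewed distribution
log_prices = np.log1p(prices)    # log(1 + x) compresses the long right tail
skew_after = log_prices.skew()

print(f"skew before: {skew_before:.3f}, after log1p: {skew_after:.3f}")
```

If a model is trained on the log-transformed target, its predictions would need to be mapped back with np.expm1 before being read as prices.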
Next we handle the missing values. First, count them per column:
total = train_data.isnull().sum().sort_values(ascending=False)
percent = (train_data.isnull().sum()/train_data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
Now drop or fill the null values:
# Drop features that are mostly missing and carry little value
train_data = train_data.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature', 'FireplaceQu'])
# Fill numeric columns with the column mean
# (assigning the result back avoids chained inplace fillna, which newer pandas versions warn about)
for col in ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']:
    train_data[col] = train_data[col].fillna(train_data[col].mean())
# Fill categorical columns with their most frequent value
mode_cols = ['MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
             'BsmtFinType2', 'Electrical', 'GarageType', 'GarageFinish', 'GarageQual',
             'GarageCond']
for col in mode_cols:
    train_data[col] = train_data[col].fillna(train_data[col].value_counts().index[0])
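The article fills only the training frame. If the same treatment is later needed for test.csv, one reasonable approach is to learn the fill values from the training set and reuse them on both sets. The sketch below shows the idea on two tiny hypothetical frames; the values are made up and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train.csv / test.csv (hypothetical values)
train = pd.DataFrame({
    "LotFrontage": [60.0, 80.0, np.nan, 70.0],
    "GarageType": ["Attchd", "Detchd", "Attchd", np.nan],
})
test = pd.DataFrame({
    "LotFrontage": [np.nan, 50.0],
    "GarageType": [np.nan, "Detchd"],
})

# Learn the fill values from the training set only
fill_values = {
    "LotFrontage": train["LotFrontage"].mean(),         # numeric -> mean
    "GarageType": train["GarageType"].mode().iloc[0],   # categorical -> most frequent
}

# Apply the same values to both sets
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
print(test_filled)
```

Reusing the training statistics keeps the test set from influencing preprocessing, which matters once the model is evaluated.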
Pick out the categorical columns and convert them to the category dtype:
cols = [
'MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual',
'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
'Functional', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive',
'SaleType', 'SaleCondition']
train_data[cols] = train_data[cols].astype('category')
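Next we encode these columns. For reference, the scikit-learn documentation describes the encoder as follows:
class sklearn.preprocessing.OrdinalEncoder(*, categories='auto', dtype=<class 'numpy.float64'>)
Encode categorical features as an integer array. The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.
fit(X[, y]) | Fit the OrdinalEncoder to X. |
fit_transform(X[, y]) | Fit to data, then transform it. |
get_params([deep]) | Get parameters for this estimator. |
inverse_transform(X) | Convert the data back to the original representation. |
set_params(**params) | Set the parameters of this estimator. |
transform(X) | Transform X to ordinal codes. |
Generate the codes, which makes the data easier to inspect: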
encoder = OrdinalEncoder()
encoder.fit(train_data[cols])
train_data[cols] = encoder.transform(train_data[cols])
train_data
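One detail worth knowing about the default settings: with categories='auto', OrdinalEncoder sorts the categories, so string labels are coded in alphabetical order rather than in any quality order. The sketch below illustrates this on a toy quality column and shows how an explicit order could be passed instead. The toy values and the worst-to-best ordering are assumptions for illustration; the article itself uses the default.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy quality ratings (hypothetical sample of a column such as KitchenQual)
X = np.array([["Ex"], ["Gd"], ["TA"], ["Fa"], ["Gd"]])

# Default: categories are sorted, so codes follow alphabetical order
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(X).ravel()
print(default_enc.categories_[0])
print(default_codes)

# Explicit order, worst to best, so that larger codes mean better quality
quality_order = [["Po", "Fa", "TA", "Gd", "Ex"]]
ordered_enc = OrdinalEncoder(categories=quality_order)
ordered_codes = ordered_enc.fit_transform(X).ravel()
print(ordered_codes)

# inverse_transform maps the codes back to the original labels
restored = ordered_enc.inverse_transform(ordered_codes.reshape(-1, 1))
```

With the default encoding, "Ex" receives the smallest code, so a quality column can end up negatively correlated with price even though better quality means a higher price.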
Now run the correlation analysis:
# Look for correlations
corr_matrix = train_data.corr()  # Pearson correlation coefficient
corr_values = corr_matrix["SalePrice"].sort_values(ascending=False)
top_10 = corr_values[:10]
bottom_10 = corr_values[-10:]
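Looking at both ends of the sorted list matters because a strongly negative correlation is as informative as a strongly positive one. An alternative is to rank by absolute value, sketched below on synthetic data. The column names quality, inverse_code and noise are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
quality = rng.integers(1, 11, size=n).astype(float)

# Synthetic frame: one positive driver, one negatively coded copy, one pure-noise column
toy = pd.DataFrame({
    "quality": quality,
    "inverse_code": 10 - quality + rng.normal(scale=0.5, size=n),
    "noise": rng.normal(size=n),
    "SalePrice": 20000 * quality + rng.normal(scale=10000, size=n),
})

# Pearson correlation of every feature with the target, target itself excluded
corr_with_target = toy.corr()["SalePrice"].drop("SalePrice")

# Rank by strength regardless of sign
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Here inverse_code is negatively correlated with the target yet ranks far above the noise column once the sign is ignored.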
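Based on the ranking above, select the best columns:
# Keep the columns most strongly correlated with SalePrice, plus the target itself
best_cols = [
'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', 'TotalBsmtSF', '1stFlrSF',
'GarageFinish', 'KitchenQual', 'BsmtQual', 'ExterQual', 'SalePrice'
]
train_data = train_data[best_cols].copy()  # copy so later column assignments modify this frame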
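For data cleaning we take the 3σ rule as the reference point:
The rule is built on equal-precision repeated measurements that follow a normal distribution, whereas the interference or noise that produces anomalous data rarely follows a normal distribution.
If the absolute value of the residual error of a measurement in a set satisfies |ν_i| > 3σ, that measurement is a bad value and should be discarded. An error of ±3σ is usually taken as the limit error: for normally distributed random errors, the probability of falling outside ±3σ is only 0.27%, which is very unlikely to happen in a finite number of measurements, hence the 3σ rule. It is the most common and the simplest criterion for detecting gross errors, and it is generally applied when the number of measurements is sufficiently large (n ≥ 30), or for a rough judgment when n > 10. Note that the filtering code that follows uses a quantile cutoff rather than a literal 3σ test.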
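For comparison, a literal implementation of the 3σ rule could look like the sketch below. The helper name three_sigma_mask and the synthetic column are assumptions for illustration; this is not the filter the article applies.

```python
import numpy as np
import pandas as pd

def three_sigma_mask(values: pd.Series) -> pd.Series:
    """Return True where |x - mean| <= 3 * std, i.e. where the value is kept."""
    residual = (values - values.mean()).abs()
    return residual <= 3 * values.std()

# Synthetic, roughly normal "area" column with one planted outlier
rng = np.random.default_rng(1)
area = pd.Series(rng.normal(loc=1500, scale=300, size=500))
area.iloc[0] = 6000.0

mask = three_sigma_mask(area)
cleaned = area[mask]
print(f"kept {len(cleaned)} of {len(area)} rows")
```

Because the mean and standard deviation are themselves inflated by outliers, the rule works best when outliers are few, which matches the sample-size caveat above.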
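# Remove outliers with a quantile cutoff: for each column, keep only rows at or below the 75th percentile, dropping data that could distort the results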
train_data = train_data[train_data['GrLivArea'] <= train_data['GrLivArea'].quantile(0.75)]
train_data = train_data[train_data['GarageArea'] <= train_data['GarageArea'].quantile(0.75)]
train_data = train_data[train_data['TotalBsmtSF'] <= train_data['TotalBsmtSF'].quantile(0.75)]
train_data = train_data[train_data['1stFlrSF'] <= train_data['1stFlrSF'].quantile(0.75)]
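Each of these filters keeps only rows at or below the third quartile, so every step drops roughly a quarter of the rows that remain. A more conservative quartile-based convention is the 1.5 × IQR fence. The sketch below compares the two on synthetic data; the fence is offered as an alternative, not as what the article does.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a column such as GrLivArea
rng = np.random.default_rng(2)
df = pd.DataFrame({"GrLivArea": rng.normal(1500, 400, size=1000)})

q1 = df["GrLivArea"].quantile(0.25)
q3 = df["GrLivArea"].quantile(0.75)
iqr = q3 - q1

# Cutoff at the third quartile: discards the top quarter of the rows
kept_q75 = df[df["GrLivArea"] <= q3]

# Upper IQR fence: discards only values far above the bulk of the data
upper_fence = q3 + 1.5 * iqr
kept_iqr = df[df["GrLivArea"] <= upper_fence]

print(len(kept_q75), len(kept_iqr))
```

Which cutoff is appropriate depends on how much training data one is willing to give up in exchange for a tighter value range.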
Standardize the data. For details, see am_student's CSDN blog post "Data processing: the scipy.stats.norm function and the purpose of normalization":
cols = ['GrLivArea', 'GarageArea', 'TotalBsmtSF', '1stFlrSF']
standard_scaler = StandardScaler()
train_data[cols] = standard_scaler.fit_transform(train_data[cols])
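What StandardScaler computes per column is z = (x - mean) / std, using the population standard deviation. The small check below reproduces it by hand on made-up values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up column of areas
X = np.array([[1000.0], [1500.0], [2000.0], [2500.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score; np.std defaults to ddof=0, matching StandardScaler
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```

After the transform each column has mean 0 and unit variance, which is why describe() on the scaled columns shows means near zero.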
At this point, let's check what state the data is in, then separate the features from the target:
train_data.describe()
X = train_data.drop('SalePrice',axis=1).values
y = train_data['SalePrice'].values
Split the data into a training set and a test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Check the shapes of the training and test data:
print("X_train - ",X_train.shape)
print("y_train - ",y_train.shape)
print("X_test - ",X_test.shape)
print("y_test - ",y_test.shape)
Scale the features to the [0, 1] range with MinMaxScaler:
scaler = MinMaxScaler()
X_train= scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
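The scaler is fitted on the training split only and then reused on the test split. One consequence, shown in the sketch below with made-up numbers, is that test values outside the training range map outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature splits
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[5.0], [25.0], [40.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min=10, max=30
X_test_scaled = scaler.transform(X_test)        # reuses the training min and max

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

This is expected behaviour: refitting the scaler on the test set would leak information about it into preprocessing.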
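Preprocessing is done; next comes the modeling. Let's take a break here and pick it up tomorrow.
Because of length, the models will be laid out in detail in part two, which is being finished as quickly as possible.
You can also follow updates on the official account, so we can learn and discuss together.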