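Background: a home buyer wants to purchase their dream house, and the task is to predict each house's sale price from the 79 variables that describe it.
The work is divided into the following steps:
- Load the data and examine what each variable means and how strongly it relates to the sale price
- Select the variables that mainly influence the sale price
- Clean and transform the variables
- Test and output the results
The data files can be downloaded from the official competition page: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data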
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')
data_train.sample(3)
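def drop_low_colums(df):
    # Return the numeric columns whose correlation with SalePrice is below 0.5
    data_corr = df.select_dtypes(include='number').corr()
    d_list = data_corr[data_corr.SalePrice < 0.5].index.tolist()
    return d_list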
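To make the correlation filter concrete, here is a minimal sketch on a made-up frame. The helper name `low_corr_columns`, its `target` and `threshold` parameters, and the toy values are assumptions for illustration only; they are not part of the notebook or the dataset. Text columns are excluded before `corr()` is called, so only numeric columns can end up in the result.

```python
import pandas as pd

def low_corr_columns(df, target="SalePrice", threshold=0.5):
    """Hypothetical helper: numeric columns whose Pearson correlation
    with `target` is below `threshold`."""
    corr = df.select_dtypes(include="number").corr()[target]
    return corr[corr < threshold].index.tolist()

# Toy data (illustrative values, not taken from train.csv)
toy = pd.DataFrame({
    "SalePrice": [100, 150, 200, 250, 300],
    "GrLivArea": [800, 1000, 1500, 1700, 2100],  # rises with price
    "Id": [3, 1, 5, 2, 4],                       # weakly related to price
    "Street": ["a", "b", "a", "b", "a"],         # text column, never reaches corr()
})

low = low_corr_columns(toy)
print(low)
```

Note that the comparison is on the signed correlation, so a strongly negative predictor would also be listed; comparing `corr.abs()` instead would be one way to keep such columns, if that is the intent.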
data_train=data_train.drop(['YearBuilt','1stFlrSF'], axis=1)
data_test=data_test.drop(['YearBuilt','1stFlrSF'], axis=1)
data_drop_train=data_train.drop(['Alley','PoolQC','Fence','MiscFeature'], axis=1)
data_drop_test=data_test.drop(['Alley','PoolQC','Fence','MiscFeature'], axis=1)
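The notebook does not say why these four columns are dropped; a common reason for removing a column outright is that most of its values are missing. The sketch below shows one way to check that. The helper `missing_ratio`, the 0.5 cutoff, and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def missing_ratio(df):
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

# Toy data (illustrative values, not taken from train.csv)
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})

ratios = missing_ratio(toy)
# Columns where more than half of the rows are missing (cutoff is an assumption)
mostly_missing = ratios[ratios > 0.5].index.tolist()
print(mostly_missing)
```

Running `missing_ratio(data_train).head(10)` on the real training frame would show which columns are candidates for removal.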
drop_list=drop_low_colums(data_drop_train)
data_drop_train1=data_drop_train.drop(drop_list, axis=1)
data_drop_test1=data_drop_test.drop(drop_list, axis=1)
# Fill missing values with the string '0'; a numeric column that contains NaN becomes object dtype
data_drop_train1=data_drop_train1.fillna('0')
data_drop_test1=data_drop_test1.fillna('0')
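Filling every column with the string '0' turns a numeric column that has missing values into an object column. If that is not wanted, one alternative is to choose the fill value by dtype. This is a sketch of that alternative, not the notebook's method; `fill_by_dtype` is a hypothetical helper and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

def fill_by_dtype(df):
    """Hypothetical helper: fill numeric columns with 0 and all other
    columns with the string '0', leaving numeric dtypes intact."""
    values = {
        col: (0 if pd.api.types.is_numeric_dtype(df[col]) else "0")
        for col in df.columns
    }
    return df.fillna(value=values)

# Toy data (illustrative values, not taken from the dataset)
toy = pd.DataFrame({
    "GarageArea": [548.0, np.nan, 608.0],
    "GarageType": ["Attchd", np.nan, "Detchd"],
})

filled = fill_by_dtype(toy)
print(filled.dtypes)
```

Whether 0 is a sensible value for a missing numeric feature depends on the column; a median or a separate "missing" indicator are other common choices.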
from sklearn imp