This example is the house price prediction competition on Kaggle:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
The code is as follows:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#-------------------Step 1: Read the data---------------------
train_df=pd.read_csv('C:\\Users\\Administrator\\Desktop\\network\\House_prise\\train.csv')
test_df=pd.read_csv('C:\\Users\\Administrator\\Desktop\\network\\House_prise\\test.csv')
# Show the first 5 rows of each data set
print(train_df.head())
print(test_df.head())
'''
Id MSSubClass MSZoning ... SaleType SaleCondition SalePrice
0 1 60 RL ... WD Normal 208500
1 2 20 RL ... WD Normal 181500
2 3 60 RL ... WD Normal 223500
3 4 70 RL ... WD Abnorml 140000
4 5 60 RL ... WD Normal 250000
[5 rows x 81 columns]
Id MSSubClass MSZoning ... YrSold SaleType SaleCondition
0 1461 20 RH ... 2010 WD Normal
1 1462 20 RL ... 2010 WD Normal
2 1463 60 RL ... 2010 WD Normal
3 1464 60 RL ... 2010 WD Normal
4 1465 120 RL ... 2010 WD Normal
[5 rows x 79 columns]
'''
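#-----------------------------------Step 2: Combine the data---------------------------------
# Combining the training and test sets lets us preprocess both in one pass and split them apart again afterwards.
# SalePrice, the last column of the training set, is the prediction target, so it appears only in the training set;
# we will take it out first as y_train.
# First, look at how SalePrice is distributed, both raw and after log(price + 1).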
prices=pd.DataFrame({'Prices':train_df['SalePrice'],'log(price+1)':np.log1p(train_df['SalePrice'])})
# hist() draws one histogram per column; printing its return value would only show the Axes objects
prices.hist()
plt.show()
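The plan described in the Step 2 comments (take SalePrice out as y_train, then process the training and test features together) could be sketched roughly as below. This is a minimal illustration, not the original author's code: `toy_train`, `toy_test`, `all_df`, `train_part`, and `test_part` are hypothetical names, and the two small frames are made-up stand-ins for train.csv and test.csv so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train.csv / test.csv (not the real Kaggle data).
toy_train = pd.DataFrame({
    "Id": [1, 2, 3],
    "MSSubClass": [60, 20, 60],
    "SalePrice": [208500, 181500, 223500],
})
toy_test = pd.DataFrame({
    "Id": [1461, 1462],
    "MSSubClass": [20, 20],
})

# Take the target out of the training frame and apply log(1 + x) to reduce its skew.
# pop() removes the column in place, so toy_train now holds features only.
y_train = np.log1p(toy_train.pop("SalePrice"))

# Stack training and test rows so that preprocessing is written once for both.
all_df = pd.concat((toy_train, toy_test), axis=0)

# ... shared preprocessing of all_df would go here ...

# Split back by position: the first len(toy_train) rows came from the training set.
n_train = len(toy_train)
train_part = all_df.iloc[:n_train]
test_part = all_df.iloc[n_train:]

# expm1 is the inverse of log1p; predictions made on the log scale need it
# to be converted back to prices.
recovered_prices = np.expm1(y_train)
```

Splitting by row position relies on the concatenation order, so any step that reorders or drops rows of `all_df` would need a different way to tell the two sets apart.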