A while ago I tried the House Prices competition on Kaggle. It is a regression problem: by analyzing a given training set, we predict the sale prices of the houses in the test set. The data consist of 79 house features, such as the number of bedrooms or whether the house fronts a street. During preprocessing, the numeric features are standardized, while the string-valued features are encoded with pd.get_dummies(), i.e. each category is replaced by 0/1 indicator columns, so that all features can be handled uniformly. The code is as follows:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from mxnet import autograd,gluon,init,nd
from mxnet.gluon import loss as gloss,data as gdata,nn
import gluonbook as gb
train_data = pd.read_csv(r'D:\data\train.csv')
test_data = pd.read_csv(r'D:\data\test.csv')
# Data preprocessing
all_features = pd.concat([train_data.iloc[:,1:-1],test_data.iloc[:,1:]],axis = 0,ignore_index= True)
numeric_index = all_features.dtypes[all_features.dtypes != 'object'] # numeric_index is a pandas Series
all_features[numeric_index.index] = all_features[numeric_index.index].apply(lambda x: (x - x.mean())/(x.std()))
all_features = all_features.fillna(all_features.mean()) # after standardization the mean is 0, so missing numeric values are filled with 0
# Create indicator features
all_features = pd.get_dummies(all_features,dummy_na = True) # turns string-valued features into 0/1 indicator columns, growing the feature count from 79 to 331 (see the toy example below)
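# A toy example (my own, not from the dataset) of what get_dummies does:
# pd.get_dummies(pd.DataFrame({'MSZoning': ['RL', 'RM', None]}), dummy_na=True)
# ->
#    MSZoning_RL  MSZoning_RM  MSZoning_nan
# 0            1            0             0
# 1            0            1             0
# 2            0            0             1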
# Extract the training and test data
n_train = train_data.shape[0]
train_features = nd.array(all_features[:n_train].values)
test_features = nd.array(all_features[n_train:].values)
train_labels = nd.array(train_data.iloc[:,-1]).reshape((-1,1)) # shape (n, 1)
# Build and train the model
loss = gloss.L2Loss()
def get_net():
    net = nn.Sequential()
    net.add(nn.Dense(256, activation='relu'), nn.Dense(1))
    net.initialize()
    return net
# The log root-mean-square error required by the competition
def log_rmse(net, train_features, train_labels):
    clipped_pred = nd.clip(net(train_features), 1, float('inf')) # clamp predictions to at least 1 so the log is well defined
    rmse = nd.sqrt(2 * loss(clipped_pred.log(), train_labels.log()).mean()) # the factor 2 cancels the 1/2 in Gluon's L2Loss
    return rmse.asscalar()
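# Why multiply by 2: Gluon's L2Loss is defined as (1/2)(y_hat - y)^2, so
# 2 * mean(loss) recovers the mean squared error of the logs and its square
# root is the competition metric. A quick check (illustrative, my own):
print(2 * loss(nd.array([2.0]), nd.array([5.0])))   # [9.] == (2 - 5)^2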
# Training function
# Learning-rate decay and a dropout layer could be added here (see the sketch after this function)
def train(net, train_features, train_labels, test_features, test_labels, num_epochs, learning_rate, weight_decay, batch_size):
    train_l, test_l = [], []
    train_iter = gdata.DataLoader(gdata.ArrayDataset(train_features, train_labels), batch_size, shuffle=True) # read the data in mini-batches
    # Use the Adam optimizer
    trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': learning_rate, 'wd': weight_decay})
    for epoch in range(num_epochs):
        for X, y in train_iter:
            with autograd.record():
                l = loss(net(X), y)
            l.backward()
            trainer.step(batch_size)
        train_l.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_l.append(log_rmse(net, test_features, test_labels))
    return train_l, test_l
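# Sketch of the two extensions mentioned above (my own illustration; the name
# drop_prob and the decay schedule are hypothetical, not part of the original code):
def get_net_with_dropout(drop_prob=0.2):
    net = nn.Sequential()
    net.add(nn.Dense(256, activation='relu'),
            nn.Dropout(drop_prob),   # randomly zeroes activations during training
            nn.Dense(1))
    net.initialize()
    return net
# Inside the epoch loop of train(), the learning rate could be halved every 10 epochs:
#     if epoch > 0 and epoch % 10 == 0:
#         trainer.set_learning_rate(trainer.learning_rate * 0.5)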
# K-fold cross-validation (build the training and validation folds)
def get_k_fold_data(k, i, x, y):
    assert k > 1
    fold_size = x.shape[0] // k
    x_train, y_train = None, None
    for j in range(k):
        ix = slice(j * fold_size, (j + 1) * fold_size) # a slice object selecting the j-th fold (a handy use of slice)
        x_part, y_part = x[ix, :], y[ix]
        if j == i:
            x_valid, y_valid = x_part, y_part
        elif x_train is None:
            x_train, y_train = x_part, y_part # first assignment to x_train, y_train
        else:
            x_train = nd.concat(x_train, x_part, dim=0) # merge the remaining (k-1) folds into the training set
            y_train = nd.concat(y_train, y_part, dim=0)
    return x_train, y_train, x_valid, y_valid
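# Quick sanity check (illustrative): with k=3 and the 1460 training rows of this
# dataset, each validation fold has 1460 // 3 = 486 rows and the training part
# keeps the remaining 2 * 486 = 972 rows.
x_tr, y_tr, x_va, y_va = get_k_fold_data(3, 0, train_features, train_labels)
print(x_tr.shape, x_va.shape)   # e.g. (972, 331) (486, 331)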
# Return the average training and validation errors over the k folds
def k_fold(k, x_train, y_train, num_epochs, learning_rate, weight_decay, batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, x_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate, weight_decay, batch_size)
        train_l_sum += train_ls[-1] # only the last epoch's loss is needed: it is the loss after the parameters w, b have been optimized
        valid_l_sum += valid_ls[-1]
        if i == 0:
            # for the parameters of gb.semilogy, see the chapter on model selection, underfitting and overfitting
            gb.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'rmse',
                        range(1, num_epochs + 1), valid_ls, ['train', 'valid'])
        print('fold %d, train rmse: %f, valid rmse: %f' % (i, train_ls[-1], valid_ls[-1]))
    return train_l_sum / k, valid_l_sum / k
k, num_epochs, lr, weight_decay, batch_size = 3, 30, 0.1, 20, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs,lr,weight_decay, batch_size)
print('%d-fold validation: avg train rmse: %f, avg valid rmse: %f'% (k, train_l, valid_l))
After K-fold cross-validation we can watch how the training and validation errors evolve over the epochs, and tune the hyperparameters accordingly.
Next, retrain on the entire training set and generate the .csv file required for the competition submission:
# Prediction function (no K-fold splits here: the model is retrained on the complete training set, and the predictions are saved in csv format)
def train_and_pred(train_features, test_features, train_labels, test_data, num_epochs, lr, weight_decay, batch_size):
    net = get_net()
    train_ls, _ = train(net, train_features, train_labels, None, None, num_epochs, lr, weight_decay, batch_size)
    gb.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'rmse')
    print('train rmse %f' % train_ls[-1])
    preds = net(test_features).asnumpy()
    test_data['SalePrice'] = pd.Series(preds.reshape((-1, 1))[:, 0])
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    submission.to_csv(r'D:\data\submission.csv', index=False)
train_and_pred(train_features, test_features, train_labels, test_data, num_epochs, lr, weight_decay, batch_size)
This completes the basic framework; what remains is parameter tuning. One can add layers or nodes to the network, lower the learning rate, increase weight_decay, or try adding dropout layers. This entry-level competition is meant to build practical modeling experience and give a rough feel for data preprocessing, model building, and parameter tuning.
Notes:
1) np.clip():
numpy.clip(a, a_min, a_max, out=None) clamps the values of a to the range [a_min, a_max]; nd.clip used in the code above is MXNet's counterpart. For details see: https://blog.youkuaiyun.com/HHTNAN/article/details/79799612
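For instance (a toy example of my own):
import numpy as np
np.clip(np.array([-1.0, 0.5, 2.0, 10.0]), 1, 5)   # -> array([1., 1., 2., 5.])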
2) The * in a function call, for example:
train_ls, valid_ls = train(net, *data, num_epochs, learning_rate, weight_decay, batch_size)
In a call, * unpacks a tuple (or other sequence) into positional arguments, while ** unpacks a dict into keyword arguments.
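A minimal illustration (my own toy example):
def f(a, b, c):
    return a + b + c

args = (1, 2, 3)
f(*args)                        # unpacks the tuple into positional arguments -> 6
f(**{'a': 1, 'b': 2, 'c': 3})   # unpacks the dict into keyword arguments -> 6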
Besides this approach, one can also combine two models, Ridge Regression and Random Forest, and average their predictions to obtain the final value. The rough process is as follows:
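The core of the averaging step might look like the sketch below (my own, assuming scikit-learn and the preprocessed all_features / n_train from the first part; alpha and n_estimators are illustrative, untuned values):
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

X, X_test = all_features[:n_train].values, all_features[n_train:].values
y_log = np.log1p(train_data['SalePrice'].values)  # fit on log prices, as is common for this competition

ridge = Ridge(alpha=15).fit(X, y_log)
rf = RandomForestRegressor(n_estimators=500, max_features=0.3).fit(X, y_log)

# average the two models' predictions, then undo the log transform
final_pred = np.expm1((ridge.predict(X_test) + rf.predict(X_test)) / 2)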
House price prediction case study
Step 1: Inspect the source data
In [5]:
import numpy as np
import pandas as pd
Read in the data
Generally the index column of the source data carries no information as a feature, so we can use it as the index of our pandas DataFrame; that also makes lookups easier later on.
In [6]:
train_df = pd.read_csv('../input/train.csv', index_col=0)
test_df = pd.read_csv('../input/test.csv', index_col=0)
Inspect the data
In [7]:
train_df.head()
Out[7]:
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |