kaggle bike sharing program R code tranlated from python code ranked 9th

本文详细介绍了如何使用R语言中的lubridate、Metrics、caret等包,通过读取并预处理自行车共享需求数据集,进行特征选择、数据转换和构建随机森林模型的过程。重点涵盖了时间序列数据的日期操作、特征工程、模型训练与评估,最终通过RFE(Recursive Feature Elimination)确定最优特征子集,实现对自行车共享需求的预测。
Use Caret to do the feature selection for rf.
train_row <- read.csv('E:/kuaipan/Kaggle Project/Bike Sharing Demand/train.csv')
test_row <- read.csv('e:/kuaipan/Kaggle Project/Bike Sharing Demand/test.csv')

train_row$type = 'train'
test_row$type = 'test'

colnames(train_row)

test_row$casual=NA
test_row$registered=NA
test_row$count=NA

row_df = rbind(train_row, test_row)
row_df_backup = row_df

for(col in c('casual','registered','count'))
{
  newcol = paste(col,'_log',sep="")
  row_df[newcol] = log(row_df[col] + 1)
}

if(!require('lubridate'))
{
  install.packages(lubridate)
}

row_df$dt = ymd_hms(row_df$datetime)
row_df$day = day(row_df$dt)
row_df$month = month(row_df$dt)
row_df$year = year(row_df$dt)
row_df$hour = hour(row_df$dt)
row_df$weekday =wday(row_df$dt)
row_df$week = week(row_df$dt)

for(s in c(1,2,3,4))
{
  row_df[which(row_df$season==s & row_df$type=='train'),'season_count']=sum(row_df[which(row_df$season==s & row_df$type=='train'),'count'])
}

#head(row_df)

SetWorkingDay = function(year,month,day,value){
  row_df[which(row_df$year %in% year & row_df$month %in% month & row_df$day %in% day),'workingday'] = value
  if(value == 1){
    b = 0
  }else{
    b = 1
  }
  row_df[which(row_df$year %in% year & row_df$month %in% month & row_df$day %in% day),'holiday'] = b
    
}

SetWorkingDay(2011,4,15,1)
SetWorkingDay(2012,4,16,1)
SetWorkingDay(2011,11,25,0)
SetWorkingDay(2012,11,23,0)

row_df[which(row_df$year == 2011 &row_df$month ==11 & row_df$day ==25),'holiday']=1
row_df[which(row_df$year == 2012 &row_df$month ==11 & row_df$day ==23),'holiday']=1
row_df[which(row_df$year == 2012 &row_df$month ==5 & row_df$day ==21),'holiday']=1
row_df[which(row_df$year == 2012 &row_df$month ==6 & row_df$day ==1),'holiday']=1
row_df[which(row_df$year == 2012 &row_df$month ==10 & row_df$day ==30),'holiday']=1
row_df[which(row_df$month ==12 & row_df$day %in% c(24,26,31)),'holiday']=1
row_df[which(row_df$month ==12 & row_df$day %in% c(24,31)),'workingday']=1

row_df[which(row_df$workingday==1 & row_df$hour %in% c(8,17,18,12)),'peak']=1
row_df[which(row_df$workingday==0 & 10<=row_df$hour<=19),'peak']=1
row_df[which(is.na(row_df$peak)),'peak']=0

row_df[which(row_df$temp>27 & row_df$windspeed <30),'ideal']=1
row_df[which(is.na(row_df$ideal)),'ideal']=0

row_df[which(row_df$humidity>=60&row_df$workingday==1),'sticky']=1
row_df[which(is.na(row_df$sticky)),'sticky']=0

row_df.train = row_df[which(row_df$type == 'train'),]
row_df.test  = row_df[which(row_df$type == 'test'),]

library(Metrics)
get_rmsle =function(pred, actual){
  rs = rmsle(log(pred+1),log(actual+1))
  sqrt((exp(rs)))
}

library(caret)

#use ten-fold cross validation 
control=rfeControl(functions=rfFuncs, method="cv", number=10)
row_df.features = row_df.train[,-c(10,11,12,14,15,16,1,17,13)]
#Feature dataset and result dataset must be same dataset
#By default the rfe will add a subset contain all the features.
result=rfe(row_df.train[,-c(10,11,12,14,15,16,1,17,13)],row_df.train[,14],size=c(16:17),rfeControl = control) 
plot(result, type=c('p','l'))

#the final rf model is result$fit


简要说明下, 首先用到了一个非常有名的R的包 lubridate ,这个包可以从字符串得到date类型, 并对date类型进行对比.

Metrics包包含计算一些监督学习中常用的值的函数,比如求实际值和观测值的平均方差.

caret包博大精深: 详见 http://blog.youkuaiyun.com/jiabiao1602/article/details/44975741

在这里这段代码用的是随机森林(rfFuncs),外面套了一个十折交叉验证:

method="cv", number=10
最后一个函数是rfe, 这个其实不是用来作预测的算法,而是Recursive Feature Elimination(迭代式特征消除),决定哪些特征是模型中真正有用的特征,予以保留。换句话说就是用十折交叉验证来多次实验,基于随机森林模型的表现确定哪些特征留用,同时确定了模型。



评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

爱知菜

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值