2021-01-07

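This post shows how to use the mlr3 ecosystem in R to preprocess data, including missing-value imputation, and to fit a regression model with the ranger learner. The preprocessing steps are assembled into a pipeline, model performance is estimated with cross-validation, and variable importance is inspected to see which features drive the predictions.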


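The mlr3 gallery is an excellent collection of worked examples. The original post is here:

https://mlr3gallery.mlr-org.com/posts/2020-05-04-moneyball/

The code is simply copied below for easier lookup and reuse later: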

library("mlr3")
library("mlr3learners")
library("mlr3pipelines")
requireNamespace("mlr3measures")
library("mlr3data")
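# load the data and inspect missing values (skim() comes from the skimr package)
data("moneyball", package = "mlr3data")
skimr::skim(moneyball)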

# impute missing values according to column type
imp_num = po("imputehist", param_vals = list(affect_columns = selector_type(c("integer", "numeric"))))
imp_fct = po("imputeoor", param_vals = list(affect_columns = selector_type("factor")))
graph = imp_num %>>% imp_fct
graph$plot()
task = TaskRegr$new(id = "moneyball", backend = moneyball, target = "rs")
task$missings()
# creates a learner
test_lrn = LearnerRegrRanger$new()
# displays the properties
test_lrn$properties
# wrap the learner in a PipeOp so it can follow the imputation steps in a graph;
# a PipeOpLearner behaves like a normal learner but can be combined with other PipeOps.
polrn = PipeOpLearner$new(mlr_learners$get("regr.ranger"))
# sets number of trees to 1000, importance is for later
polrn$param_set$values = list(num.trees = 1000, importance = "permutation")
# the final learner is a graph consisting of the imputer and the normal learner.
lrn = GraphLearner$new(graph = graph %>>% polrn)
# defines the training and testing data; 95% is used for training
train_set = sample(task$nrow, 0.95 * task$nrow)
test_set = setdiff(seq_len(task$nrow), train_set)
# train learner on subset of task
lrn$train(task, row_ids = train_set)
# predict using held out observations
preds = lrn$predict(task, row_ids = test_set)
print(preds)
# evaluate with 10-fold cross-validation
cv10 = rsmp("cv", folds = 10)
r = resample(task, lrn, cv10)
scores = r$score(msrs(c("regr.mae", "regr.mse")))
scores
r$aggregate(msr("regr.mae"))
imp_fct = po("imputeoor", param_vals = list(affect_columns = selector_type("factor")))
graph2 = as_graph(imp_fct)
# keep only the features without missing values, plus the imputed factors
# rankseason and rankplayoffs, then re-evaluate for comparison
feature_names = colnames(moneyball)[!sapply(moneyball, anyNA)]
feature_names = c(
  feature_names[feature_names %in% task$feature_names],
  "rankseason", "rankplayoffs")
na_select = po("select")
na_select$param_set$values$selector = selector_name(feature_names)
graph2 = graph2 %>>% na_select
graph2$plot()
lrn2 = GraphLearner$new(graph = graph2 %>>% polrn)
r2 = resample(task, lrn2, cv10)
r2$aggregate(msr("regr.mae"))
# variable importance
sort(lrn$model$regr.ranger$model$variable.importance, decreasing = TRUE)
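
The hold-out step above prints the predictions but never scores them, and the split changes on every run because no seed is set. The following is a minimal, self-contained sketch of the same workflow with a fixed seed and a mean absolute error computed on the held-out rows. The seed value and the names `glrn` and `top5` are illustrative assumptions, not part of the original gallery post.

```r
library("mlr3")
library("mlr3learners")
library("mlr3pipelines")

set.seed(1)  # assumption: any fixed seed works; it only makes the split reproducible

data("moneyball", package = "mlr3data")
task = TaskRegr$new(id = "moneyball", backend = moneyball, target = "rs")

# same imputation steps as above, followed by the ranger learner
imp_num = po("imputehist", param_vals = list(affect_columns = selector_type(c("integer", "numeric"))))
imp_fct = po("imputeoor", param_vals = list(affect_columns = selector_type("factor")))
polrn = po("learner", lrn("regr.ranger", num.trees = 1000, importance = "permutation"))
glrn = GraphLearner$new(graph = imp_num %>>% imp_fct %>>% polrn)

# 95% of the rows for training, the rest held out
train_set = sample(task$nrow, 0.95 * task$nrow)
test_set = setdiff(seq_len(task$nrow), train_set)

glrn$train(task, row_ids = train_set)
preds = glrn$predict(task, row_ids = test_set)

# mean absolute error on the held-out rows
preds$score(msr("regr.mae"))

# five most important features by permutation importance
top5 = head(sort(glrn$model$regr.ranger$model$variable.importance, decreasing = TRUE), 5)
top5
```

With only 5% of the rows held out, this single-split score is noisy; the cross-validated `r$aggregate(msr("regr.mae"))` above remains the more reliable estimate.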