Offline Reinforcement Learning (Offline RL) Series 5: (Model Parameters) Hyperparameter Selection in Offline RL (Offline Hyperparameter Selection)


Paper: Tom Le Paine, Cosmin Paduraru, Andrea Michi, Caglar Gulcehre, Konrad Zolna, Alexander Novikov, Ziyu Wang, Nando de Freitas: "Hyperparameter Selection for Offline Reinforcement Learning", 2020; arXiv:2007.09055.

This work is a collaboration between DeepMind and Google, with Tom Le Paine as first author, and was accepted to NeurIPS 2021 as Accept (Poster). The review reads: The manuscript examines the question of how to improve policy selection in the off-line RL setting. Typically offline policy selection is approached via off-policy evaluation (OPE), aimed at estimating the expected return of candidate policies. OPE is itself a difficult problem that typical requires hyperparameter tuning and selection itself. The paper develops moves closer to a hyperparameter-free method and demonstrates the effectiveness of the algorithm in the context of standardized offline datasets (e.g. RLUnplugged for Atari). The algorithm for policy selection is built using insights from the recently published Batch Value-Function Tournament (BVFT) approach to estimating the best value function from among a set of candidates. They make comparisons to well developed OPE style methods such as fitted Q-evaluation and show clear advantages in data efficiency and the policy selection. The manuscript examines applying the approach to a wide range of settings (from Atari to continuous control) and to a range of policies produced by a variety of algorithms. The ideas, theory, and experiments are well motivated by the text. Taken together, the manuscript provides a promising look at a fundamental and open problem in RL.

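Abstract: Earlier posts in this series covered offline RL datasets, dataset characteristics, sample complexity, and algorithm implementations in some detail. One more factor has a major effect on how well an algorithm performs: the choice of hyperparameters. The authors describe this selection process and propose three metrics for measuring how well a selection procedure works. Finally, they run experiments based on the FQE (fitted Q-evaluation) algorithm, with comparisons against common algorithms such as CRR.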

1. Problem Description

1.1 Hyperparameter Selection and Tuning in Supervised Learning

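In supervised learning, common hyperparameters such as the learning rate and the network architecture have a large effect on whether and how well a model converges. A quick search suggests that the usual term in the supervised learning community is not hyperparameter selection but hyperparameter tuning/optimization. A typical optimization procedure defines the set of candidate hyperparameters and a metric to maximize or minimize for the problem at hand. In practice it follows these steps: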

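(1) Split the dataset into training and test subsets
(2) Repeat the optimization loop a fixed number of times or until a stopping condition is met:

  • Choose a new set of model hyperparameters
  • Train the model on the training subset using the chosen hyperparameters
  • Apply the model to the test subset and generate the corresponding predictions
  • Evaluate the test predictions with a scoring metric suited to the problem, such as accuracy or mean absolute error, and store the metric value for the chosen hyperparameter set

(3) Compare all metric values and select the hyperparameter set that yields the best one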
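
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not anything taken from the paper: the model (closed-form ridge regression), the single hyperparameter `alpha`, and the helper names `fit_ridge`, `mean_absolute_error`, and `select_hyperparameters` are all assumptions chosen to keep the example self-contained.

```python
import numpy as np


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression: w = (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def mean_absolute_error(y_true, y_pred):
    """Scoring metric for step (2); lower is better."""
    return float(np.mean(np.abs(y_true - y_pred)))


def select_hyperparameters(X, y, candidate_alphas, test_fraction=0.25, seed=0):
    """Hypothetical helper that follows steps (1)-(3) for one hyperparameter."""
    rng = np.random.default_rng(seed)

    # Step (1): split the dataset into training and test subsets.
    order = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # Step (2): train and score once per candidate hyperparameter value.
    scores = {}
    for alpha in candidate_alphas:
        w = fit_ridge(X_train, y_train, alpha)    # train on the training subset
        predictions = X_test @ w                  # predict on the test subset
        scores[alpha] = mean_absolute_error(y_test, predictions)  # store metric

    # Step (3): compare all metric values and keep the best (lowest error).
    best_alpha = min(scores, key=scores.get)
    return best_alpha, scores
```

Here the candidates are simply enumerated in order, which corresponds to a grid search over `alpha`. Swapping in a different way of choosing the next candidate inside the loop gives the other search strategies.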

The most common methods are the following four:

  • Grid search
  • Random search
  • Hill climbing
  • Bayesian optimization

For example, consider the function $f(x) = \sin(x/2) + 0.5 \cdot \sin(2 \cdot x) + 0.25 \cdot \cos(4.5 \cdot x)$
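
To make the search methods concrete, here is a rough sketch of three of them applied to this function. The search interval, the choice to maximize rather than minimize, the step sizes, and the function names `grid_search`, `random_search`, and `hill_climbing` are all assumptions made for illustration; the source does not specify them. Bayesian optimization is left out because it needs a surrogate model that would not fit in a short example.

```python
import math
import random


def f(x):
    """The example objective from the text."""
    return math.sin(x / 2) + 0.5 * math.sin(2 * x) + 0.25 * math.cos(4.5 * x)


def grid_search(lo, hi, n_points):
    """Evaluate f on an evenly spaced grid and return the best grid point."""
    step = (hi - lo) / (n_points - 1)
    candidates = [lo + i * step for i in range(n_points)]
    return max(candidates, key=f)


def random_search(lo, hi, n_samples, rng):
    """Evaluate f at uniformly sampled points and return the best sample."""
    candidates = [rng.uniform(lo, hi) for _ in range(n_samples)]
    return max(candidates, key=f)


def hill_climbing(x0, lo, hi, step, n_iters):
    """Move to the better neighbour until neither neighbour improves f."""
    x = x0
    for _ in range(n_iters):
        neighbours = [min(hi, x + step), max(lo, x - step)]
        best = max(neighbours, key=f)
        if f(best) <= f(x):
            break  # local optimum at this step size
        x = best
    return x


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the random search is repeatable
    for name, x in [
        ("grid", grid_search(0.0, 10.0, 201)),
        ("random", random_search(0.0, 10.0, 200, rng)),
        ("hill climbing", hill_climbing(5.0, 0.0, 10.0, 0.05, 1000)),
    ]:
        print(f"{name}: x = {x:.3f}, f(x) = {f(x):.3f}")
```

Because this function has several local maxima, hill climbing can stop at whichever peak is nearest its starting point, while grid and random search cover the whole interval at the cost of more evaluations.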
