[Reinforcement Learning] Q-learning Advanced Tips

This post covers several key tricks for Q-learning in reinforcement learning: Double DQN to mitigate Q-value overestimation, Dueling DQN which separates the state value from the advantage function for more efficient updates, Prioritized Replay which preferentially samples important transitions from the replay buffer, Multi-step learning which balances Monte Carlo and temporal-difference learning, NoisyNet which explores by adding noise to the network parameters, and the Distributional Q-function, which models the full return distribution whose expectation is the Q value, giving a richer representation of an action's value.


Ref. Hung-yi Lee's (李宏毅) reinforcement learning course

Tips of Q-learning

1. Double DQN

Q values tend to be overestimated; Double DQN can alleviate this problem.
· Why the overestimation? We want $Q(s_t, a_t)$ to move as close as possible to the target $r_t + \max_a Q(s_{t+1}, a)$. Because the estimated network is noisy, the $\max$ operator tends to select whichever action happens to have its value overestimated.
· Double DQN replaces the target with $r_t + Q'\big(s_{t+1}, \arg\max_a Q(s_{t+1}, a)\big)$: the online network $Q$ only proposes the action, while $Q'$ evaluates it, so the two networks keep each other in check. $Q'$ is the target network and is kept fixed (only periodically synchronized with $Q$), as in the sketch below.
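A minimal PyTorch sketch of the Double DQN target computation (the names `q_net`, `target_net` and the batch tensors are illustrative assumptions, not from the course):

```python
import torch

def double_dqn_target(q_net, target_net, reward, next_state, done, gamma=0.99):
    """Target: r_t + gamma * Q'(s_{t+1}, argmax_a Q(s_{t+1}, a))."""
    with torch.no_grad():
        # The online network Q proposes the greedy action for s_{t+1} ...
        next_action = q_net(next_state).argmax(dim=1, keepdim=True)        # (B, 1)
        # ... and the fixed target network Q' evaluates that action.
        next_q = target_net(next_state).gather(1, next_action).squeeze(1)  # (B,)
        return reward + gamma * (1.0 - done) * next_q
```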

2. Dueling DQN

Dueling DQN changes the network architecture.
The network first outputs a scalar $V(s)$ (say 1.0), then outputs $A(s,a)$ for every action, and the two are summed: $Q(s,a) = V(s) + A(s,a)$.
The benefit is that updating $V(s)$ alone already shifts the Q values of all actions, without touching $A$. But what if the network prefers to update only $A$ and learns $V \equiv 0$? Add a constraint on $A$, e.g. normalize $A$ (subtract its mean over actions) before adding it to $V$, which weakens the influence of $A$.
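A possible PyTorch sketch of the dueling architecture with the mean-over-actions normalization on $A$ (layer sizes are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # scalar V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, state):
        h = self.feature(state)
        v = self.value(h)       # (B, 1)
        a = self.advantage(h)   # (B, n_actions)
        # Normalize A (subtract its mean over actions) before adding V,
        # so the network cannot simply ignore V.
        return v + a - a.mean(dim=1, keepdim=True)
```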

3. Prioritized Replay

When sampling from the replay buffer, preferentially pick the transitions that are more useful for learning, e.g. those with a large TD error.
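A toy sketch of proportional prioritized replay, assuming priorities come from the absolute TD error (a practical implementation would use a sum-tree for efficiency; all names here are illustrative):

```python
import numpy as np

class PrioritizedReplay:
    """Toy proportional prioritized replay buffer."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        # Larger |TD error| -> higher sampling priority.
        self.data.append(transition)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)
        if len(self.data) > self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)

    def sample(self, batch_size, beta=0.4):
        probs = np.array(self.priorities) / sum(self.priorities)
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights correct the bias from non-uniform sampling.
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return [self.data[i] for i in idx], idx, weights
```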

4. Multi-step

This balances Monte Carlo (MC) and temporal-difference (TD) learning: instead of the one-step target, use an N-step target $\sum_{i=0}^{N-1} \gamma^{i} r_{t+i} + \gamma^{N} \max_a Q(s_{t+N}, a)$; a larger N behaves more like MC, a smaller N more like TD.
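A small sketch of the N-step target under the usual assumptions (N rewards collected from step t onwards, bootstrapping from the target network unless the episode ended; function and argument names are hypothetical):

```python
def n_step_target(rewards, bootstrap_q, done, gamma=0.99):
    """N-step target: sum_{i=0}^{N-1} gamma^i * r_{t+i} + gamma^N * max_a Q(s_{t+N}, a).

    rewards      -- the N rewards collected from step t onwards
    bootstrap_q  -- max_a Q(s_{t+N}, a) from the target network
    done         -- whether the episode ended within these N steps
    """
    target = 0.0
    for i, r in enumerate(rewards):
        target += (gamma ** i) * r
    if not done:
        target += (gamma ** len(rewards)) * bootstrap_q
    return target
```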

5. Noisy Net

Besides epsilon-greedy exploration (noise on the actions), one can also add noise to the network parameters.
Note that the noisy network is sampled once at the beginning of each episode and kept fixed until that episode ends; only then is a new noise sample drawn.
· OpenAI's variant adds Gaussian noise directly to the parameters.
· DeepMind's NoisyNet uses noise whose scale is controlled by learnable parameters.
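A simplified sketch of a noisy linear layer in PyTorch: learnable `mu` and `sigma` per weight, with noise resampled via `reset_noise()` once per episode. This uses independent Gaussian noise for brevity rather than DeepMind's factorized scheme; all names and initialization values are assumptions for illustration.

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Simplified noisy layer: weight = mu + sigma * eps, with learnable mu and sigma."""

    def __init__(self, in_features, out_features, sigma0=0.017):
        super().__init__()
        self.mu = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.sigma = nn.Parameter(torch.full((out_features, in_features), sigma0))
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.register_buffer("eps", torch.zeros(out_features, in_features))

    def reset_noise(self):
        # Called once at the start of each episode; the same noise is then
        # reused until the episode ends.
        self.eps.normal_()

    def forward(self, x):
        return nn.functional.linear(x, self.mu + self.sigma * self.eps, self.bias)
```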

6. Distributional Q-function

The Q value is actually the expectation of a distribution over returns. But different distributions can share the same mean, so a single Q value is not enough to characterize an action. Instead, the network can output, for each action, the probability that its return falls into each bin of a discretized range.
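A C51-style sketch of a distributional Q head: each action gets a probability over a fixed set of return bins, and the scalar Q value is recovered as the expectation (the support range and layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class DistributionalQNet(nn.Module):
    """For each action, output a probability over n_atoms return bins."""

    def __init__(self, state_dim, n_actions, n_atoms=51, v_min=-10.0, v_max=10.0):
        super().__init__()
        self.n_actions, self.n_atoms = n_actions, n_atoms
        # Fixed support z_1, ..., z_N of the return distribution
        self.register_buffer("support", torch.linspace(v_min, v_max, n_atoms))
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions * n_atoms),
        )

    def forward(self, state):
        logits = self.net(state).view(-1, self.n_actions, self.n_atoms)
        probs = logits.softmax(dim=-1)              # P(return in bin j | s, a)
        q_values = (probs * self.support).sum(-1)   # Q(s, a) = E[Z(s, a)]
        return probs, q_values
```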
