You Can Follow the DPO Training Process Even Without a Reinforcement Learning Background

In introductions to DPO, the claim we most often see is that DPO found a mathematical transformation that lets you optimize the policy directly from human preference data, without explicitly training a separate reward model: the reward model is implicitly expressed inside the policy-optimization objective. DPO is simpler and more stable, and its results often match or even exceed PPO+RM. As long as you prepare high-quality triplet data and a base model, you can easily launch DPO training with the open-source tool llama-factory!

But the key point, "the reward model is implicitly expressed in the policy-optimization objective", is rarely explained in a simple, accessible way. In this post we assume an NL2SQL business scenario: our goal is to fine-tune a large model so that it understands the table schemas and metadata of the business domain, and can turn a user's natural-language question into high-quality, executable SQL. (NL2SQL is a fairly common scenario. It can also be handled with RAG or an agent, especially when the business question is simple and a few wide tables are enough to answer it, but fine-tuning the model with reinforcement learning is indeed an effective way to improve accuracy.)

1. The DPO training process

(1) Dataset example:

{
  "prompt": "Total sales by department in 2023",
  "chosen": "SELECT d.dept_name, SUM(s.amount) FROM sales s JOIN departments d ON s.dept_id=d.id WHERE s.year=2023 GROUP BY d.dept_name",
  "rejected": "SELECT dept_name, SUM(amount) FROM sales WHERE year=2023" // missing the JOIN, so the result is wrong
}
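
Records like this are usually stored one JSON object per line (JSONL) and read in as a list of preference triples. Below is a minimal loading sketch, assuming a hypothetical file name nl2sql_dpo.jsonl (the // annotation above is only an explanatory note; real JSON has no comments):

```python
import json

# Hypothetical file: one {"prompt", "chosen", "rejected"} object per line.
with open("nl2sql_dpo.jsonl", encoding="utf-8") as f:
    pairs = [json.loads(line) for line in f]

sample = pairs[0]
print(sample["prompt"])    # the natural-language question
print(sample["chosen"])    # the SQL we prefer
print(sample["rejected"])  # the SQL we want the model to move away from
```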

(2) The two key models

  1. The policy model to be trained (π_θ): this is the final model we want to end up with. It takes the user prompt (a natural-language question) and outputs a SQL statement.

  2. The reference model (π_ref): usually a fine-tuned base model (for example, one that has gone through instruction tuning / SFT so that it has a preliminary grasp of the SQL task). It represents the "baseline behavior" before training. In DPO it stays frozen and serves as an "anchor", preventing the policy from drifting too far or taking shortcuts.

    It is worth emphasizing that this model should already: master basic SQL syntax; understand the domain-specific table schemas and metadata relationships (most important!); and have an initial mapping from natural-language questions to SQL structures. A minimal setup sketch follows this list.
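
Since π_θ and π_ref start from the same SFT checkpoint but play different roles, one common way to set them up looks like the sketch below. It assumes the Hugging Face transformers API and a hypothetical local path sft-nl2sql-model; this is not how llama-factory wires things internally, just the idea.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

SFT_CKPT = "sft-nl2sql-model"  # hypothetical path to the SFT'd base model described above

tokenizer = AutoTokenizer.from_pretrained(SFT_CKPT)
policy    = AutoModelForCausalLM.from_pretrained(SFT_CKPT)  # π_θ: the model DPO will update
reference = AutoModelForCausalLM.from_pretrained(SFT_CKPT)  # π_ref: the frozen anchor

reference.eval()
for p in reference.parameters():
    p.requires_grad_(False)  # π_ref only scores sequences; it is never trained
```

Both copies start from the same SFT weights, which is why the quality of that SFT stage, especially its knowledge of the business schemas, matters so much for the final DPO result.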

(3) Forward computation and backward iteration

Assume for now that we have only that single training sample.

  1. Forward pass:

    • The prompt (“Total sales by department in 2023”) is fed to both the model being trained (π_θ) and the reference model (π_ref)

    • At this step the models' task is not to generate a complete SQL statement, but to compute the probability (more precisely, the log-likelihood) of the given output sequences (the chosen SQL and the rejected SQL)

    • The quantities computed are:

      • π_θ(chosen | prompt) and π_θ(rejected | prompt)

      • π_ref(chosen | prompt) and π_ref(rejected | prompt)

    These four log-likelihoods are all the DPO loss needs; a hedged sketch of how they are combined and backpropagated follows below.
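
To make "the reward model is hidden inside the objective" concrete, the DPO loss from the original paper can be written directly in terms of those four quantities (β is a hyperparameter, often around 0.1, that controls how strongly the policy is kept close to π_ref):

L_DPO = -log σ( β · [ (log π_θ(chosen | prompt) - log π_ref(chosen | prompt)) - (log π_θ(rejected | prompt) - log π_ref(rejected | prompt)) ] )

The difference log π_θ(y | prompt) - log π_ref(y | prompt) acts as an implicit reward, which is exactly where the "hidden" reward model lives. Below is a minimal PyTorch-style sketch of one training step on the single sample, reusing policy, reference and tokenizer objects like the ones in the loading sketch above; sequence_log_prob is a hypothetical helper (not the llama-factory implementation) and the hyperparameter values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(model, tokenizer, prompt, response):
    """Sum of token log-probabilities of `response` given `prompt`.
    Hypothetical helper; splitting at the prompt length is a simplification
    of how real trainers mask out the prompt tokens."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits[:, :-1, :]          # logits predicting token t+1 at position t
    targets = full_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1).gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return logps[:, prompt_len - 1:].sum()               # keep only the response positions

# The (prompt, chosen, rejected) triple from the dataset example above.
prompt   = "Total sales by department in 2023"
chosen   = ("SELECT d.dept_name, SUM(s.amount) FROM sales s "
            "JOIN departments d ON s.dept_id=d.id WHERE s.year=2023 GROUP BY d.dept_name")
rejected = "SELECT dept_name, SUM(amount) FROM sales WHERE year=2023"

beta = 0.1                                                    # assumption: a typical DPO beta
optimizer = torch.optim.AdamW(policy.parameters(), lr=5e-6)   # assumption: illustrative learning rate

# Forward pass: the four log-likelihoods described above.
logp_chosen_policy   = sequence_log_prob(policy, tokenizer, prompt, chosen)
logp_rejected_policy = sequence_log_prob(policy, tokenizer, prompt, rejected)
with torch.no_grad():                                         # π_ref is frozen, no gradients needed
    logp_chosen_ref   = sequence_log_prob(reference, tokenizer, prompt, chosen)
    logp_rejected_ref = sequence_log_prob(reference, tokenizer, prompt, rejected)

# Implicit rewards: how much more (log-)likely each answer became relative to π_ref.
chosen_margin   = logp_chosen_policy   - logp_chosen_ref
rejected_margin = logp_rejected_policy - logp_rejected_ref

# DPO loss: push the chosen margin above the rejected margin.
loss = -F.logsigmoid(beta * (chosen_margin - rejected_margin))

# Backward iteration: gradients flow only into the policy model.
loss.backward()
optimizer.step()
optimizer.zero_grad()
```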
