article reading note

This post covers reinforcement learning from human feedback (RLHF) and recent progress on large language models such as LLaMA, BLOOM, and ChatGLM. It surveys several alignment approaches, including RAFT, RRHF, and DPO, and how they are used to align models with human feedback, and it collects a series of papers on large-model capabilities, focusing on reasoning, knowledge representation, and training optimization.


[in progress]

AI is about to enter its second half

algorithms -> environments (what problem to solve, and how to evaluate it)

The N Implementation Details of RLHF with PPO

Some implementation details from the lm-human-preferences code

  • local_seed = seed + rank * 100003: the per-rank seed makes each worker produce different responses and therefore collect different reward scores (see the sketch below)
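
A minimal sketch of that per-rank seeding, assuming a torch-based worker; the function name is illustrative, not the repo's actual code.

```python
import torch

def set_local_seed(seed: int, rank: int) -> int:
    """Give each distributed worker its own random seed.

    Matches the lm-human-preferences detail above (local_seed = seed + rank * 100003,
    a large prime keeps the per-rank seeds far apart), so every worker samples
    different responses and therefore collects different reward scores.
    """
    local_seed = seed + rank * 100003
    torch.manual_seed(local_seed)
    return local_seed
```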

The 37 Implementation Details of PPO

An Initial Exploration of Theoretical Support for Language Model Data Engineering. Part 1: Pretraining

  • grokking: the speed at which a skill is learned
  • different skills may hit their grokking point at different times during training
  • it is not that the model cannot learn a given format; the learning speed differs, and the model learns some formats faster
  • the more detailed the CoT is, the faster the model learns
  • training on a particular ordering of data can give faster learning than training only on skill-specific data
  • different mix ratios may result in different learning speeds
  • results of data engineering at scales below 30B do not transfer to models larger than 70B

softmax1

softmax(x)_i = \frac{\exp(x_i)}{1 + \sum_j \exp(x_j)}
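
A minimal PyTorch sketch of the variant above; softmax_one is an illustrative name, not a library function. The only change versus the standard softmax is the extra 1 in the denominator, which allows an attention row to put (nearly) zero total weight everywhere.

```python
import torch

def softmax_one(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """softmax_1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)).

    Subtracting the max keeps the exponentials stable; after the shift the
    "+1" becomes exp(-max), so the outputs can sum to less than 1.
    """
    x_max = x.max(dim=dim, keepdim=True).values
    ex = torch.exp(x - x_max)
    return ex / (ex.sum(dim=dim, keepdim=True) + torch.exp(-x_max))
```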

LongChat

  • training recipe
    • extend 2k to 16k by interpolating the rotary position ids: query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids / ratio) (see the sketch after this list)
    • finetuning on curated conversation dataset.
  • LongEval
    • Task 1: Coarse-grained Topic Retrieval
    • Task 2: Fine-grained Line Retrieval
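
A minimal sketch of the position-interpolation step in the training recipe above, assuming LLaMA-style rotary embeddings: rather than patching transformers' apply_rotary_pos_emb call, it just builds the cos/sin tables from scaled position ids. Names and defaults are illustrative, not LongChat's actual code.

```python
import torch

def interpolated_rope_tables(seq_len: int, head_dim: int, ratio: float = 8.0,
                             base: float = 10000.0):
    """Cos/sin tables for rotary embeddings with interpolated positions.

    Dividing the position ids by `ratio` (e.g. 16k / 2k = 8) squeezes the
    extended context back into the position range seen during pretraining,
    which is what the position_ids / ratio change accomplishes.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() / ratio   # the key change
    freqs = torch.outer(positions, inv_freq)            # (seq_len, head_dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)             # (seq_len, head_dim)
    return emb.cos(), emb.sin()                         # passed to apply_rotary_pos_emb
```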

A Stage Review of Instruction Tuning. 6.29

  • LLaMA-based models over the past three months
  • Eval: don't just look at the average score, look at the individual categories
    • English knowledge — MMLU
    • Chinese knowledge — C-Eval
    • Reasoning — GSM8k / BBH
    • Coding — HumanEval / MBPP
    • MATH: high-difficulty reasoning
    • Dialog

kaiokendev.github.io

  • With only 15 or so multi-instruct examples, the model quality became significantly better
  • Extending Context to 8K

Parameter-efficient fine-tuning practice for the large language models LLaMA, ChatGLM, and BLOOM

  • training data, tokenizer, and model architecture details (layer normalization, activation function, positional encoding);
  • prompt tuning, prefix tuning, LLaMA-Adapter, LoRA, and full-parameter fine-tuning
    • LoRA comes closest to full-parameter fine-tuning, but full-parameter fine-tuning overfits (see the sketch after this list)
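
Since LoRA came out closest to full-parameter tuning in this comparison, here is a minimal sketch of wiring it up with the Hugging Face peft library. The base checkpoint and target_modules are assumptions that depend on the model; LLaMA-style attention projections are shown.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Base model is illustrative; swap in the LLaMA / ChatGLM / BLOOM checkpoint you use.
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # low-rank dimension
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; model-dependent
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the LoRA adapters are trainable
```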

XGen-7B with 8K Sequence Length

Training proceeds in stages with increasing sequence length: first 800B tokens at a 2k sequence length, then 400B tokens at 4k, and finally 300B tokens at 8k.
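
A small sketch of that schedule written out as data, assuming each stage simply resumes training with a longer context; the loop body is a placeholder, not XGen's training code.

```python
# (token budget, sequence length) per stage, as described above for XGen-7B.
stages = [
    (800_000_000_000, 2048),   # stage 1: 800B tokens at 2k
    (400_000_000_000, 4096),   # stage 2: 400B tokens at 4k
    (300_000_000_000, 8192),   # stage 3: 300B tokens at 8k
]

for token_budget, seq_len in stages:
    # A real run would rebuild the dataloader at the new sequence length
    # and resume from the previous stage's checkpoint.
    print(f"train for {token_budget:,} tokens at sequence length {seq_len}")
```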

RAFT

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
code
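
A minimal sketch of the reward-ranked fine-tuning loop as the title describes it: sample several responses per prompt, keep the highest-reward one, then run ordinary supervised fine-tuning on the survivors. The callables (generate_k, reward_fn, sft_update) are placeholders, not the repository's API.

```python
from typing import Callable, List, Tuple

def raft_round(
    generate_k: Callable[[str, int], List[str]],           # samples k responses for a prompt
    reward_fn: Callable[[str, str], float],                # scores a (prompt, response) pair
    sft_update: Callable[[List[Tuple[str, str]]], None],   # one SFT pass over selected pairs
    prompts: List[str],
    k: int = 8,
) -> List[Tuple[str, str]]:
    """One RAFT round: best-of-k sampling, reward ranking, SFT on the winners."""
    selected = []
    for prompt in prompts:
        candidates = generate_k(prompt, k)
        best = max(candidates, key=lambda y: reward_fn(prompt, y))
        selected.append((prompt, best))        # keep only the top-reward response
    sft_update(selected)                       # standard supervised fine-tuning
    return selected
```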

RRHF

RRHF: Rank Responses to Align Language Models with Human Feedback without tears
code
p_i = \frac{\sum_t \log P_{\pi}(y_{i,t} \mid y_{i,<t})}{\|y_i\|}
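
A sketch of the ranking term built on the length-normalized log-probabilities p_i above: whenever one response is rated higher than another, its p should also be higher, enforced with a hinge penalty. This is my reading of the RRHF objective, not the repository's code; the full loss adds a cross-entropy term on the best-scoring response.

```python
import torch

def rrhf_rank_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Ranking loss over n candidate responses for one prompt.

    log_probs: the length-normalized p_i from the formula above, shape (n,).
    rewards:   human / reward-model scores for the same responses, shape (n,).
    For every pair with r_i < r_j, penalize max(0, p_i - p_j).
    """
    diff = log_probs.unsqueeze(1) - log_probs.unsqueeze(0)   # diff[i, j] = p_i - p_j
    worse = rewards.unsqueeze(1) < rewards.unsqueeze(0)      # worse[i, j] is True iff r_i < r_j
    return torch.relu(diff)[worse].sum()
```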
