article reading note: $L_{rank}=\sum_{r_i<r_j}\max(0,p_i-p_j)$

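This article discusses reinforcement learning from human feedback (RLHF) and recent progress in large language models such as LLaMA, BLOOM, and ChatGLM. It focuses on different RLHF implementations, including RAFT, RRHF, and DPO, and discusses how they are used to align models with human feedback. It also lists a series of research papers on large-model capabilities, with a focus on reasoning, knowledge representation, and training optimization in large language models.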

[In progress]

AI is about to enter its second half

Algorithms -> environments (what problem to solve, and how to evaluate it)

The N Implementation Details of RLHF with PPO

Some code-level implementation details of lm-human-preferences

local_seed = seed + rank * 100003: each rank gets its own seed, so each process makes the model produce different responses and receive different scores.
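The per-rank seeding above can be sketched in a few lines. This is a minimal illustration rather than the lm-human-preferences code: `sample_tokens`, its `vocab_size` default, and the use of Python's `random` module in place of a real model sampler are all assumptions made for the example.

```python
import random


def local_seed(seed: int, rank: int) -> int:
    """Derive a per-process seed from the shared base seed and the process rank."""
    # 100003 is the multiplier quoted in the note.
    return seed + rank * 100003


def sample_tokens(seed: int, rank: int, n: int = 5, vocab_size: int = 50257):
    """Hypothetical stand-in for sampling a response: draw n token ids."""
    rng = random.Random(local_seed(seed, rank))
    return [rng.randrange(vocab_size) for _ in range(n)]


if __name__ == "__main__":
    # Same base seed, different ranks: each rank draws a different sequence.
    for rank in range(3):
        print(rank, local_seed(1, rank), sample_tokens(1, rank))
```

Re-running with the same `seed` and `rank` reproduces the same draw, so runs stay reproducible while ranks still explore different responses.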

The 37 Implementation Details of PPO

An Initial Exploration of Theoretical Support for Language Model Data Engineering. Part 1: Pretraining

grokking: the speed of learning a skill
Different skills may reach their grokking point at different stages of training.
It is not that the model cannot learn a skill; what differs is the speed, and the model learns faster from some data formats.
The more detailed the CoT is, the faster the model learns.
Training on a particular ordering of data can give a faster learning speed than training on skill-specific data.
Different mix ratios may result in different learning speeds.
Data-engineering results obtained on models smaller than 30B do not transfer to models larger than 70B.

softmax1

$\mathrm{softmax}_1(x)_i=\frac{\exp(x_i)}{1+\sum_j \exp(x_j)}$
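A minimal pure-Python sketch of the formula above. The max-shift for numerical stability is an addition of this example, not something the note specifies; the extra 1 in the denominator is treated as `exp(0)`, so it is shifted together with the other terms.

```python
import math


def softmax1(xs):
    """softmax with an extra 1 in the denominator: exp(x_i) / (1 + sum_j exp(x_j))."""
    # Shift by m = max(0, max(xs)); the constant 1 becomes exp(-m) after the shift.
    m = max(0.0, max(xs))
    exps = [math.exp(x - m) for x in xs]
    denom = math.exp(-m) + sum(exps)
    return [e / denom for e in exps]


if __name__ == "__main__":
    print(softmax1([1.0, 2.0, 3.0]))   # entries sum to less than 1
    print(softmax1([-50.0, -60.0]))    # all entries can be close to 0
```

Unlike the standard softmax, the outputs do not have to sum to 1, so when every logit is strongly negative all outputs can be close to zero at once.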

LongChat

Training recipe
- Extend the context length from 2k to 16k by condensing the rotary positions: query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids / ratio)
- Fine-tune on a curated conversation dataset.
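The effect of dividing `position_ids` by `ratio` can be illustrated with a small pure-Python sketch. The helper names `rotary_angles` and `apply_rotary` are hypothetical, the value `ratio = 16384 / 2048 = 8` is inferred from the 2k-to-16k extension, and the interleaved pairing of dimensions is chosen for readability; the Hugging Face LLaMA code lays out the rotated halves differently.

```python
import math


def rotary_angles(position, head_dim, base=10000.0, ratio=1.0):
    """Rotation angle per dimension pair, with the position divided by ratio."""
    scaled = position / ratio
    return [scaled * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rotary(vec, angles):
    """Rotate each pair (vec[2i], vec[2i+1]) by angles[i]."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


if __name__ == "__main__":
    ratio = 16384 / 2048  # assumed: target length / original length
    # Position 16000 in the extended context gets the same angles that
    # position 2000 had in the original 2k context.
    print(rotary_angles(16000, 8, ratio=ratio) == rotary_angles(2000, 8))
    print(apply_rotary([1.0, 2.0, 3.0, 4.0], rotary_angles(5, 4, ratio=ratio)))
```

With this scaling, every position in the 16k window maps into the position range the model already saw during pretraining, which is why a short fine-tune is enough to adapt.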
LongEval
- Task 1: Coarse-grained Topic Retrieval
- Task 2: Fine-grained Line Retrieval

A Stage Review of Instruction Tuning. 6.29

LLaMA-based models over the past three months
Eval: do not look at the average score; look at each benchmark individually
- English knowledge — MMLU
- Chinese knowledge — C-Eval
- Reasoning — GSM8k / BBH
- Coding — HumanEval / MBPP
- MATH: high-difficulty reasoning
- Dialog

kaiokendev.github.io

With only 15 or so multi-instruct examples, the model quality became significantly better
Extending Context to 8K

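Parameter-Efficient Fine-Tuning in Practice for the Large Language Models LLaMA, ChatGLM, and BLOOM

Training data, tokenizer, and model architecture details (layer normalization, activation function, and positional encoding);
prompt tuning, prefix tuning, LLaMA-Adapter, LoRA, and full-parameter fine-tuning
- LoRA comes closest to full-parameter fine-tuning, but full-parameter fine-tuning overfits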

XGen-7B with 8K Sequence Length

Training proceeds in stages with increasing sequence length: first 800B tokens at a sequence length of 2k, then 400B tokens at 4k, and finally 300B tokens at 8k.
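The staged schedule can be written down as a small lookup. This is only a sketch of the arithmetic: reading 2k/4k/8k as 2048/4096/8192 is an assumption, and `STAGES`, `total_tokens_b`, and `seq_len_at` are hypothetical names, not part of any XGen code.

```python
# (tokens in billions, sequence length) per stage, as quoted in the note.
# Reading "2k/4k/8k" as 2048/4096/8192 is an assumption.
STAGES = [(800, 2048), (400, 4096), (300, 8192)]


def total_tokens_b(stages=STAGES):
    """Total training tokens across all stages, in billions."""
    return sum(tokens for tokens, _ in stages)


def seq_len_at(tokens_seen_b, stages=STAGES):
    """Sequence length in effect after tokens_seen_b billion tokens."""
    boundary = 0
    for tokens, seq_len in stages:
        boundary += tokens
        if tokens_seen_b < boundary:
            return seq_len
    return stages[-1][1]  # past the end: stay at the final length


if __name__ == "__main__":
    print(total_tokens_b())  # 800 + 400 + 300 = 1500 (billion tokens)
    for seen in (0, 800, 1200):
        print(seen, seq_len_at(seen))
```

Under this reading, more than half of the tokens are seen at the shortest length, and only the last 300B use the full 8k context.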