AI is about to enter its second half
Algorithms -> environments (what problems to solve, and how to evaluate them)
The N Implementation Details of RLHF with PPO
Some code implementation details of lm-human-preference
local_seed = seed + rank * 100003
The per-rank seed makes each rank's model produce different responses, which then receive different scores.
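A minimal sketch of how such per-rank seeding could be wired up (the helper name `set_local_seed` and the use of `torch.distributed` are assumptions for illustration, not the repo's actual code):

```python
import random

import numpy as np
import torch
import torch.distributed as dist


def set_local_seed(seed: int) -> None:
    """Give each data-parallel rank its own seed so it samples different responses."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    local_seed = seed + rank * 100003  # large prime offset, as in the note above
    random.seed(local_seed)
    np.random.seed(local_seed)
    torch.manual_seed(local_seed)
```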
The 37 Implementation Details of PPO
An Initial Exploration of Theoretical Support for Language Model Data Engineering. Part 1: Pretraining
- grokking: the speed at which a skill is learned
- different skills may reach their grokking point at different times during training
- it is not that the model cannot learn; rather, the learning speed differs, and the model learns faster with some data formats
- the more detailed the CoT is, the faster the model learns
- training on a particular ordering of data can give faster learning than training on skill-specific data alone
- different mix ratios may result in different learning speeds
- results of data engineering at scales below 30B cannot be assumed to transfer to models larger than 70B
softmax1
$$\mathrm{softmax}_1(x)_i=\frac{\exp(x_i)}{1+\sum_j \exp(x_j)}$$
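A minimal PyTorch sketch of this variant, written in a numerically stable form (the name `softmax_one` is an assumption):

```python
import torch


def softmax_one(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Softmax with an extra "+1" in the denominator, so all weights can go to ~0."""
    # Shift by a non-negative max so exp() never overflows; the shift also rescales
    # the "+1" term to exp(-m), which keeps the result mathematically identical.
    m = x.max(dim=dim, keepdim=True).values.clamp(min=0)
    exp_x = torch.exp(x - m)
    return exp_x / (torch.exp(-m) + exp_x.sum(dim=dim, keepdim=True))
```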
LongChat
- training recipe
- extend 2k to 16k by dividing the position ids by the length ratio (condensed RoPE) before applying the rotary embedding; see the sketch below:
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids / ratio)
- finetuning on a curated conversation dataset.
- LongEval
- Task 1: Coarse-grained Topic Retrieval
- Task 2: Fine-grained Line Retrieval
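A minimal sketch of the interpolation behind the one-line patch above: the rotary cos/sin tables are computed from position ids divided by the length ratio, so a 16k context is squeezed into the 2k range seen during pretraining (helper names are illustrative, not the actual LongChat monkey patch):

```python
import torch


def condensed_rotary_cos_sin(seq_len: int, dim: int, ratio: float = 8.0, base: float = 10000.0):
    """Standard RoPE cos/sin tables, but with position ids divided by `ratio` (16k / 2k = 8)."""
    position_ids = torch.arange(seq_len, dtype=torch.float32) / ratio
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    freqs = torch.outer(position_ids, inv_freq)   # (seq_len, dim // 2)
    emb = torch.cat((freqs, freqs), dim=-1)       # (seq_len, dim)
    return emb.cos(), emb.sin()
```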
A Stage Review of Instruction Tuning. 6.29
- LLaMA-based models over the past three months
- Eval: don't look at the average score, look at the individual categories
- English knowledge — MMLU
- Chinese knowledge — C-Eval
- Reasoning — GSM8k / BBH
- Coding — HumanEval / MBPP
- MATH: high-difficulty reasoning
- Dialog
kaiokendev.github.io
- With only 15 or so multi-instruct examples, the model quality became significantly better
- Extending Context to 8K
Efficient parameter fine-tuning practice for the large language models LLaMA, ChatGLM, and BLOOM
- training data, tokenizer, and model-architecture details (layer normalization, activation function, positional encoding);
- prompt tuning, prefix tuning, LLaMA-Adapter, LoRA, full-parameter fine-tuning
- LoRA comes closest to full-parameter fine-tuning, but full-parameter fine-tuning overfits (see the LoRA sketch below)
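A minimal LoRA sketch to make the comparison concrete (class name, rank, and alpha are illustrative defaults, not the exact configuration used in the post):

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update B @ A, scaled by alpha / r."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # only the LoRA factors are trained
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)          # update starts at zero, so behaviour is unchanged
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.lora_B(self.lora_A(x)) * self.scaling
```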
XGen-7B with 8K Sequence Length
Training proceeds in stages with increasing sequence length: first 800B tokens are observed at a sequence length of 2k, then 400B tokens at 4k, and finally 300B tokens at 8k.
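Written out as a simple schedule (numbers taken from the sentence above; the structure is illustrative):

```python
# XGen-7B staged pretraining: (sequence_length, tokens_observed)
STAGES = [
    (2_048, 800_000_000_000),  # 800B tokens at 2k
    (4_096, 400_000_000_000),  # 400B tokens at 4k
    (8_192, 300_000_000_000),  # 300B tokens at 8k
]
```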
RAFT
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
code
RRHF
RRHF: Rank Responses to Align Language Models with Human Feedback without tears
code
$$p_i=\frac{\sum_{t}\log P_{\pi}(y_{i,t}\mid y_{i,<t})}{\|y_i\|}$$
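A minimal sketch of this length-normalized score and a hinge-style ranking loss in the spirit of RRHF (argument names and shapes are assumptions; per-token log-probs and a response mask are taken as given):

```python
import torch


def sequence_scores(token_logprobs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """p_i: sum of each candidate response's token log-probs divided by its length."""
    # token_logprobs, mask: (num_candidates, seq_len); mask is 1 on response tokens, 0 on padding.
    return (token_logprobs * mask).sum(dim=-1) / mask.sum(dim=-1)


def rank_loss(scores: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Penalise max(0, p_i - p_j) for every pair where candidate j has the higher reward."""
    p_diff = scores.unsqueeze(1) - scores.unsqueeze(0)        # [i, j] = p_i - p_j
    j_better = rewards.unsqueeze(0) > rewards.unsqueeze(1)    # [i, j] = reward_j > reward_i
    return torch.relu(p_diff)[j_better].sum()
```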