【ing】
attention sink
Zhihu: An in-depth analysis of why Attention Sinks actually work
"How Attention Sinks Keep Language Models Stable"
- The model "dumps" a large share of its attention onto the first few tokens.
- Two implementations (sketched in code after this list):
  - StreamingLLM places a dedicated, learnable sink token $k_0$ at the start of the sequence, and attention is computed as $\mathrm{attn}(q_i, K) = \mathrm{softmax}([\,q_i^T K_c,\ q_i^T k_0\,])$, where $K_c$ are the content tokens.
  - OpenAI uses a simpler variant with a single learned scalar: $\mathrm{attn}(q_i, K) = \mathrm{softmax}([\,q_i^T K_c,\ b\,])$.
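A minimal PyTorch sketch of the two variants above (single query, single head, no $1/\sqrt{d}$ scaling or masking for brevity; the function names are mine, not from either source):

```python
import torch
import torch.nn.functional as F

def attn_with_sink_token(q, K_c, k0):
    # StreamingLLM-style: append the logit of a learnable sink key k0,
    # attn(q_i, K) = softmax([q_i^T K_c, q_i^T k0])
    logits = torch.cat([K_c @ q, (k0 @ q).unsqueeze(0)])  # shape (n + 1,)
    return F.softmax(logits, dim=-1)  # last weight is the sink's share

def attn_with_sink_scalar(q, K_c, b):
    # Scalar variant: append a learned scalar b directly to the logits,
    # attn(q_i, K) = softmax([q_i^T K_c, b])
    logits = torch.cat([K_c @ q, b.reshape(1)])
    return F.softmax(logits, dim=-1)

# toy usage: one query against 5 content keys
d, n = 8, 5
q, K_c = torch.randn(d), torch.randn(n, d)
k0, b = torch.randn(d), torch.tensor(0.0)
print(attn_with_sink_token(q, K_c, k0))
print(attn_with_sink_scalar(q, K_c, b))
```

In both cases the extra slot gives attention somewhere harmless to go, so it does not have to be forced onto real content tokens.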
AI is about to enter its second half
Algorithms -> environments (what problem to solve, and how to evaluate it)
The N Implementation Details of RLHF with PPO
Some code implementation details from lm-human-preference
`local_seed = seed + rank * 100003`: the per-rank seed makes each worker's model produce different responses and therefore get different scores.
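A toy sketch of what the per-rank seed offset does (the world size and the use of Python's `random` here are assumptions; the prime multiplier 100003 just keeps the per-rank seeds far apart):

```python
import random

seed = 42
world_size = 4  # assumed number of data-parallel workers

for rank in range(world_size):
    # prime offset per rank, as in the lm-human-preference note above
    local_seed = seed + rank * 100003
    rng = random.Random(local_seed)
    # each rank now samples different completions for the same prompts,
    # and therefore collects different reward scores
    print(rank, local_seed, rng.random())
```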
The 37 Implementation Details of PPO
An Initial Exploration of Theoretical Support for Language Model Data Engineering. Part 1: Pretraining
- grokking: the speed of learning a skill
- different skills may reach their grokking points at different times during training
- it is not that the model cannot learn; rather, the learning speed differs, and the model learns faster with some formats
- the more detailed the CoT is, the faster the model learns
- training on a particular ordering of data can give a faster learning speed than training only on skill-specific data
- different data mix ratios can result in different learning speeds (see the mixture-sampling sketch after this list)
- results of data engineering across model scales
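A toy sketch of the mix-ratio knob mentioned above (the source names and weights are made up; it simply draws each training example from a source in proportion to its mixture weight):

```python
import random

# hypothetical sources and mixture weights; changing these ratios is the
# "mix ratio" knob that the notes above say changes learning speed
mixture = {"web": 0.6, "code": 0.3, "math": 0.1}
sources = {name: iter(range(10_000)) for name in mixture}  # stand-ins for real datasets

rng = random.Random(0)
names, weights = zip(*mixture.items())

def next_example():
    # pick a source with probability proportional to its weight,
    # then take the next example from that source
    name = rng.choices(names, weights=weights, k=1)[0]
    return name, next(sources[name])

print([next_example() for _ in range(8)])
```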

This article covers reward-based reinforcement learning from human feedback (RLHF) and recent progress in large language models such as LLaMA, BLOOM, and ChatGLM. It focuses on different RLHF implementations, including RAFT, RRHF, and DPO, and their use in aligning models with human feedback. It also lists a series of papers on large-model capabilities, with an emphasis on reasoning, knowledge representation, and training optimization.