Scaling Laws for Precision (Translation)

Paper: https://arxiv.org/pdf/2411.04330

Scaling Laws for Precision

Tanishq Kumar$^{1}$ Zachary Ankner$^{3,4}$ Benjamin F. Spector$^{2}$ Blake Bordelon$^{1}$ Niklas Muennighoff$^{2}$ Mansheej Paul$^{4}$ Cengiz Pehlevan$^{1}$ Christopher Ré$^{2}$ Aditi Raghunathan$^{5}$

$^{1}$ Harvard University
$^{2}$ Stanford University
$^{3}$ MIT
$^{4}$ Databricks
$^{5}$ Carnegie Mellon University

Abstract

Low-precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for both training and inference. We propose that training in lower precision reduces the model's effective parameter count, allowing us to predict the additional loss incurred from training in low precision and from post-train quantization. For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data, eventually making additional pretraining data actively harmful. For training, our scaling laws allow us to predict the loss of a model whose parts are in different precisions, and suggest that training larger models in lower precision may be compute-optimal. We unify the scaling laws for post- and pretraining quantization to arrive at a single functional form that predicts degradation from training and inference in varied precisions. We fit our laws on over 465 pretraining runs and validate our predictions on model sizes of up to 1.7B parameters trained on up to 26B tokens.

1 Introduction

Scale has emerged as a central driver of progress in deep learning [Brown, 2020]. Key work on scaling [Kaplan et al., 2020, Hoffmann et al., 2022] studied tradeoffs between model and dataset size to balance performance and compute. However, the precision in which models are trained and served is an important third factor that contributes to both cost and performance. Deep learning is trending towards lower precision: current frontier models like the Llama-3 series are trained in BF16 [Dubey et al., 2024], and there is a widespread effort to move the pretraining paradigm to FP8 [Micikevicius et al., 2022]. The next generation of hardware will support FP4, and advances in weight-only quantization have led to training in binary and ternary at scale [Ma et al., 2024, Wang et al., 2023]. How far will these paradigms go? Specifically, we ask:

What are the tradeoffs between precision, parameters, and data? How do they compare for pretraining and inference?

Studying scaling in precision is challenging because work on scaling laws generally aims to drop fine-grained implementation details in pursuit of universal functional forms, while work on quantization generally does the opposite and focuses on the details: how quantization is done, with what type, and to what part of the model. In seeking a balance, we consider a variety of plausible functional forms and choose one that abstracts the implementation details of quantization away from loss scaling, allowing us to predict loss scaling in many situations of practical interest.

*Equal contribution. Correspondence to tkumar@college.harvard.edu

Figure 1: Schematic of key findings. (Left) Training a fixed model size to various data budgets in BF16 and quantizing weights at the end. We find that the degradation due to post-train quantization increases with the number of tokens seen during pretraining, so that eventually additional pretraining data can be harmful. (Right) Our scaling laws suggest that training larger models in lower precision can be compute-optimal according to the cost model in Section 4.3. Weights, activations, and attention are quantized; all models are trained on the same data budget; details in Appendix H.

This functional form posits that bit precision and parameter count interchangeably contribute to a model's "effective parameter count," $N_{\text{eff}}$, and that implementation details, such as which parts of a model are quantized to what precision, interact with loss scaling only through their effect on this quantity.

Overall, we study the scaling of the effects of precision on loss as we vary data and parameters, both during and after training. We first study how the degradation induced by post-train quantization scales with parameters and data. We find that the degradation increases with data, so that for a fixed model, training on additional data after a certain point can be actively harmful if the model will be quantized after training. We then shift our focus to quantized training, examining both the quantization-aware-training (weights only) and low-precision training (weights, activations, attention all quantized) settings. Our scaling laws for pretraining suggest that the compute-optimal pretraining precision is in general independent of compute budget. Surprisingly, however, this independence ceases to be true if model size is constrained, in which case the compute-optimal precision grows slowly in compute.

In all, we pretrain a suite of 465 language models in 3- to 16-bit precision, as well as post-train quantize each to multiple precisions. For a language model with $N$ parameters, trained on $D$ tokens with training precision $P_{\text{train}}$ and post-train weight precision $P_{\text{post}}$, we ultimately find a unified scaling law that takes the following form:

$$L(N, D, P_{\text{train}}, P_{\text{post}}) = A\,N_{\text{eff}}(N, P_{\text{train}})^{-\alpha} + B\,D^{-\beta} + E + \delta_{\mathrm{PTQ}}(N, D, P_{\text{train}}, P_{\text{post}})$$

where $A, B, E, \alpha, \beta$ are positive fitted constants and $\delta_{\mathrm{PTQ}}$ refers to the loss degradation induced by post-training quantization before inference. Altogether, our results for post-train quantization illustrate how more pretraining FLOPs do not always lead to better models at inference time, and our results for low-precision pretraining suggest that both the standard practice of training models in 16-bit and the race to extremely low (sub-4-bit) pretraining precision may be suboptimal.
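
To make the shape of this law concrete, the short sketch below evaluates it numerically. The constants and the saturating form of $N_{\text{eff}}$ are made-up placeholders for illustration, not the paper's fitted values; the point is only how precision enters the loss through the effective parameter count.

```python
# Illustrative evaluation of the unified form: Chinchilla-like in (N_eff, D),
# plus a post-train-quantization penalty. All constants are hypothetical.
import math

A, B, E, alpha, beta = 400.0, 1500.0, 1.8, 0.33, 0.28   # NOT the paper's fitted values

def n_eff(n_params: float, p_train: float, gamma: float = 3.0) -> float:
    # Effective parameter count shrinks as training precision drops (assumed saturating form).
    return n_params * (1.0 - math.exp(-p_train / gamma))

def predicted_loss(n_params: float, d_tokens: float, p_train: float, delta_ptq: float = 0.0) -> float:
    ne = n_eff(n_params, p_train)
    return A * ne ** (-alpha) + B * d_tokens ** (-beta) + E + delta_ptq

# Same 220M-parameter model trained on 26B tokens, in 16-bit vs 4-bit:
print(predicted_loss(220e6, 26e9, p_train=16))
print(predicted_loss(220e6, 26e9, p_train=4))   # higher loss: fewer "effective" parameters
```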

2 Background, Related Work, and Setup

Notation. Throughout, $D$ denotes dataset size in tokens and $N$ denotes model size in parameters. $P_{\text{w}}, P_{\text{a}}, P_{\text{kv}}$ refer to the bit precision, in integer type, of the weights, activations, and key-value cache ("attention") during training, and $P_{\text{post}}$ refers to the precision we post-train quantize (PTQ) weights to at the end for model inference. When $P$ or $P_{\text{train}}$ is used without reference to a part of the model, all three model parts are tied to the same precision. The inference-time loss degradation induced by post-train quantization will be denoted $\delta_{\mathrm{PTQ}}(N, D, P_{\text{train}}, P_{\text{post}})$, and it is defined as the change in loss from performing post-training quantization compared to the end of pretraining. We use "high precision" to mean 16-bit or above.
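
Written out, this definition is simply a difference of two evaluation losses of the same model on the same data:

$$\delta_{\mathrm{PTQ}}(N, D, P_{\text{train}}, P_{\text{post}}) = L\big(\text{weights quantized to } P_{\text{post}}\big) - L\big(\text{end of pretraining}\big)$$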

2.1 Quantization Fundamentals: How, What, When

The Problem: Compute- vs Memory-Bound Workloads. Most deep learning workloads are bottlenecked either by compute, in the form of matrix multiplications, or by memory bandwidth, in the form of data movement between different parts of the GPU. Different types of workloads have different bottlenecks: most time during pretraining is spent doing large matrix multiplications, so it is compute-bound; in contrast, small-batch inference is bandwidth-bound by model weights, long-sequence decoding is bandwidth-bound by the KV cache, and so on. This motivates studying scaling in the training precision of the weights, activations, and KV cache, both in isolation and in combination.

Quantization: How. Quantization of an operation typically refers to rounding of values in matrices involved in some computation on the forward/backward pass, with accumulation of gradients in high/full precision. Quantization is usually done to integer or floating-point type.
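
As a concrete illustration of the rounding step, the sketch below fake-quantizes a weight matrix to a signed integer grid. The per-tensor absmax scaling is a common illustrative choice, not necessarily the exact scheme used in the paper.

```python
# Minimal symmetric round-to-nearest integer quantization of a weight matrix.
import torch

def quantize_int(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Fake-quantize `w` to a signed integer grid with `bits` bits."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit (levels clipped to +/-7)
    scale = w.abs().max() / qmax          # per-tensor absmax scale
    w_int = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return w_int * scale                  # dequantize back to float for the forward pass

w = torch.randn(256, 256)
w_q = quantize_int(w, bits=4)
print((w - w_q).abs().mean())             # average rounding error
```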

Quantization: What. Only weights ("quantization-aware training"). Quantizing only weights during training does not offer any compute savings because matrix multiplications are still done in high precision. However, this is commonly done to allow weights to adapt to low precision so they can be served at very low precision at inference time, thereby alleviating memory bottlenecks [Ma et al., 2024, Wang et al., 2023]. We will refer to this as "quantization-aware training."
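
Below is a minimal sketch of how weight-only quantization-aware training is commonly implemented, with a straight-through estimator and high-precision master weights. This is a generic recipe under stated assumptions, not necessarily the paper's exact training setup.

```python
# Weight-only QAT sketch: the forward pass sees quantized weights, while
# gradients flow unchanged to the high-precision master weights.
import torch

def fake_quant_ste(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    # Straight-through estimator: forward uses w_q, backward treats it as identity.
    return w + (w_q - w).detach()

w = torch.randn(128, 128, requires_grad=True)    # high-precision master weights
x = torch.randn(32, 128)
loss = (x @ fake_quant_ste(w)).pow(2).mean()
loss.backward()                                   # gradients accumulate on full-precision w
```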

Weights, activations, attention ("low-precision training"). Quantizing activations and attention in addition to weights allows for compute gains because matrix multiplications can be done in low precision (if the hardware supports it), since everything is in the same precision. FP8 training on the Hopper line of GPUs is an example [Micikevicius et al., 2022]. We will refer to this setting as "low-precision training" to distinguish it from quantization-aware training.

Quantization: When. Quantization can be done during or after training. In practice, when seeking to reduce inference-time memory costs, one first attempts post-train quantization. If that degrades the model too much, quantization-aware training is used. Post-train quantization is typically only applied to model weights [Frantar et al., 2022, Dettmers et al., 2022, Lin et al., 2023, Xiao et al., 2023]. To reduce pretraining costs, low-precision training is needed. We will study scaling laws for post-training quantization in Section 3, for quantized training in Section 4 (examining both quantization-aware training and low-precision training), and unify the two in Section 5. The numerical values of all our fitted constants can be found in Appendix 1.

$^{1}$ We study KV, rather than QKV, because understanding scaling in the KV cache alone is important for many inference settings. For the pretraining claims in Section 4.3, we quantize the entire attention computation, including queries, and find that additionally quantizing the query vectors makes a negligible difference to scaling.

2.2 Scaling Laws and Parametric Fits

Scaling Laws. Hoffmann et al. [2022] model loss scaling using the functional form $L(N, D) = A N^{-\alpha} + B D^{-\beta} + E$, where $A, B, \alpha, \beta, E$ are positive fitted constants, finding that data and parameters should be scaled in roughly equal proportion as more compute becomes available. We will refer to the scaling of Hoffmann et al. [2022] as "Chinchilla-optimal" or just "Chinchilla," and note this is often used colloquially as $D/N \approx 20$ being pretraining compute-optimal. On the theoretical front, work on scaling laws [Bahri et al., 2024, Bordelon et al., 2024, Lin et al., 2024a] finds that noise added to various parts of the model or data affects loss in a predictable way. While previous works have explored the scaling behavior of post-training quantization in terms of total model bits [Dettmers and Zettlemoyer, 2023] and knowledge capacity [Allen-Zhu and Li, 2024], we focus instead on data scaling. We note that in general the exact fitted values of all coefficients and exponents can vary drastically based on small implementation differences: Besiroglu et al. [2024] find different constants when attempting to replicate [Hoffmann et al., 2022], and Sardana and Frankle [2023] fit coefficients $A, B$ of different orders of magnitude. For this reason, we emphasize that our contribution is not the numerical values we fit, but the trends and functional forms we identify.
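
For readers who want to see what such a parametric fit looks like, the sketch below fits the Chinchilla functional form to synthetic $(N, D, \text{loss})$ observations with scipy. The synthetic data, initial guesses, and plain least-squares objective are illustrative assumptions; the papers cited above use their own fitting procedures (e.g., Huber losses in log space).

```python
# Fit L(N, D) = A*N^-alpha + B*D^-beta + E to (N, D, loss) observations.
import numpy as np
from scipy.optimize import curve_fit

def chinchilla(X, A, B, E, alpha, beta):
    N, D = X
    return A * N ** (-alpha) + B * D ** (-beta) + E

# Synthetic observations standing in for measured pretraining losses.
rng = np.random.default_rng(0)
N = rng.uniform(3e7, 2.2e8, size=50)          # parameters
D = rng.uniform(1.5e9, 2.6e10, size=50)       # tokens
loss = chinchilla((N, D), 400.0, 1500.0, 1.8, 0.33, 0.28) + rng.normal(0, 0.01, 50)

popt, _ = curve_fit(chinchilla, (N, D), loss,
                    p0=[300.0, 1000.0, 1.5, 0.3, 0.3],
                    bounds=(0, np.inf), maxfev=20000)
A, B, E, alpha, beta = popt
print(f"alpha={alpha:.3f}, beta={beta:.3f}, E={E:.3f}")
```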

Overtraining. In practice, accounting for inference costs means training smaller models for substantially longer than Chinchilla-optimal [Sardana and Frankle, 2023, Gadre et al., 2024]. For instance, Llama-3-8B is trained to $D/N \approx 2000$ [Dubey et al., 2024] and the Gemma-2 series up to $D/N > 1000$ [Team et al., 2024]. We refer to such models as "overtrained" in this paper, with the token/parameter ratio $D/N$ being a key quantity throughout. Work on inference-time compute [Snell et al., 2024, Brown et al., 2024] and on synthetic and multimodal data [Yang et al., 2024, Fan et al., 2024, Bauer et al., 2024] suggests future models may be even more overtrained. Therefore, modern work on scale must consider ratios much larger than Chinchilla-optimal, and in this work we perform experiments up to $D/N \approx 10^3$ and analyze the predictions found by our scaling law for up to $D/N \approx 10^5$. See Appendix B for additional related work.
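
As a quick sanity check on these ratios: using the publicly reported figure of roughly 15T pretraining tokens for Llama-3, an 8B-parameter model gives $D/N \approx 15 \times 10^{12} / 8 \times 10^{9} \approx 1900$, about two orders of magnitude beyond the Chinchilla-optimal $D/N \approx 20$.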

2.3 Setup

We train and evaluate a suite of OLMo-style models on the Dolma V1.7 dataset [Groeneveld et al., 2024, Soldaini et al., 2024], using a standard Transformer++ implementation; see Appendix A for hyperparameters and ablations. Our experiments consist of a sweep of language model pretraining runs over $N \in [30, 60, 110, 220]$ million parameters (non-embedding) and $D \in [1.5, 3, 6, 13, 26]$ billion tokens. Our model sizes are relatively small because we train up to a very high $D/N \approx 10^3$ to study data scaling and set off over 20 runs at every $(N, D)$: we sweep 8 values of precision for each of the weights, activations, and attention.
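
Purely as bookkeeping, the sketch below enumerates one plausible way to lay out such a sweep, varying the precision of one component at a time while keeping the rest in high precision. The component-wise layout and the exact precision values are assumptions for illustration, not the paper's actual experimental grid.

```python
# Illustrative enumeration of a precision sweep over the (N, D) grid.
from itertools import product

N_GRID = [30e6, 60e6, 110e6, 220e6]             # non-embedding parameters
D_GRID = [1.5e9, 3e9, 6e9, 13e9, 26e9]          # tokens
PRECISIONS = [3, 4, 5, 6, 7, 8, 12, 16]         # bits swept per component (assumed values)
COMPONENTS = ["weights", "activations", "kv"]

runs = []
for n, d in product(N_GRID, D_GRID):
    for comp, bits in product(COMPONENTS, PRECISIONS):
        cfg = {"N": n, "D": d, "weights": 16, "activations": 16, "kv": 16}
        cfg[comp] = bits                         # vary one component, keep the rest high precision
        runs.append(cfg)

print(len(runs) // (len(N_GRID) * len(D_GRID)), "runs per (N, D) in this layout")  # 24
```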

3 Scaling Laws for Post-Train Quantization

The easiest and most common quantization technique is post-train quantizing a model off-the-shelf [Chee et al., 2024, Huang et al., 2024, Dettmers et al., 2022, Lin et al., 2023, Xiao et al., 2023]. In this section, we consider models trained in BF16 and use GPTQ [Frantar et al., 2022] to post-train quantize them, replicating our findings with two other methods in Appendix F. We quantify the resulting loss degradation $\delta_{\mathrm{PTQ}}$, finding that post-train quantization scales poorly in data.
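
A minimal sketch of how $\delta_{\mathrm{PTQ}}$ can be measured in practice follows. It assumes a Hugging Face-style causal LM whose forward returns a `.loss` when given labels; simple round-to-nearest weight quantization stands in for GPTQ, and `model`/`val_loader` are hypothetical placeholders.

```python
# delta_PTQ = validation loss after quantizing weights - validation loss at end of pretraining.
import copy
import torch

@torch.no_grad()
def eval_loss(model, val_loader) -> float:
    model.eval()
    total, count = 0.0, 0
    for batch in val_loader:
        out = model(batch["input_ids"], labels=batch["input_ids"])
        total += out.loss.item() * batch["input_ids"].numel()
        count += batch["input_ids"].numel()
    return total / count

@torch.no_grad()
def rtn_quantize_(model, bits: int = 4) -> None:
    qmax = 2 ** (bits - 1) - 1
    for p in model.parameters():
        if p.dim() >= 2:                              # quantize weight matrices only
            scale = p.abs().max().clamp(min=1e-8) / qmax
            p.copy_(torch.clamp(torch.round(p / scale), -qmax, qmax) * scale)

def delta_ptq(model, val_loader, bits: int = 4) -> float:
    loss_pretrain = eval_loss(model, val_loader)
    quantized = copy.deepcopy(model)
    rtn_quantize_(quantized, bits)
    return eval_loss(quantized, val_loader) - loss_pretrain
```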
