Original paper: https://arxiv.org/pdf/2410.23168
TOKENFORMER: RETHINKING TRANSFORMER SCALING WITH TOKENIZED MODEL PARAMETERS
Haiyang Wang¹,³, Yue Fan¹, Muhammad Ferjad Naeem², Yongqin Xian²,
Jan Eric Lenssen¹, Liwei Wang³, Federico Tombari², Bernt Schiele¹
¹Max Planck Institute for Informatics  ²Google  ³Peking University
{haiwang, schiele}@mpi-inf.mpg.de
ABSTRACT
Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections. When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue growing, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce Tokenformer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at
https://github.com/Haiyang-W/TokenFormer
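To make the abstract's core idea concrete, the sketch below shows a token-parameter attention layer standing in for a linear projection: input tokens act as queries, and a learnable set of parameter tokens supplies the keys and values. This is a minimal PyTorch rendering for illustration only; the class name `TokenParamAttention`, the plain scaled softmax, and the initialization are our assumptions, and the paper's actual normalization and implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenParamAttention(nn.Module):
    """Illustrative stand-in for a d_in -> d_out linear projection:
    input tokens are queries; learnable parameter tokens supply keys/values."""

    def __init__(self, d_in: int, d_out: int, num_param_tokens: int):
        super().__init__()
        # Each row of (keys, values) is one key-value parameter pair.
        self.keys = nn.Parameter(torch.randn(num_param_tokens, d_in))
        self.values = nn.Parameter(0.02 * torch.randn(num_param_tokens, d_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in)
        scores = x @ self.keys.t() / self.keys.shape[-1] ** 0.5
        weights = F.softmax(scores, dim=-1)   # attend over parameter tokens
        return weights @ self.values          # (batch, seq_len, d_out)
```

Because the output dimension of each value token is fixed, widening the layer only means adding more key-value rows, which is what enables the incremental scaling described above.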
1 INTRODUCTION
Designing a powerful neural network architecture is a long-standing goal in machine learning. Recent developments in foundation models (FMs) have shown the potential of Transformers (Vaswani et al., 2017) as a universal computational architecture. Thanks to their flexibility and scalability, Transformers have achieved state-of-the-art performance across various domains, including natural language processing (NLP) (Radford et al., 2018; Alec et al., 2019; Brown et al., 2020), visual modeling (Dosovitskiy et al., 2021; Liu et al., 2021), vision-language (Liu et al., 2023; Wang et al., 2024), graph representation (Ying et al., 2021), and 3D vision (Wang et al., 2023a;b).
Transformers typically divide the computation required to process a single token into two distinct parts: interactions with other input tokens (token-token interaction) and computations involving the model’s parameters (token-parameter interaction). The attention mechanism (Vaswani et al., 2017) facilitates token-token interactions, allowing modern general-purpose foundation models to encode multi-modal data into a unified token sequence and effectively capture complex dependencies among them (Liu et al., 2023; Zhu et al., 2023; Wang et al., 2023d). Conversely, token-parameter computations rely heavily on linear projections (Dunford & Schwartz, 1988), where input tokens are multiplied by a fixed set of parameters. This prescribed design limits scalability because increasing the model size requires altering core architectural components, often necessitating retraining the entire model from scratch. As models grow larger, this results in excessive resource consumption, making it increasingly impractical. In this paper, we introduce a novel architecture that enhances the flexibility of token-parameter interactions, allowing for incremental scaling of model parameters and effectively reusing previously trained models, thus significantly reducing the training burden.
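Because the layer's capacity lives in a variable-size set of key-value pairs rather than a fixed-shape weight matrix, the model can be grown by appending pairs and continuing training, reusing what was already learned. The helper below sketches one plausible way to enlarge the `TokenParamAttention` layer from the earlier sketch; the hypothetical function name and the zero-initialization of new value tokens (so new pairs contribute little at first) are our assumptions, not necessarily the authors' exact recipe.

```python
import torch
import torch.nn as nn


def grow_token_param_attention(layer: nn.Module, num_new_pairs: int) -> nn.Module:
    """Append new key-value parameter pairs to a TokenParamAttention layer,
    reusing the already-trained pairs instead of retraining from scratch."""
    d_in = layer.keys.shape[1]
    d_out = layer.values.shape[1]
    new_keys = torch.randn(num_new_pairs, d_in)
    new_values = torch.zeros(num_new_pairs, d_out)  # assumption: zero-init new values
    layer.keys = nn.Parameter(torch.cat([layer.keys.detach(), new_keys], dim=0))
    layer.values = nn.Parameter(torch.cat([layer.values.detach(), new_values], dim=0))
    return layer


# Usage: start small, then widen the token-parameter layer and keep training.
layer = TokenParamAttention(d_in=768, d_out=768, num_param_tokens=1024)
layer = grow_token_param_attention(layer, num_new_pairs=512)
```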