Original paper: https://arxiv.org/pdf/2410.23168
TOKENFORMER: RETHINKING TRANSFORMER SCALING WITH TOKENIZED MODEL PARAMETERS
Haiyang Wang¹,³, Yue Fan¹, Muhammad Ferjad Naeem², Yongqin Xian²,
Jan Eric Lenssen¹, Liwei Wang³, Federico Tombari², Bernt Schiele¹
¹Max Planck Institute for Informatics  ²Google  ³Peking University
{haiwang, schiele}@mpi-inf.mpg.de
ABSTRACT
Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections. When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue growing, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce Tokenformer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at
https://github.com/Haiyang-W/TokenFormer
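To make the abstract's core idea concrete, the sketch below shows a token-parameter attention layer standing in for a linear projection: input tokens act as queries, and a learnable set of parameter tokens supplies the keys and values. This is a minimal PyTorch rendering for illustration only; the class name `TokenParamAttention`, the plain scaled softmax, and the initialization are our assumptions, and the paper's actual normalization and implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenParamAttention(nn.Module):
    """Illustrative stand-in for a d_in -> d_out linear projection:
    input tokens are queries; learnable parameter tokens supply keys/values."""

    def __init__(self, d_in: int, d_out: int, num_param_tokens: int):
        super().__init__()
        # Each row of (keys, values) is one key-value parameter pair.
        self.keys = nn.Parameter(torch.randn(num_param_tokens, d_in))
        self.values = nn.Parameter(0.02 * torch.randn(num_param_tokens, d_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in)
        scores = x @ self.keys.t() / self.keys.shape[-1] ** 0.5
        weights = F.softmax(scores, dim=-1)   # attend over parameter tokens
        return weights @ self.values          # (batch, seq_len, d_out)
```

Because the output dimension of each value token is fixed, widening the layer only means adding more key-value rows, which is what enables the incremental scaling described above.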
1 INTRODUCTION
Designing a powerful neural network architecture is a long-standing goal in machine learning. Recent developments in foundation models (FMs) have shown the potential of Transformers (Vaswani et al., 2017) as a universal computational architecture. Thanks to their flexibility and scalability, Transformers have achieved state-of-the-art performance across various domains, including natural language processing (NLP) (Radford et al., 2018; Alec et al., 2019; Brown et al., 2020), visual modeling (Dosovitskiy et al., 2021; Liu et al., 2021), vision-language (Liu et al., 2023; Wang et al., 2024), graph representation (Ying et al., 2021), and 3D vision (Wang et al., 2023a;b).
Transformers typically divide the computation required to process a single token into two distinct parts: interactions with other input tokens (token-token interaction) and computations involving the model’s parameters (token-parameter interaction). The attention mechanism (Vaswani et al., 2017) facilitates token-token interactions, allowing modern general-purpose foundation models to encode multi-modal data into a unified token sequence and effectively capture complex dependencies among them (Liu et al., 2023; Zhu et al., 2023; Wang et al., 2023d). Conversely, token-parameter computations rely heavily on linear projections (Dunford & Schwartz, 1988), where input tokens are multiplied by a fixed set of parameters. This prescribed design limits scalability because increasing the model size requires altering core architectural components, often necessitating retraining the entire model from scratch. As models grow larger, this results in excessive resource consumption, making it increasingly impractical. In this paper, we introduce a novel architecture that enhances the flexibility of token-parameter interactions, allowing for incremental scaling of model parameters and effectively reusing previously trained models, thus significantly reducing the training burden.
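Because the layer's capacity lives in a variable-size set of key-value pairs rather than a fixed-shape weight matrix, the model can be grown by appending pairs and continuing training, reusing what was already learned. The helper below sketches one plausible way to enlarge the `TokenParamAttention` layer from the earlier sketch; the hypothetical function name and the zero-initialization of new value tokens (so new pairs contribute little at first) are our assumptions, not necessarily the authors' exact recipe.

```python
import torch
import torch.nn as nn


def grow_token_param_attention(layer: nn.Module, num_new_pairs: int) -> nn.Module:
    """Append new key-value parameter pairs to a TokenParamAttention layer,
    reusing the already-trained pairs instead of retraining from scratch."""
    d_in = layer.keys.shape[1]
    d_out = layer.values.shape[1]
    new_keys = torch.randn(num_new_pairs, d_in)
    new_values = torch.zeros(num_new_pairs, d_out)  # assumption: zero-init new values
    layer.keys = nn.Parameter(torch.cat([layer.keys.detach(), new_keys], dim=0))
    layer.values = nn.Parameter(torch.cat([layer.values.detach(), new_values], dim=0))
    return layer


# Usage: start small, then widen the token-parameter layer and keep training.
layer = TokenParamAttention(d_in=768, d_out=768, num_param_tokens=1024)
layer = grow_token_param_attention(layer, num_new_pairs=512)
```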