Source: https://ekinakyurek.github.io/papers/ttt.pdf
The Surprising Effectiveness of Test-Time Training for Abstract Reasoning
Ekin Akyürek Mehul Damani Linlu Qiu Han Guo Yoon Kim Jacob Andreas Massachusetts Institute of Technology
Abstract
Language models have shown impressive performance on tasks within their training distribution, but often struggle with novel problems requiring complex reasoning. We investigate the effectiveness of test-time training (TTT), which temporarily updates model parameters during inference using a loss derived from input data, as a mechanism for improving models' reasoning capabilities, using the Abstraction and Reasoning Corpus (ARC) as a benchmark. Through systematic experimentation, we identify three crucial components for successful TTT: (1) initial fine-tuning on similar tasks, (2) auxiliary task format and augmentations, and (3) per-instance training. TTT significantly improves performance on ARC tasks, achieving up to a $6\times$ improvement in accuracy compared to base fine-tuned models; applying TTT to an 8B-parameter language model, we achieve 53% accuracy on ARC's public validation set, improving the state of the art by nearly 25% for public and purely neural approaches. By ensembling our method with recent program generation approaches, we obtain a state-of-the-art public validation accuracy of 61.9%, matching the average human score. Our findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in neural language models; additional test-time compute applied to continued training on few-shot examples can also be extremely effective.
1 Introduction
Large-scale neural language models (LMs) excel at performing tasks that occur in their training data, and often elementary variations or compositions of those tasks (Brown et al., 2020; Todd et al., 2024). Given natural language task specifications or a small number of examples, LMs often successfully infer the desired task and produce an appropriate output. But can LMs also solve new problems, involving non-trivial reasoning, planning, or string manipulation of a kind very different from their pre-training data? This question is central to understanding the novel skill acquisition capabilities of current AI systems, which has been proposed as a key measure of intelligence (Chollet, 2019). For complex and novel tasks, it is often difficult to obtain a correct answer simply by sampling from an LM (Wu et al., 2023). However, a significant finding in recent years has been that LM performance can be substantially improved by augmenting LM decoding with additional test-time computation. Methods in this category include chain-of-thought prompting (Wei et al., 2022), sampling with majority voting (self-consistency; Wang et al., 2022), code execution (Brown et al., 2024; Snell et al., 2024; Damani et al., 2024), and search (Yao et al., 2024).
Figure 1: (Left) Pass@2 accuracy on a subset of 80 randomly selected ARC validation tasks. TTT boosts the performance of fine-tuned models (FT) by up to $6\times$, with consistent improvements across different model sizes. (Right) Example of a task that the model successfully solves only after applying TTT. Full dataset results are in Section 6.
One scaling strategy that has gained recent attention is test-time training (TTT), in which models are updated through explicit gradient steps based on test-time inputs (Krause et al., 2018; 2019). This method differs from standard fine-tuning in that it operates in an extremely low-data regime, typically via an unsupervised objective on a single input, or a supervised objective applied to one or two in-context labeled examples. Modern versions of this approach were proposed for vision models by Sun et al. (2020), and were also applied to sequence models by Gandelsman et al. (2022). The design space for TTT approaches is large, and there is currently limited understanding of which design choices are most effective for LMs (and specifically for novel-task learning). In this paper, we systematically study the impact of various TTT design choices, as well as their interaction with pre-training and sampling schemes.
We evaluate these methods in the Abstraction and Reasoning Corpus (ARC) (Chollet, 2019), a collection of extremely challenging few-shot visual reasoning problems. ARC is an ideal benchmark for testing the limits of LM generalization as it presents novel tasks, in a novel format, requiring nontrivial search and inference capabilities. Current language models perform poorly on ARC. Most successful approaches have relied on program synthesis techniques (Butt et al., 2024; Ainooson et al., 2023; Huang et al., 2023), though recently Cole et al. (2024) reported promising results using TTT on the benchmark.
We identify several crucial ingredients for effective application of TTT to few-shot learning: (1) initial fine-tuning on synthetic tasks similar to those encountered at test time, (2) an augmented, leave-one-out task generation strategy for constructing the test-time dataset, (3) per-instance adapter training, and (4) a self-consistency (Wang et al., 2022) approach under invertible transformations. With careful choices of these components, TTT can significantly improve LM performance on ARC, increasing accuracy by up to a factor of six over a 1B model and achieving state-of-the-art results for published, purely neural models on the ARC task with an 8B model. Indeed, our results show that when equipped with test-time training, ordinary LMs can match or exceed the performance of many neuro-symbolic approaches on ARC.
Our main contributions¹ are:
- We identify and systematically analyze the key components needed for test-time training on ARC tasks, with a novel test-time training data generation and self-consistency component.
- We achieve state-of-the-art results among published neural approaches on the ARC validation set:
  - 53% accuracy on the public validation set with an 8B-parameter model.
  - 61.9% accuracy when ensembled with program synthesis approaches, matching average human performance on the dataset.
- We demonstrate that tasks that could previously only be solved by program synthesis can be solved with fully neural approaches equipped with our TTT framework.
These results challenge the assumption that symbolic components are strictly necessary for solving such complex tasks. Instead, they suggest that the critical factor in solving novel reasoning problems may be the allocation of proper computational resources during test time, perhaps independently of whether these resources are deployed through symbolic or neural mechanisms.
2 Preliminaries
In this section, we first formally describe the ARC challenge. Next, we give an overview of in-context learning and test-time training, which form the foundation of our investigation. Finally, we detail our default experimental setup.
¹ Our implementation can be found at this link.
2.1 ARC Challenge
The Abstraction and Reasoning Corpus (ARC) aims to evaluate the abstract reasoning capabilities of language models through their ability to solve visual puzzles. Each puzzle, henceforth referred to as a task, comprises input-output pairs of 2-D grids (up to $30 \times 30$ in size) that contain shapes or patterns made with up to 10 different colors, as displayed in Fig. 1(b). The output of each pair is obtained by applying an intuitive and shared transformation rule or function $y = f(x)$. In practice, these transformations are highly diverse and composite, ranging from simple concepts such as reflection and counting to more complex ones such as application of gravity and path finding.
Each task in ARC is composed of a training and test split, with:
- Training examples denoted $(x_k^{\text{train}}, y_k^{\text{train}})_{k=1}^{K}$ (typically $K$ ranges from 2 to 7).
- Test examples denoted $(x_m^{\text{test}}, y_m^{\text{test}})_{m=1}^{M}$ (typically $M$ ranges from 1 to 3).
Given the set of training examples, the goal is to predict the test output $y^{\text{test}}$ for the test input $x^{\text{test}}$ by reasoning about the underlying transformation.
We denote a task as $d = (\mathbf{x}^{\text{train}}, \mathbf{y}^{\text{train}}, \mathbf{x}^{\text{test}}, \mathbf{y}^{\text{test}})$, where $d \in \mathcal{D}_{\mathrm{ARC}}$, the collection of such ARC tasks. The original training and validation sets of the ARC dataset, respectively $\mathcal{D}_{\mathrm{ARC}}^{\text{train}}$ and $\mathcal{D}_{\mathrm{ARC}}^{\text{val}}$, consist of 400 tasks each. The success criterion requires an exact match for all test outputs (otherwise, partial points are given). Please refer to Johnson et al. (2021) for a taxonomy and analysis of these tasks.
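The task notation above can be made concrete with a small data structure. The following is a minimal sketch, not the authors' implementation: the names `ArcTask`, `task_solved`, and `flip` are hypothetical, grids are assumed to be nested lists of color indices, and only the all-or-nothing exact-match check is modeled (partial credit is left out).

```python
from dataclasses import dataclass
from typing import List, Tuple

Grid = List[List[int]]  # 2-D grid of color indices (assumed encoding: 0-9)


@dataclass
class ArcTask:
    """One task d = (x_train, y_train, x_test, y_test)."""
    train: List[Tuple[Grid, Grid]]  # K demonstration pairs
    test: List[Tuple[Grid, Grid]]   # M held-out pairs


def task_solved(task: ArcTask, predictions: List[Grid]) -> bool:
    """Return True only if every prediction exactly matches its test output."""
    if len(predictions) != len(task.test):
        return False
    return all(pred == y for pred, (_, y) in zip(predictions, task.test))


def flip(grid: Grid) -> Grid:
    """Toy transformation f: mirror each row (horizontal reflection)."""
    return [row[::-1] for row in grid]


# Toy task whose hidden rule is the horizontal reflection above.
toy = ArcTask(
    train=[([[1, 0]], [[0, 1]]), ([[2, 3]], [[3, 2]])],
    test=[([[4, 5]], [[5, 4]])],
)
```

Here `task_solved(toy, [flip(toy.test[0][0])])` holds, while returning the unmodified test input does not count as a solution.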
Most approaches to ARC fall into two main categories: program synthesis and fully neural. Program synthesis approaches (Butt et al., 2024; Wang et al., 2024; Li et al., 2024; Greenblatt, 2024) try to first find the transformation function $f$, and later apply it to the test example. On the other hand, fully neural approaches (Thoms et al., 2023; Bober-Irizar and Banerjee, 2024) try to directly predict the output $y^{\text{test}}$, only implicitly reasoning about the underlying transformation. In this work, we use a fully neural approach, using an LM to predict the test outputs.
We start with an LM pre-trained on text data (without a vision encoder). To provide ARC examples as input to these models, we thus require a formatting function (denoted str) that converts 2-D grids into their textual representations, as shown in Appendix A.3. Previous work has presented examples as lists of numbers (Wang et al., 2024) or color words, or lists of connected components labeled with shapes and locations (Greenblatt, 2024). Given any such string representation of a task, we may present it to an LM and perform predictions with few-shot prompting, as explained in the next section.
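As an illustration of such a formatting function, the sketch below relies on numpy's default array printing, which is the format the experimental setup reports using. The function name `grid_to_str` is hypothetical, and the exact prompt layout in the paper's Appendix A.3 may differ.

```python
import numpy as np


def grid_to_str(grid) -> str:
    """Render a 2-D grid of color indices with numpy's default array printing."""
    return str(np.array(grid))


small = grid_to_str([[0, 1], [2, 3]])
print(small)

# The largest ARC grid (30 x 30 = 900 cells) should stay below numpy's default
# summarization threshold, so every cell is printed rather than elided.
largest = grid_to_str(np.zeros((30, 30), dtype=int))
```

For the 2 x 2 example this prints `[[0 1]` on the first line and ` [2 3]]` on the second.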
2.2 In-context Learning
At a certain scale, many LMs exhibit the ability to adapt to new tasks without updating their parameters, simply by conditioning on the input examples or instructions provided. Given a sequence of input-output pairs $(x_1, y_1), \ldots, (x_n, y_n)$ and a new input $x_{n+1}$, an LM can be used to generate the output $\widehat{y}_{n+1}$ by sampling from:
$$\widehat{y}_{n+1} \sim \mathrm{LM}(\cdot \mid x_1, y_1, \ldots, x_n, y_n, x_{n+1})$$
In-context learning does not resemble any standard machine learning algorithm (Zhao et al., 2024; Min et al., 2022), and it does not work out of the box for novel tasks; for example, small language models (a few billion parameters) perform poorly on ARC (Opielka et al., 2024; Bober-Irizar and Banerjee, 2024).
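To make the in-context setup concrete, the following sketch assembles $n$ demonstration pairs and a new input into a single prompt that an LM would then complete. The `Input:`/`Output:` template and the name `build_prompt` are assumptions for illustration; the source does not specify the template in this section.

```python
import numpy as np


def build_prompt(demos, x_new) -> str:
    """Concatenate demonstration pairs and a new input into a few-shot prompt.

    The template is a hypothetical choice; the model is expected to continue
    the text after the final "Output:" marker with its prediction.
    """
    parts = []
    for x, y in demos:
        parts.append(f"Input:\n{str(np.array(x))}\nOutput:\n{str(np.array(y))}\n")
    parts.append(f"Input:\n{str(np.array(x_new))}\nOutput:\n")
    return "\n".join(parts)


demos = [([[1, 0]], [[0, 1]]), ([[2, 3]], [[3, 2]])]
prompt = build_prompt(demos, [[4, 5]])
print(prompt)
```

No parameters change in this setting: all task information reaches the model through the prompt string.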
2.3 Test-Time Training
Test-time training (TTT) enables parametric models to adapt during inference through dynamic parameter updates, an approach that remains relatively unexplored in the era of large language models. This technique is a form of transductive learning, where a model leverages the structure of the test data to improve its predictions. The general TTT process works as follows. Starting with initial model parameters $\boldsymbol{\theta}_0$, for each test input (or batch of inputs), we first generate training data $\mathcal{D}_{\text{TTT}}(d_{\text{input}})$ from the test inputs. We then optimize these parameters to minimize a loss function $\mathcal{L}(\mathcal{D}_{\text{TTT}}; \boldsymbol{\theta})$, producing temporarily updated parameters $\boldsymbol{\theta}_d$ for prediction. After generating predictions, the model is restored to the original parameters $\boldsymbol{\theta}_0$ for the next instance or batch. Thus, TTT trains a specialized prediction model for each test input, obtained by fine-tuning a base model on a test-time dataset generated from that test input.
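The copy, adapt, predict, and restore cycle described above can be sketched on a toy problem. This is an illustration under strong simplifying assumptions: the "model" is a linear map trained by plain gradient descent on a squared error, not a language model, and the name `ttt_predict` and its hyperparameters are hypothetical.

```python
import numpy as np


def ttt_predict(theta0, ttt_inputs, ttt_targets, x_query, lr=0.1, steps=200):
    """Adapt a temporary copy of theta0 on the test-time dataset, then predict.

    theta0 is never modified, so the next test instance starts again from the
    original parameters.
    """
    theta = theta0.copy()  # temporary parameters theta_d
    n = len(ttt_inputs)
    for _ in range(steps):
        residual = ttt_inputs @ theta - ttt_targets
        grad = 2.0 * ttt_inputs.T @ residual / n  # gradient of mean squared error
        theta -= lr * grad
    return x_query @ theta


theta0 = np.zeros(2)
# Test-time dataset generated by a hidden linear rule with weights [2, -1].
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = X @ np.array([2.0, -1.0])
prediction = ttt_predict(theta0, X, Y, np.array([3.0, 1.0]))
```

After adaptation the prediction is close to 5, the value the hidden rule assigns to the query, while `theta0` is still all zeros and can be reused for the next instance.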
Figure 2: TTT dataset generation for a test task (Section 3.1): We start by creating leave-one-out tasks from the given training examples of the task. These tasks are then augmented through rule-based transformations to obtain the full TTT dataset. Finally, we train task-specific LoRA adapters on top of the base FT model.
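The dataset construction summarized in the Figure 2 caption can be sketched as follows. The specific augmentations (identity, left-right flip, 90-degree rotation) are assumptions chosen for illustration, as are the names `leave_one_out` and `build_ttt_dataset`; the cap of 250 examples per task follows the experimental setup stated in this paper.

```python
import numpy as np


def leave_one_out(train_pairs):
    """For each pair, hold it out as the query and keep the rest as demonstrations."""
    tasks = []
    for i, held_out in enumerate(train_pairs):
        demos = train_pairs[:i] + train_pairs[i + 1:]
        tasks.append((demos, held_out))
    return tasks


# Hypothetical rule-based transformations applied to inputs and outputs alike.
AUGMENTATIONS = {
    "identity": lambda g: g,
    "flip_lr": np.fliplr,
    "rot90": np.rot90,
}


def build_ttt_dataset(train_pairs, max_examples=250):
    """Augment the demonstrations, then expand each variant into leave-one-out tasks."""
    dataset = []
    for transform in AUGMENTATIONS.values():
        augmented = [(transform(x), transform(y)) for x, y in train_pairs]
        dataset.extend(leave_one_out(augmented))
    return dataset[:max_examples]


pairs = [(np.array([[i, 0], [0, 0]]), np.array([[0, i], [0, 0]])) for i in (1, 2, 3)]
ttt_data = build_ttt_dataset(pairs)
```

With three demonstration pairs and three augmentations, this yields nine synthetic tasks, each with two demonstrations and one held-out query.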
In past work (e.g. Sun et al., 2020), $\mathcal{D}_{\text{TTT}}$ is typically constructed by applying an unsupervised objective (e.g. masked autoencoding) to the input $\mathbf{x}$ alone. However, the in-context learning setting we consider provides richer context in the form of demonstration pairs $(x_1, y_1), \ldots, (x_K, y_K)$. Here, applying test-time tuning involves first constructing an initial language model LM, mapping each test input $x$ to an input-specific dataset $\mathcal{D}_{\text{TTT}}$, fine-tuning the LM to optimize some loss function $\mathcal{L}$ over the dataset according to $\sum_{d \in \mathcal{D}_{\text{TTT}}} \mathcal{L}(\mathrm{LM}(d))$, and finally sampling from the updated model to obtain a final prediction. Our experiments in this paper characterize each component of this pipeline, describing:
- How to construct the augmented TTT dataset $\mathcal{D}_{\text{TTT}}$ from the test input (Section 3).
- An augmented inference strategy based on self-consistency over transformations (Section 4).
- A base model with parameters $\boldsymbol{\theta}_0$ that is fine-tuned on a dataset $\mathcal{D}_{\mathrm{FT}}$ of similar tasks (Section 5).
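The second item, self-consistency over transformations, can be illustrated with a simple voting sketch: predict under several invertible transformations, map each prediction back to the original frame, and keep the most frequent candidates. The actual procedure is described in Section 4 and may differ; the transformation set, the `vote` function, and the deliberately faulty `toy_predict` below are all hypothetical.

```python
from collections import Counter

import numpy as np

# name -> (forward transformation, its inverse); an assumed, illustrative set.
TRANSFORMS = {
    "identity": (lambda g: g, lambda g: g),
    "flip_lr": (np.fliplr, np.fliplr),
    "rot90": (lambda g: np.rot90(g, 1), lambda g: np.rot90(g, -1)),
}


def vote(predict, x_test, top_k=2):
    """Return the top_k most frequent predictions after undoing each transformation."""
    counts = Counter()
    grids = {}
    for name, (forward, inverse) in TRANSFORMS.items():
        pred = np.asarray(inverse(predict(forward(x_test), name)))
        key = (pred.shape, pred.tobytes())  # hashable identity of the grid
        counts[key] += 1
        grids[key] = pred
    return [grids[key] for key, _ in counts.most_common(top_k)]


def toy_predict(x, name):
    """Stand-in for a model: the true rule is (x + 1) % 10, with one simulated error."""
    y = (x + 1) % 10
    if name == "rot90":
        y[0, 0] = 0  # corrupt one cell under this transformation
    return y


x = np.array([[1, 2], [3, 4]])
candidates = vote(toy_predict, x)
```

Two of the three transformed predictions agree once inverted, so the correct grid ranks first; returning two candidates mirrors the Pass@2 metric reported in Figure 1.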
2.4 Experimental Setup
To investigate the impact of each TTT component, we conduct experiments by varying one component while holding the others constant at their optimal values (described in their respective sections). Our default configuration in the experiments uses the following settings:
Model Architecture & Optimization. We use an 8B-parameter language model from the Llama-3 family, and 1B and 3B models from the Llama-3.2 family (Dubey et al., 2024). We use Low-Rank Adaptation (LoRA) (Hu et al., 2021) for parameter-efficient test-time training. For each task $d$, we initialize a separate set of LoRA parameters that are trained on the dataset $\mathcal{D}_{\text{TTT}}$. The LoRA rank is set to 128, and adaptations are applied to the MLP, attention, and output layers. We train models with the AdamW optimizer (Loshchilov and Hutter, 2019) for 2 epochs with a batch size of 2.
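To show why per-task LoRA adapters are cheap to train and discard, the sketch below implements a single low-rank adapted linear layer in numpy. It is a simplified illustration, not the training code used in the paper: the class `LoRALinear` is hypothetical, the usual LoRA scaling factor is omitted, and the 4096 x 4096 layer size used in the parameter count is an illustrative assumption rather than a figure from the text.

```python
import numpy as np


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A."""

    def __init__(self, W, rank, rng):
        d_out, d_in = W.shape
        self.W = W                                         # frozen base weight
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # small random init
        self.B = np.zeros((d_out, rank))                   # zero init: no change at start

    def __call__(self, x):
        return x @ self.W.T + (x @ self.A.T) @ self.B.T


rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))
layer = LoRALinear(W, rank=8, rng=rng)
x = rng.normal(size=(4, 32))

adapter_params = lora_param_count(4096, 4096, 128)
full_params = 4096 * 4096
```

Because `B` starts at zero, the adapted layer initially reproduces the base model exactly. For the assumed square layer, a rank-128 adapter has 1,048,576 trainable parameters, 6.25% of the 16,777,216 in the full weight matrix.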
Data & Formatting. For efficient evaluation, we randomly pick 80 balanced ARC tasks from the ARC validation set: 20 easy, 20 medium, 20 hard, and 20 expert tasks according to the classification in LeGris et al. (2024a) (see Appendix A.2 for the task list). We use this subset of ARC tasks throughout the paper, except for our final results, which are given for the full validation set (Section 6). We limit $\mathcal{D}_{\text{TTT}}$ to a maximum of 250 examples per task for efficiency. With this limit, the whole TTT and inference process takes approximately 12 hours for 100 randomly sampled validation tasks on an NVIDIA A100 GPU. Appendix B.2 provides additional details on the hyperparameters. Input grids are converted to text using numpy's default array printing format, as shown in Fig. 8.