LLM Notes 6: A Survey of Dataset Distillation Papers

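Θ is the distribution from which the network parameters are initialized, l is the loss used to train the network (e.g. cross-entropy for classification), T is the number of inner iterations, and η is the inner-loop learning rate. The performance-matching objective is a bi-level optimization: in the inner loop, the parameters θ are updated on 𝒮 by gradient descent while the recursive computation graph is cached; in the outer loop, the model trained in the inner loop is evaluated on 𝒯, and the validation loss is backpropagated through the unrolled computation graph to 𝒮. DD methods built on this objective resemble meta-learning and gradient-based hyperparameter optimization.

The gradient update uses back-optimization (Domke, 2012; Maclaurin et al., 2015).

In short: the inner loop trains on 𝒮 for T iterations to obtain the network parameters θ; the outer loop evaluates on 𝒯 to obtain a loss, which is backpropagated to update 𝒮.

The outer optimization step is computationally expensive, and the GPU memory required grows in proportion to the number of inner-loop steps. As a result, the inner loop cannot be trained long enough.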
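The bi-level structure described above can be illustrated on a toy problem. This is a minimal sketch under loud assumptions, not any paper's implementation: the model is a 1-D linear map `y = w * x`, only the synthetic label is learned, and the outer gradient is approximated by central finite differences instead of backpropagating through the unrolled inner loop. All function names and constants are hypothetical.

```python
# Toy performance matching: 1-D linear model y = w * x with squared loss.
# Assumption: finite differences stand in for backprop through the unrolled graph.

def inner_train(syn, w0=0.0, lr=0.1, steps=20):
    """Inner loop: train w on the synthetic set for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in syn) / len(syn)
        w -= lr * grad
    return w

def outer_loss(syn, real):
    """Outer objective: real-data loss of the model trained on syn."""
    w = inner_train(syn)
    return sum((w * x - y) ** 2 for x, y in real) / len(real)

def distill(real, syn, outer_lr=0.05, outer_steps=200, eps=1e-4):
    """Outer loop: move each synthetic label downhill on the real-data loss."""
    syn = [list(p) for p in syn]
    for _ in range(outer_steps):
        for p in syn:
            y0 = p[1]
            p[1] = y0 + eps
            up = outer_loss(syn, real)
            p[1] = y0 - eps
            down = outer_loss(syn, real)
            p[1] = y0 - outer_lr * (up - down) / (2 * eps)
    return [tuple(p) for p in syn]

real = [(x / 10.0, 3.0 * x / 10.0) for x in range(-10, 11)]  # samples of y = 3x
syn_init = [(1.0, 0.0)]  # one synthetic point with a deliberately wrong label
syn = distill(real, syn_init)
```

Every outer step here reruns the whole inner loop; with real backpropagation the inner steps would instead be kept in memory, which is where the memory cost noted above comes from.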

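Parameter Matching

The idea is to train the same network on the synthetic dataset and on the original dataset for some number of steps, and to make the resulting network parameters agree. By the number of training steps, methods split into two streams: single-step and multi-step parameter matching.

1. Single-step parameter matching, also called gradient matching

Objective: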
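A reconstruction of the objective from the symbols defined in this section. The expectation over initializations and the update rule for θ are assumptions based on the usual gradient-matching formulation, not copied from the original figure:

```latex
\mathcal{L}(\mathcal{S}) =
  \mathbb{E}_{\theta^{(0)} \sim \Theta}
  \left[ \sum_{t=0}^{T}
    \mathcal{D}\big( \nabla l(\mathcal{S}; \theta^{(t)}),\
                     \nabla l(\mathcal{T}; \theta^{(t)}) \big) \right],
\qquad
\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla l(\mathcal{S}; \theta^{(t)})
```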

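where the metric 𝒟 measures the distance between the gradients ∇l(𝒮;θ(t)) and ∇l(𝒯;θ(t)), computed for example as negative cosine similarity. Because only single-step gradients are needed, and the synthetic data and the network are updated separately, this approach is more memory-efficient than meta-learning-based performance matching.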

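2. Multi-step parameter matching

Gradient matching only matches single-step gradients, so errors may accumulate at evaluation time, when a model is trained on the synthetic data for many steps.

MTT: matching training trajectories

θ is initialized by sampling from the checkpoints of training trajectories on 𝒯. The model is trained for Ts steps on 𝒮 and for Tt steps on 𝒯, where Ts and Tt are hyperparameters, and the method minimizes the distance between the endpoints of the two trajectories, i.e. between θ𝒮(Ts) and θ𝒯(Tt).

The main difference from the single-step case is that the initial parameters are re-aligned at the start of every segment, and matching happens at the end of each segment.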

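Distribution Matching

Because images are high-dimensional, estimating the data distribution directly can be expensive and inaccurate. Instead, a family of neural networks (embedding functions) is used to embed the data; each function gives a partial interpretation of the input, and together they give a complete one. Denoting the parametric function by fθ, distribution matching minimizes a distance 𝒟 between the distributions of the embeddings of 𝒮 and of 𝒯,

where 𝒟 is some metric measuring the distance between two distributions.

The goal is that, for each class c, the output embedding distributions of the synthetic and real datasets are close.

If 𝒟 is the MMD, this amounts to making the centers of the embeddings (the mean vectors μ) close.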
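The mean-embedding case can be sketched as follows. This is an illustration under assumptions: `embed` is a hypothetical fixed feature map standing in for a sampled embedding network, samples are plain floats, and the loss is the per-class squared distance between mean embeddings.

```python
def mean_embedding(samples, embed):
    """Average the embedding vectors of a list of samples."""
    vecs = [embed(x) for x in samples]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def dm_loss(real_by_class, syn_by_class, embed):
    """Sum over classes of the squared distance between mean embeddings."""
    total = 0.0
    for c in real_by_class:
        mu_real = mean_embedding(real_by_class[c], embed)
        mu_syn = mean_embedding(syn_by_class[c], embed)
        total += sum((a - b) ** 2 for a, b in zip(mu_real, mu_syn))
    return total

# Hypothetical embedding function (assumption, not from any paper).
def embed(x):
    return [x, x * x]

real = {0: [1.0, 2.0, 3.0], 1: [-1.0, -3.0]}
syn_good = {0: [1.0, 3.0], 1: [-1.0, -3.0]}  # class means close to the real ones
syn_bad = {0: [0.0], 1: [0.0]}               # class means far from the real ones
```

Note that nothing in `dm_loss` trains a network, which is why this family of methods avoids the inner loop.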

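Network update process

See the discussion in the IDC section.

This problem is especially severe in parameter matching. In MTT, for example, gradients must be computed for each of the \( T \) updates during backpropagation, so the computation graph must retain all intermediate states and gradients for every step of the trajectory.

Unroll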


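The traditional approach computes through the unrolled computational graph.

This approach has to traverse every network update step when computing gradients and updating the synthetic dataset. The time for one full outer iteration (network updates plus the synthetic-data update) is therefore proportional to the number of loss evaluations: every update step evaluates the loss and its gradient with respect to the parameters to determine the direction and size of the update, so in a fully unrolled run the loss evaluations and the update steps grow together.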

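KIP: uses kernel ridge regression with the NTK to obtain a closed-form solution of the inner problem, reducing the original bi-level optimization to a single-level one. Its drawback is that the NTK computation takes thousands of GPU hours.

FRePo (Zhou et al.): treats only the last-layer parameters of the network as learnable and keeps the other parameters fixed. With this approximation FRePo obtains a closed-form ridge regression solution. Although FRePo is faster than KIP, it still has to store the whole computation graph and perform heavy matrix inversions.

TESLA (Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory): caches the matrices that are reused within the same iteration, so that the memory cost no longer depends on the number of trajectory-matching steps T.

Its memory requirement depends only on a few fixed factors (synthetic dataset size, model architecture).

Separately, some methods avoid inner-loop training altogether: distribution matching (DM) does not update the network, and KIP, based on kernel ridge regression (KRR), replaces inner-loop model training with optimization in the sample feature space.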

Synthetic Data Parameterization

Synthetic data can be encoded to reduce storage cost, or transformed to implement data augmentation.

Differentiable siamese augmentation(DSA 3.4.1)

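In general, a synthetic dataset is stored as codes z in a form different from the raw data format. A function g with parameters φ maps each d'-dimensional code back to the original image format for downstream training, so each synthetic sample is gφ(z) for a code z ∈ 𝒵.

Because the generation process is differentiable, φ and 𝒵 can be updated end to end by backpropagating the loss on 𝒮.

In the figure, z is the form in which the synthetic data is stored; it differs from an ordinary image (or the images used at training time). The mapping gφ converts it back to the ordinary image format, which is then fed to the downstream model for training.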

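DSA:

Differentiable Siamese Augmentation (DSA) is a set of augmentation strategies designed to improve data efficiency, including crop [1], cutout [98], flip, scale, rotate, and color jitter [1]. It was first applied to dataset distillation by Zhao et al. [23] to improve data efficiency and thus generalization. Here gφ(⋅) is a family of image transformations parameterized by φ∼Φ, where φ is the augmentation parameter, and the code z keeps the same format as the raw image. The transformations are differentiable, so the synthetic data can be optimized by gradient descent. In addition, the synthetic and real samples share the same augmentation parameters in each iteration (i.e. the same operation is applied to both). The technique has been adopted by many subsequent DD methods, e.g. DM [24], MTT [81] and TESLA [95].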

How labels are learned

Label

Label Distillation Methods

TDD

https://ar5iv.org/html/1910.02551#S4.F7

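Each synthetic sample is assigned a "soft" label (a distribution over labels).

The labels output by a larger network are not "hard" labels. They are usually the output of a softmax layer, so they form a probability distribution over the possible classes. The idea is that any training image carries information about several classes (for example, an image of the digit '3' looks a lot like other '3's, but it also looks like the digit '8'). Soft labels let us convey more information about the associated image.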
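A small sketch of what a soft label is and how it enters a loss. The logits and the class order are made up for illustration; the loss is the standard cross-entropy against a distribution instead of a one-hot vector.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(pred_logits, soft_label):
    """Cross-entropy between a soft label and the model's predicted distribution."""
    log_probs = [math.log(p) for p in softmax(pred_logits)]
    return -sum(q * lp for q, lp in zip(soft_label, log_probs))

# Hypothetical logits for an image of a '3' over the classes ('3', '8', '1').
teacher_logits = [4.0, 2.5, 0.0]
soft_label = softmax(teacher_logits)  # most mass on '3', some on '8'
hard_label = [1.0, 0.0, 0.0]          # the one-hot alternative
```

The soft label keeps the information that this '3' resembles an '8' far more than a '1', which the hard label discards.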

Example:

Each image is labeled with its top 3 classes and their associated logits. Only 3 of the 10 steps are shown.

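Classic methods: DC, DSA, DM, MTT and FRePo

Open problems: computational cost, raising the compression ratio (current methods do not scale to large models and large datasets), and cross-architecture generalization

Representative algorithms

The remaining two methods should be analyzed along the same axes, especially the optimization objective.

S2L is a coreset method, but its optimization objective is still trajectory matching.

PDD builds on MTT and is a DD method; its objective is also trajectory matching.

Neither learns labels.

PDD's parameterization and network update should be the same as MTT's.

S2L should not involve either of these.

The extra paper I picked should also be a selection method, but with a distribution-matching objective.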

SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models

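The VITA lab works on efficient training and inference of LLMs, optimization, and image generation.

VITA

Yu Yang's homepage: https://sites.google.com/g.ucla.edu/yuyang/home

The research there focuses on understanding and improving large-scale training data for efficient and robust learning.

The code does not appear to have been released yet.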

https://github.com/YuYang0901/S2L

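Data selection by summarizing the training trajectories of small models

The method records training trajectories on a small model, clusters them, and samples from each cluster to select data. The rationale is that examples with similar training-loss behavior have similar value; the benefit is lower compute and storage cost.

A training "trajectory" is the path traced by some metric of a data point (e.g. a training example), such as its loss, as training time or training steps advance.

Examples in the same trajectory cluster on the small model (Pythia-70M) also have similar training trajectories on the large model (Pythia-2.8B), even though the trends on the two models may differ.

The ablation experiments yield several conclusions about how the kind of trajectory affects the result:

Trajectory length has little effect.

When the trajectory is taken (early, middle or late in training) has little effect.

Trajectory sparsity has a larger effect.

Metrics considered in the experiments: confidence, perplexity, and learnability.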

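Method

In the taxonomy above this is a data selection method whose optimization objective is parameter matching.

Ideas:

1. If the loss trajectories of examples decline in a similar way, the examples likely contain knowledge of similar difficulty.

2. Small-to-Large Data Selection.

Examples with similar training dynamics on a small model also have similar training dynamics on a large model.

When the target model is large, training a reference model of the same size and extracting a feature representation for every example can be very costly.

The training dynamics of most examples are consistent across model sizes, so clustering the training trajectories of a small model finds groups of examples that also have similar training dynamics on a larger model, even if the trends differ.

Model sizes range from 410M to 2.8B, across architectures (Pythia, Phi).

The goal is to select a subset that optimizes the target model's performance on the full training set,

where θ is the model trained on the subset S and B is the given data budget. In practice the subset is selected using a reference model r with parameters Φ, which produces a representation for each data point. The data selection algorithm uses these representations to build S. Φ is usually the weights of the pretrained or fine-tuned target model, but the representations it produces are not necessarily good.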

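S2L uses the training trajectories of the training examples on a small model to optimize the data selection objective.

Training Trajectory

The representation of each data point in S2L is the sequence of losses recorded while training the reference model.

The whole right-hand side is the fine-tuning loss used in the inner iterations.

Φt are the parameters of the small LM at step t of training on D_train.

P is the conditional probability of generating y given x under parameters Φ.

T is the length of the training trajectory, with t = 1, ..., T.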

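Cluster-based Data Selection

A clustering algorithm groups examples by the similarity of their loss trajectories. This yields a set of clusters, each containing examples with similar loss trajectories over the training epochs:

where L denotes the loss trajectory of example (x,y), LCi is the centroid of the loss trajectories in cluster Ci, and d is the distance metric used for clustering, e.g. Euclidean distance. As Figure 2 shows, the clustering algorithm effectively finds groups of examples with similar training dynamics.

Trajectories within the same cluster

Trajectories from different clusters

With the clusters built, the selection strategy draws examples from the smaller clusters first and then draws an equal number of examples from each of the larger clusters, as detailed in Algorithm 1.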
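The selection step can be sketched as below. This follows my reading of the description above rather than the paper's exact Algorithm 1: clusters are assumed to be given already (the clustering itself is omitted), each cluster gets an equal share of the remaining budget, and a cluster smaller than its share is taken whole so the surplus rolls over to larger clusters. `select_balanced` is a hypothetical name.

```python
import random

def select_balanced(clusters, budget, seed=0):
    """Pick `budget` examples, visiting clusters from smallest to largest."""
    rng = random.Random(seed)
    ordered = sorted(clusters, key=len)
    selected = []
    for k, cluster in enumerate(ordered):
        remaining_clusters = len(ordered) - k
        # Equal share of whatever budget is still unspent.
        share = (budget - len(selected)) // remaining_clusters
        take = min(share, len(cluster))
        selected.extend(rng.sample(cluster, take))
    return selected

# Three clusters of example ids with very different sizes (made-up data).
clusters = [list(range(0, 2)), list(range(10, 40)), list(range(100, 200))]
chosen = select_balanced(clusters, budget=12)
```

With a budget of 12, the two-element cluster is taken whole and the remaining ten slots are split evenly between the other two clusters, so rare trajectory patterns are not drowned out by the largest cluster.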

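Experiments

The clinical text summarization experiment is evaluated with three metrics. The three bars on the horizontal axis compare S2L with random sampling and with fine-tuning on the full dataset; the vertical axis is the metric value, and its origin is the model's performance before fine-tuning.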

BLEU

ROUGE-L

BERTScore

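For S2L, trajectory length has little effect; longer trajectories are slightly better.

When the trajectory is taken (early, middle or late in training) has little effect, while trajectory sparsity has a larger effect (the sparse trajectory spans the whole of training).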

Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality

[2310.06982] Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality

https://github.com/VITA-Group/ProgressiveDD

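Distilling multiple times improves data quality.

In the CIFAR-10 experiment the method uses 5% of the full dataset yet reaches 90% of the full-dataset performance.

It can be seen as a refinement of MTT that distills multiple times.

In the taxonomy it falls under dataset distillation + parameter matching.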

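Method

Ideas

1. Models learn from easy to hard:

Over the training iterations, a model initially learns functions close to linear, and the complexity of the learned function keeps growing in later phases.

This means synthetic examples generated from early training dynamics can only train low-complexity networks, which do well on easy examples that a low-complexity model can separate. Recent work further supports this limitation, observing that deep models benefit most from learning examples of increasing difficulty across training phases.

2. Synthesize data in stages:

Progressive dataset distillation (PDD) uses multiple synthetic subsets to capture the training dynamics of different training phases. Several small sets of synthetic images together train the model.

3. Each distillation stage keeps the previously synthesized samples

When synthesizing Si, S1~Si-1 are kept and generation continues on top of them, which avoids re-capturing redundant information and lets training transition smoothly between subsets. (I am curious how the code carries the earlier subsets into the next stage; to look up on GitHub later.)

4. Later distillation stages discard the easy examples (low forgetting score) in 𝒯

The forgetting score (Toneva et al.) is the number of times an example's prediction flips from correct to incorrect during training. Examples with higher forgetting scores are learned later in training, by higher-complexity functions. Examples with very low forgetting scores, on the other hand, can be classified early in training by lower-complexity functions. At each distillation stage, examples with low forgetting scores are discarded so that distillation focuses on examples of increasing difficulty, as measured by the forgetting score. The paper's experiments confirm that this improves the efficiency of PDD without hurting performance.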
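The definition above translates directly into code. This is a simplified sketch: it only counts correct-to-incorrect flips from a per-epoch correctness history and does not treat never-learned examples specially. `drop_easy` and the threshold are hypothetical.

```python
def forgetting_score(correct_history):
    """Count transitions from correct (True) to incorrect (False)."""
    return sum(
        1
        for prev, cur in zip(correct_history, correct_history[1:])
        if prev and not cur
    )

def drop_easy(histories, threshold):
    """Return indices of examples whose forgetting score is >= threshold."""
    return [
        i for i, h in enumerate(histories)
        if forgetting_score(h) >= threshold
    ]

# One correctness flag per epoch for three made-up examples.
histories = [
    [True, True, True, True],    # learned early, never forgotten
    [False, True, False, True],  # forgotten once
    [True, False, True, False],  # forgotten twice
]
kept = drop_easy(histories, threshold=1)
```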

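PDD framework

PDD consists of multiple distillation stages and transition stages.

1. Distillation stage: synthesize a new set of images conditioned on the images synthesized in the previous stages.

2. Transition stage: train the model on all synthetic images so far; the result is the starting weights of the next distillation stage.

For each stage i = 1, ..., P: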
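The stage loop can be sketched as a skeleton with the actual work injected as callables. Everything here is hypothetical scaffolding: `distill_stage`, `train_on` and `filter_hard` stand in for a base distillation method such as MTT, for ordinary training, and for forgetting-score filtering. Whether the transition stage restarts from the initial weights or continues from the current ones is left to `train_on`; passing the current weights is an assumption of this sketch.

```python
def progressive_distill(real_data, num_stages, init_weights,
                        distill_stage, train_on, filter_hard):
    """Run P distillation stages, each followed by a transition stage."""
    subsets = []            # S_1 ... S_P
    weights = init_weights  # starting weights of the current stage
    data = real_data
    for stage in range(1, num_stages + 1):
        # Distillation stage: synthesize S_i conditioned on S_1..S_{i-1}.
        subsets.append(distill_stage(data, subsets, weights))
        # Transition stage: train on all synthetic images produced so far.
        union = [x for s in subsets for x in s]
        weights = train_on(union, weights)
        # Later stages see only the harder real examples.
        data = filter_hard(data, stage)
    return subsets, weights

# Stub callables that only make the control flow observable.
subsets, final_w = progressive_distill(
    real_data=[1, 2, 3, 4],
    num_stages=3,
    init_weights=0,
    distill_stage=lambda data, prev, w: ["s%d" % (len(prev) + 1)],
    train_on=lambda union, w: w + len(union),
    filter_hard=lambda data, stage: data[1:],
)
```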

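Experiments

Test accuracy of ConvNets after multi-stage training on samples distilled by PDD + MTT over multiple distillation stages on CIFAR-10, CIFAR-100 and Tiny-ImageNet, with the IPC growing at each stage. The red lines mark the performance of training on the full data of CIFAR-10, CIFAR-100 and Tiny-ImageNet respectively.

Left: CIFAR-10; middle: CIFAR-100; right: Tiny-ImageNet.

PDD substantially narrows the gap to training on the full dataset:

On CIFAR-10, 5% of the samples (IPC = 250) reaches 90% of the full-data accuracy; on CIFAR-100, 10% (IPC = 50) reaches 90% of the full-data accuracy.

On Tiny-ImageNet, PDD + MTT with 50 distilled images per class reaches 80% of the performance of training on the full data.

Cross-architecture generalization:

Samples distilled with a ConvNet also improve ResNet performance.

Discarding the easy examples in 𝒯 during distillation versus keeping all examples: discarding reduces the cost of distillation.

All settings use a ConvNet, CIFAR-10, and 10 synthetic images per class.

Distilling from examples selected by forgetting score versus distilling from the full real set: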

Dataset Distillation by Matching Training Trajectories

[2203.11932] Dataset Distillation by Matching Training Trajectories

https://github.com/GeorgeCazenavette/mtt-distillation

Dataset Distillation by Matching Training Trajectories | CVPR 2022 (video on bilibili)

George Cazenavette, MTT

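A key notion: the parameter sequences obtained when training on the full dataset are called "expert trajectories", because they represent the theoretical upper bound of the dataset distillation task: the performance of a network trained on the full real dataset.

Dataset distillation by matching training trajectories:

Long-range parameter matching is performed between training on the distilled synthetic data and training on the real data. Starting from the same initial parameters, the distilled data 𝒟𝗌𝗒𝗇 is optimized so that N training steps on the distilled data land (in parameter space) where a larger number M of steps on the real data would.

At each distillation step, expert parameters θt∗ are first sampled at a random time step from one of the expert trajectories and used to initialize the student parameters θ^t≔θt∗.

After training, the L2 error between the expert parameters M steps later and the student parameters N steps later is computed, so that the parameter trajectory of a model trained on the synthetic data approaches the trajectory on the full dataset. (M>N)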
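The matching loss can be written in a few lines. The sketch below implements the normalized form that appears later in these notes, \(\|\hat{\theta}_{t+N} - \theta^*_{t+M}\|^2 / \|\theta^*_t - \theta^*_{t+M}\|^2\), on parameters flattened into plain lists; the example vectors are made up.

```python
def mtt_loss(student_end, expert_start, expert_end):
    """Normalized squared L2 distance between student and expert parameters.

    student_end : student parameters after N steps on the synthetic data
    expert_start: expert parameters at step t (the shared starting point)
    expert_end  : expert parameters at step t + M
    """
    num = sum((s - e) ** 2 for s, e in zip(student_end, expert_end))
    den = sum((a - e) ** 2 for a, e in zip(expert_start, expert_end))
    return num / den

expert_start = [0.0, 0.0]
expert_end = [1.0, 2.0]
halfway = mtt_loss([0.5, 1.0], expert_start, expert_end)
```

The denominator rescales the loss by how far the expert itself moved, so a student that has not moved at all scores exactly 1 regardless of which part of the trajectory was sampled.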

Talk video

Dataset Distillation

Datasets used in modern deep learning are huge(!)

Huge datasets increase training cost, e.g. longer training time.

There are privacy issues in huge datasets.

Large storage is required.

Can we approximate a huge dataset by a small dataset?

One example: coreset

Coreset selection is prone to relying on heuristics (since the original problem is NP-hard).

It confines itself to the samples existing in the original dataset.

Dataset Distillation

DD synthesizes, from scratch, a small dataset that approximates the large dataset!

For now, the classification task is presumed!

…

T is real dataset, S is synthetic dataset

Goal: a model trained on the synthetic dataset should perform well when tested on the real training data.

Previous works

Goal: generate a synthetic dataset that approximates the original dataset. How can we solve this?

(1) Performance Matching

(2) Gradient Matching

(3) Distribution Matching

Avoids the expensive computation stemming from bi-level optimization.

Performance degrades from SOTA.

To summarize:

Performance matching is short-sighted and difficult to optimize.

Gradient matching has a similar issue. Distribution matching alleviates this, but its performance degrades.

Method

Idea: Match the weights in real data training and synthetic data training!

1. Train n networks on the given data and save the weights after every epoch.

2. Initialize the current network by choosing from the collected weights.

3. Train the current network on the synthetic dataset.

4. Match the weights from step 3 to the succeeding weights in real-data training.

Method (in detail)

Collect T1 ... Tn real-data training trajectories.

Set the current weights θ̂t ← θt by sampling from the collected trajectories, where t < T.

Gradient-descend N steps using Dsyn to obtain θ̂t+N.

Obtain θt+M from the training trajectory and minimize the following:

L2 loss between parameters

L2 loss between parameters with regularization!

1) Initialize the synthetic dataset using a portion of the real data.

2) Sample from the synthetic dataset during the N-step GD. This reduces memory consumption significantly.

3) DiffAugmentation

Notice the learning rate is also learnt!

(DSIR)Data Selection for Language Models via Importance Resampling

https://arxiv.org/abs/2302.03169

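This uses a data-selection approach akin to transfer learning / domain adaptation to improve data quality. The input is a small amount of target text (e.g. body text from web pages).

KL reduction measures how close the selected pretraining data is to the target in feature space; the method samples from the large dataset so that the selection approaches the target data distribution.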
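The resampling idea can be sketched as follows. This is a simplified illustration, not the paper's pipeline: it assumes whitespace-tokenized unigram bag-of-words models with add-one smoothing as the feature distributions, and draws k examples without replacement by adding Gumbel noise to the log importance weights and taking the top k. All names and the toy corpus are hypothetical.

```python
import math
import random
from collections import Counter

def fit_unigram(texts, vocab):
    """Add-one-smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(w for t in texts for w in t.split() if w in vocab)
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def log_importance_weight(text, p_target, p_raw):
    """log p_target(text) - log p_raw(text) under bag-of-words models."""
    return sum(math.log(p_target[w]) - math.log(p_raw[w])
               for w in text.split() if w in p_target)

def resample(raw_texts, target_texts, k, seed=0):
    """Select k raw texts without replacement (Gumbel top-k)."""
    rng = random.Random(seed)
    vocab = {w for t in raw_texts + target_texts for w in t.split()}
    p_target = fit_unigram(target_texts, vocab)
    p_raw = fit_unigram(raw_texts, vocab)
    keyed = []
    for text in raw_texts:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
        gumbel = -math.log(-math.log(u))
        score = log_importance_weight(text, p_target, p_raw) + gumbel
        keyed.append((score, text))
    keyed.sort(reverse=True)
    return [text for _, text in keyed[:k]]

raw = ["protein gene cell", "football score goal", "football goal match"]
target = ["gene protein cell cell"]
picked = resample(raw, target, k=2)
```

Texts whose words are more frequent in the target than in the raw pool get positive log weights and are therefore more likely to be kept.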

KL reduction:

IDC: Dataset condensation with gradient matching

PDD uses MTT and IDC

[2006.05929] Dataset Condensation with Gradient Matching

https://github.com/VICO-UoE/DatasetCondensation

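The method streamlines the process of finding the best set of synthetic images.

The traditional approach is a nested-loop optimization that solves the inner problem for θs(S) at every iteration in order to recover the gradient for S. This requires a computationally expensive procedure: unrolling the recursive computation graph for 𝒮 over multiple optimization steps for 𝜽.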

Please use some formulas to explain the process of adjusting the synthetic set S and compare it with general backpropagation and hyperparameter optimization.

To compute \(\nabla L(S)\) when the update of the synthetic set \(S\) is based on performance matching, the approach focuses on ensuring that the performance of the model using the synthetic set closely matches that of the model trained on the full dataset. Here's a structured breakdown of the computation process:

### 1. Define the Performance Matching Loss Function \(L(S)\)

The loss function \(L(S)\) in the context of performance matching can be defined as follows:

\[ L(S) = \sum_{j=1}^{M} \left( \mathcal{L}(f(S; \theta), y_j) - \mathcal{L}(f(D; \theta), y_j) \right)^2 \]

Where:

- \(f(S; \theta)\) is the model's prediction using the synthetic set \(S\).

- \(f(D; \theta)\) is the model's prediction using the full dataset \(D\).

- \(y_j\) are the true labels corresponding to the data points.

### 2. Compute the Gradient \(\nabla L(S)\)

To compute \(\nabla L(S)\), we utilize the following steps:

#### Step 1: Compute the Gradient of the Loss Function

Using the chain rule, the gradient of the loss function with respect to the synthetic set \(S\) can be expressed as:

\[ \nabla L(S) = \sum_{j=1}^{M} 2 \left( \mathcal{L}(f(S; \theta), y_j) - \mathcal{L}(f(D; \theta), y_j) \right) \cdot \nabla \mathcal{L}(f(S; \theta), y_j) \cdot \nabla f(S; \theta) \]

Where:

- \(\nabla \mathcal{L}(f(S; \theta), y_j)\) is the gradient of the loss with respect to the model's output for the synthetic set.

- \(\nabla f(S; \theta)\) is the gradient of the model's output with respect to the synthetic set \(S\).

#### Step 2: Calculate the Gradient of the Model Output

The gradient \(\nabla f(S; \theta)\) can be computed based on the architecture of the model and its parameters. This typically involves:

\[ \nabla f(S; \theta) = \frac{\partial f(S; \theta)}{\partial S} \]

This calculation depends on the model's architecture (e.g., neural networks, decision trees, etc.) and how the inputs \(S\) affect the outputs.

### 3. Combine the Results

By substituting the computed gradients back into the equation for \(\nabla L(S)\):

\[ \nabla L(S) = \sum_{j=1}^{M} 2 \left( \mathcal{L}(f(S; \theta), y_j) - \mathcal{L}(f(D; \theta), y_j) \right) \cdot \nabla \mathcal{L}(f(S; \theta), y_j) \cdot \nabla f(S; \theta) \]

### 4. Update the Synthetic Set \(S\)

The synthetic set \(S\) is then updated using the computed gradient:

\[ S' = S - \eta \nabla L(S) \]

Where \(\eta\) is a learning rate.

### Summary

To summarize, the computation of \(\nabla L(S)\) involves:

1. Defining a performance matching loss function that measures how well the outputs of the model using \(S\) match those of the full dataset.

2. Computing the gradient of this loss function with respect to the synthetic set \(S\) using the chain rule.

3. Updating the synthetic set based on the computed gradients to iteratively improve the performance matching.

This approach allows for efficient optimization of the synthetic dataset to ensure it serves as a good approximation of the larger dataset's performance.

Please compare the S update process of Training Trajectories matching and performance matching, and then refer to paper -- Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory (https://ar5iv.org/html/2211.10586) --to compare what information needs to be saved in the computational graph of these two methods.

To compare the **Gradient Backpropagation** processes of **Training Trajectories Matching (MTT)** and **Performance Matching**, we need to delve into how each method computes gradients and how this affects their computational graphs.

### Gradient Backpropagation in MTT

1. **Unrolling Updates:**

- MTT involves unrolling the optimization process for \( T \) steps. For each update, the model parameters are adjusted based on the synthetic data:

\[ \hat{\theta}_{t+i} = \hat{\theta}_{t+i-1} - \beta \nabla \ell(\hat{\theta}_{t+i-1}; \tilde{X}_i) \]

2. **Storing Gradients:**

- During backpropagation, gradients must be computed for each of the \( T \) updates. This means that the computational graph must retain all intermediate states and gradients for every step in the trajectory:

- For each \( i \) from 1 to \( T \), the gradients are calculated and stored:

\[ \nabla \ell(\hat{\theta}_{t+i}; \tilde{X}_i) \]

- This leads to a large computational graph that encompasses the entire trajectory.

3. **Loss Calculation:**

- The final loss is computed based on the last model parameter and requires gradients from all previous updates to evaluate how well the synthetic training matches the performance of the model trained on real data:

\[ \mathcal{L} = \frac{\|\hat{\theta}_{t+T} - \theta_{t+M}^*\|^2_2}{\|\theta_{t}^* - \theta_{t+M}^*\|^2_2} \]

4. **Memory Complexity:**

- The memory usage is \( O(T) \) because all gradients from each step must be preserved. This can quickly become prohibitive for large \( T \) and complex datasets like ImageNet-1K.

### Gradient Backpropagation in Performance Matching

1. **Unrolled Inner Loop:**

- Performance Matching also trains the model on the synthetic data for \( T \) inner steps, and every step stays in the computational graph:

\[ \theta_{i} = \theta_{i-1} - \beta \nabla \ell(\theta_{i-1}; \tilde{X}) \]

2. **Outer Loss:**

- The outer loss is the loss of the trained model on real data, not a distance to expert parameters:

\[ \mathcal{L} = \ell(\theta_{T}; X_{\text{real}}) \]

3. **Graph Structure:**

- The loss is backpropagated through all \( T \) unrolled updates to reach the synthetic data, so the graph has the same chain structure as in MTT; only the final node differs.

4. **Memory Complexity:**

- Memory usage is likewise \( O(T) \): GPU memory grows in proportion to the number of inner-loop steps, which limits how long the inner loop can be trained.

### Summary of Differences in Computational Graphs

- **MTT:**

- **Complexity:** High complexity due to the need to store gradients for every trajectory step.

- **Memory Usage:** \( O(T) \), which can lead to significant memory overhead, especially with large \( T \).

- **Graph Structure:** A large and interconnected graph that captures the evolution of the model parameters across all updates.

- **Performance Matching:**

- **Complexity:** Comparable to MTT, since the inner loop is unrolled in the same way.

- **Memory Usage:** \( O(T) \) as well, because every inner step is retained for backpropagation.

- **Graph Structure:** The same chain of unrolled updates, ending in a real-data loss instead of a parameter distance.

### Conclusion

Both methods backpropagate through an unrolled training trajectory, so both keep all intermediate states and their memory grows with the number of unrolled steps. They differ in the outer objective: MTT matches the final student parameters to an expert checkpoint, while Performance Matching evaluates the trained model on real data. Removing the dependence of memory on \( T \), as TESLA does for trajectory matching, is what allows scaling to datasets such as ImageNet-1K.


References:

Kegan GG Samuel and Marshall F Tappen. Learning optimized map estimates in continuously-valued mrf models. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 477–484. IEEE, 2009. (The derivation in this paper is somewhat clearer.)

Learning optimized MAP estimates in continuously-valued MRF models | IEEE Conference Publication | IEEE Xplore

http://vigir.missouri.edu/~gdesouza/Research/Conference_CDs/IEEE_CVPR_2009/data/papers/0453.pdf

In the hyper-parameter learning work, the authors were able to use implicit differentiation to compute the gradient of a loss function with respect to hyper-parameters.


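Reference on hyperparameter optimization:

https://www.youtube.com/watch?v=KuW6jznDzDI

It is generally a bi-level optimization problem as well.

The inner level fixes λ and solves for the optimal θ; the outer level fixes θ and solves for the optimal λ.

Where the gradient of l with respect to λ is available, λ can be optimized along the negative gradient direction.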
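Written out, the bi-level problem has the following generic form. The symbols \(\mathcal{L}_{\text{train}}\), \(\mathcal{L}_{\text{val}}\) and the step size \(\alpha\) are notation introduced here for illustration, not taken from the referenced video:

```latex
\lambda^{*} = \arg\min_{\lambda}\ \mathcal{L}_{\text{val}}\big(\theta^{*}(\lambda)\big)
\quad \text{s.t.} \quad
\theta^{*}(\lambda) = \arg\min_{\theta}\ \mathcal{L}_{\text{train}}(\theta, \lambda),
\qquad
\lambda \leftarrow \lambda - \alpha \, \nabla_{\lambda} \mathcal{L}_{\text{val}}\big(\theta^{*}(\lambda)\big)
```

Dataset distillation by performance matching has the same shape if the synthetic set 𝒮 plays the role of λ.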

Justin Domke. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pp. 318–326, 2012.

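The methods referred to here are back-optimization methods, applied at the time to image denoising and hyperparameter optimization. They compute the gradient of the loss with respect to the hyperparameters by differentiating through a truncated run of the optimizer, as an alternative to implicit differentiation.

Example (where N is the number of optimization iterations):

The objective after the IDC simplification is: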

DSA

B. Zhao and H. Bilen, “Dataset condensation with differentiable siamese augmentation,” in International Conference on Machine Learning. PMLR, 2021, pp. 12 674–12 685.

[2102.08259] Dataset Condensation with Differentiable Siamese Augmentation

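The general DD / DC problem:

The average loss over samples x is computed for the networks trained on 𝒯 and on 𝒮, where ϕ denotes a neural network with parameters θ.

The general idea is to make the parameters trained on 𝒮 close to the parameters trained on 𝒯. But obtaining 𝒮 involves a nested-loop optimization over the network parameters 𝜽 and the synthetic data 𝒮, which usually does not scale to large models and multi-step optimization.

Therefore, it is assumed that similar parameters can be obtained at every iteration, and the parameters trained on 𝒯 are used as the starting point of each iteration. Learning is simplified by using a single neural network with parameters 𝜽.

D is the sum, over every output node of every layer, of the cosine distances between the two weight gradients associated with that node.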
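The layer-wise distance can be sketched on plain nested lists. Assumptions: each layer's weight gradient is given as a list of rows, one row per output node, and no row has zero norm (a zero row would divide by zero here).

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def layerwise_distance(grads_syn, grads_real):
    """Sum of cosine distances over every output node of every layer."""
    return sum(
        cosine_distance(row_s, row_r)
        for layer_s, layer_r in zip(grads_syn, grads_real)
        for row_s, row_r in zip(layer_s, layer_r)
    )

# Two layers: a 2x2 weight gradient and a 1x2 weight gradient (made-up numbers).
grads = [[[1.0, 0.0], [0.0, 2.0]], [[1.0, 1.0]]]
scaled = [[[3.0 * v for v in row] for row in layer] for layer in grads]
flipped = [[[-v for v in row] for row in layer] for layer in grads]
```

Because each row is normalized separately, rescaling a gradient leaves the distance unchanged, while flipping its sign gives the maximum distance of 2 per row.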

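"Siamese" refers to a twin (Siamese) structure, applied here to the data augmentation strategy called "Differentiable Siamese Augmentation (DSA)".

In DSA, the Siamese structure means that the same randomly sampled transformation is applied to the sampled real data and to the synthetic data (e.g. in each training iteration, the same parameterized rotation, crop, etc. is applied to all data points in the real batch and in the synthetic batch). Sharing the transformation lets the synthetic images pick up prior knowledge from the real images (such as the usual horizontal position of objects), makes better use of the information in the real training images, and distills it into the synthetic images in a more ordered way instead of averaging it out, which improves the quality of the synthetic data and the resulting network training.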
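The shared-parameter idea can be shown with a toy example. This only illustrates the "Siamese" part: the transforms below (circular shift and horizontal flip on nested lists) are discrete stand-ins chosen for brevity and are not differentiable, unlike the augmentations DSA actually uses. All names are hypothetical.

```python
import random

def shift(image, dx):
    """Circularly shift each row of a 2-D image (list of lists) right by dx."""
    return [row[-dx:] + row[:-dx] for row in image]

def siamese_augment(real_batch, syn_batch, rng):
    """Sample one set of transform parameters and apply it to both batches."""
    dx = rng.randint(0, 2)
    flip = rng.random() < 0.5

    def apply(img):
        out = shift(img, dx)
        return [row[::-1] for row in out] if flip else out

    return [apply(x) for x in real_batch], [apply(x) for x in syn_batch]

img = [[1, 2, 3], [4, 5, 6]]
real_out, syn_out = siamese_augment([img], [img], random.Random(0))
```

Since both batches receive identical parameters, feeding the same image to both sides always yields the same output; independent augmentation would not guarantee this.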

Data augmentation is applied while the synthetic images are being learned.