Original paper: https://arxiv.org/pdf/2410.10076
VideoAgent: Self-Improving Video Generation
Achint Soni$^{1*}$ Sreyas Venkataraman$^{2*}$ Abhranil Chandra$^{1*}$
Sebastian Fischmeister$^{1}$ Percy Liang$^{4}$ Bo Dai$^{3,5}$ Sherry Yang$^{3,4,6}$
$^{1}$University of Waterloo $^{2}$IIT Kharagpur $^{3}$Google Deepmind
$^{4}$Stanford University $^{5}$Georgia Institute of Technology $^{6}$New York University
{achint.s046,sreyasv2002,abhra21c}@gmail.com, sherryy@google.com
ABSTRACT
Video generation has been used to generate visual plans for controlling robotic systems. Given an image observation and a language instruction, previous work has generated video plans which are then converted to robot controls to be executed. However, a major bottleneck in leveraging video generation for control lies in the quality of the generated videos, which often suffer from hallucinatory content and unrealistic physics, resulting in low task success when control actions are extracted from the generated videos. While scaling up dataset and model size provides a partial solution, integrating external feedback is both natural and essential for grounding video generation in the real world. With this observation, we propose VideoAgent for self-improving generated video plans based on external feedback. Instead of directly executing the generated video plan, VideoAgent first refines the generated video plans using a novel procedure which we call self-conditioning consistency, utilizing feedback from a pretrained vision-language model (VLM). As the refined video plan is being executed, VideoAgent collects additional data from the environment to further improve video plan generation. Experiments in simulated robotic manipulation from MetaWorld and iTHOR show that VideoAgent drastically reduces hallucination, thereby boosting the success rate of downstream manipulation tasks. We further illustrate that VideoAgent can effectively refine real-robot videos, providing an early indicator that robotics can be an effective tool in grounding video generation in the physical world. Video demos can be found at https://video-as-agent.github.io.$^{1}$
1 INTRODUCTION
Large text-to-video models pretrained on internet-scale data have broad applications such as generating creative video content (Ho et al., 2022; Hong et al., 2022; Singer et al., 2022) and creating novel games (Bruce et al., 2024), animations (Wang et al., 2019), and movies (Zhu et al., 2023). Furthermore, recent work shows that video generation can serve as a simulator of the real world (Yang et al., 2023b; Brooks et al., 2024), as well as a policy with a unified observation and action space (Du et al., 2024; Ko et al., 2023; Du et al., 2023). These recent applications of text-to-video generation models hold the great promise of internet-scale knowledge transfer (e.g., from generating human videos to generating robot videos), as well as paving the way to generalist agents (e.g., a single policy that can control multiple robots with different morphologies in different environments to perform diverse tasks).
Nevertheless, text-to-video models have only had limited success in downstream applications in reality. For instance, in video generation as policy (Du et al., 2024; Ko et al., 2023), when an observation image and a language instruction are given to a video generation model, generated videos often hallucinate (e.g., objects randomly appear or disappear) or violate physical laws (e.g., a robot hand going through an object) (Yang et al., 2023b; Brooks et al., 2024). Such hallucinations and unrealistic physics have led to low task success rates when generated videos are converted to control actions through inverse dynamics models, goal-conditioned policies, or other action extraction mechanisms (Wen et al., 2023; Yang et al., 2024; Ajay et al., 2024).
*Equal contribution.
$^{1}$Code available at https://github.com/Video-as-Agent/VideoAgent
Figure 1: The VideoAgent framework. VideoAgent first generates a video plan conditioned on an image observation and a task description, similar to (Du et al., 2023), and then (1) iteratively refines the video plan using feedback from a vision language model (VLM), (2) uses the VLM to select the best refined video plan to convert to control actions through optical flow, and (3) executes the control actions in an environment and improves video generation using real-world feedback and additional data collected online.
While scaling up dataset and model size can be effective in reducing hallucination in large language models (LLMs) (Hoffmann et al., 2022), scaling is more difficult in video generation models. This is partially because language labels for videos are labor intensive to curate. Moreover, video generation has not converged to an architecture that is more favourable to scaling (Yang et al., 2024). Scaling aside, the ability to incorporate external feedback to improve generation is another of the most important breakthroughs in LLMs (Ouyang et al., 2022b). It is therefore natural to wonder what kind of feedback is available for text-to-video models, and how we can incorporate such feedback to further improve the quality of the generated videos.
To answer this question, we explore two types of feedback that are natural to acquire for video generation models, namely AI feedback from a vision-language model (VLM) and real-world execution feedback when generated videos are converted to motor controls. To utilize this feedback for self-improvement, we propose VideoAgent. Different from video generation as policy, which directly turns a generated video into control actions (Du et al., 2023; Ko et al., 2023), VideoAgent is trained to refine a generated video plan iteratively using feedback from a pretrained VLM. During inference, VideoAgent queries the VLM to select the best refined video plan, followed by execution of the plan in the environment. During online execution, VideoAgent observes whether the task was successfully completed and further improves the video generation model based on the execution feedback from the environment and additional data collected from the environment. The improvement to the generated video plan is twofold: First, we propose self-conditioning consistency for video diffusion models, inspired by consistency models (Song et al., 2023; Heek et al., 2024), which enables low-quality samples from a video diffusion model to be further refined into high-quality samples. Second, when online access to the environment is available, VideoAgent executes the current video policy and collects additional successful trajectories to further finetune the video generation model on the successful trajectories. A visual illustration of VideoAgent is shown in Figure 1.
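To make the three stages concrete, the sketch below shows one way this loop could be organized in code. All class and method names (`video_model.generate/refine/finetune`, `vlm.critique/select_best`, `env.execute`, the `extract_actions` callable) are hypothetical placeholders rather than the released implementation.

```python
from typing import Callable, Sequence, Tuple

def video_agent_episode(x0, task: str, video_model, vlm, env,
                        extract_actions: Callable,
                        num_candidates: int = 4,
                        max_refinements: int = 3) -> Tuple[list, bool]:
    """Generate and refine video plans, pick the best one, and execute it."""
    candidates = []
    for _ in range(num_candidates):
        plan = video_model.generate(x0, task)             # initial video plan
        for _ in range(max_refinements):                  # (1) VLM-guided refinement
            if vlm.critique(plan, task).task_completed:
                break
            plan = video_model.refine(plan, x0, task)     # self-conditioning consistency
        candidates.append(plan)
    best_plan = vlm.select_best(candidates, task)         # (2) pick the best refined plan
    actions = extract_actions(best_plan)                  # e.g., via optical flow
    trajectory, success = env.execute(actions)            # (3) execute in the environment
    return trajectory, success

def online_self_improvement(video_model, vlm, env, tasks: Sequence,
                            extract_actions: Callable, num_rounds: int = 5) -> None:
    """Collect successful executions online and finetune the video model on them."""
    for _ in range(num_rounds):
        successes = []
        for x0, task in tasks:
            traj, ok = video_agent_episode(x0, task, video_model, vlm, env, extract_actions)
            if ok:
                successes.append((traj, task))
        if successes:
            video_model.finetune(successes)
```

The two functions mirror the inference-time refinement and selection stage and the online self-improvement stage, respectively.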
We first evaluate the performance of VideoAgent in two simulated robotics manipulation environments, Meta-World (Yu et al., 2020) and iTHOR (Kolve et al., 2017), and show that VideoAgent improves task success across all environments and tasks evaluated. VideoAgent can even improve the success rate of difficult tasks by as much as $4\times$. Next, we provide a thorough study on the effect of different components in VideoAgent, including different ways to prompt the VLM and different types of feedback from the VLM, providing a recipe for utilizing VLM feedback for video generation. Lastly, we illustrate that VideoAgent can iteratively improve real-robot videos, providing an early signal that robotics can be an important means to ground video generation models in the real world.
2 BACKGROUND
In this section, we provide the background on video generation as policy in a decision making process (Du et al., 2023). We also introduce consistent diffusion models (Song et al., 2023; Heek et al., 2024; Daras et al., 2024), which VideoAgent builds upon for self-refinement.
2.1 VIDEO AS POLICY IN SEQUENTIAL DECISION MAKING
We consider a predictive decision process similar to (Du et al., 2024): $\mathcal{P} := \langle \mathcal{X}, \mathcal{G}, \mathcal{A}, H, \mathcal{E}, \mathcal{R} \rangle$, where $\mathcal{X}$ denotes an image-based observation space, $\mathcal{G}$ denotes a textual task description space, $\mathcal{A}$ denotes a low-level motor control action space, and $H \in \mathbb{R}$ denotes the horizon length. We denote $\pi(\cdot \mid x_0, g): \mathcal{X} \times \mathcal{G} \mapsto \Delta(\mathcal{X}^{H})$ $^{2}$ as the language-conditioned video generation policy, which models the probability distribution over $H$-step image sequences $\mathbf{x} = [x_0, \ldots, x_H]$ determined by the first frame $x_0$ and the task description $g$. Intuitively, $\mathbf{x} \sim \pi(\cdot \mid x_0, g)$ corresponds to possible visual paths for completing a task $g$. Given a sampled video plan $\mathbf{x}$, one can use a learned mapping $\rho(\cdot \mid \mathbf{x}): \mathcal{X}^{H} \mapsto \Delta(\mathcal{A}^{H})$ to extract motor controls from generated videos through a goal-conditioned policy (Du et al., 2023), diffusion policy (Black et al., 2023), or dense correspondence (Ko et al., 2023). Once a sequence of motor controls $\mathbf{a} \in \mathcal{A}^{H}$ is extracted from the video, it is sequentially executed in the environment $\mathcal{E}$, after which a final reward $\mathcal{R}: \mathcal{A}^{H} \mapsto \{0, 1\}$ is emitted representing whether the task was successfully completed. For simplicity, we only consider finite-horizon, episodic tasks. Given a previously collected dataset of videos labeled with task descriptions $\mathcal{D} = \{(\mathbf{x}, g)\}$, one can leverage behavioral cloning (BC) (Pomerleau, 1988) to learn $\pi$ by minimizing
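an objective of roughly the following form (the exact expression of Equation 1 is assumed here, consistent with the maximum-likelihood reading below):

$$\min_{\pi}\; \mathbb{E}_{(\mathbf{x},\, g) \sim \mathcal{D}}\big[ -\log \pi(\mathbf{x} \mid x_0, g) \big].$$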
Equation 1 can be viewed as maximizing the likelihood of the videos in $\mathcal{D}$ conditioned on the initial frame and task description.
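As a concrete illustration of the decision process $\mathcal{P}$, the sketch below wires a video-as-policy rollout together; `VideoPolicy`, `ActionExtractor`, and `Env` are hypothetical interfaces standing in for $\pi$, $\rho$, and $\mathcal{E}$.

```python
from typing import List, Protocol, Tuple
import numpy as np

class VideoPolicy(Protocol):
    def sample(self, x0: np.ndarray, g: str) -> List[np.ndarray]:
        """pi(. | x0, g): sample an H-step image sequence (a video plan)."""

class ActionExtractor(Protocol):
    def extract(self, video: List[np.ndarray]) -> List[np.ndarray]:
        """rho(. | x): map a video plan to a sequence of motor controls."""

class Env(Protocol):
    def step(self, action: np.ndarray) -> np.ndarray: ...
    def success(self) -> bool:
        """Final reward R in {0, 1}: whether the task was completed."""

def rollout(pi: VideoPolicy, rho: ActionExtractor, env: Env,
            x0: np.ndarray, g: str) -> Tuple[List[np.ndarray], bool]:
    video_plan = pi.sample(x0, g)       # a visual path for completing task g
    actions = rho.extract(video_plan)   # motor controls a in A^H
    for a in actions:                   # sequential execution in E
        env.step(a)
    return video_plan, env.success()
```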
2.2 CONSISTENCY MODELS
Diffusion models (Ho et al., 2020; Song et al., 2020b) have emerged as an important technique for data distribution modeling. During training, the model learns to map noisy data (at various noise levels) back to clean data in a single step. Concretely, let $x^{(0)}$ denote a clean image and $x^{(t)}$ denote the noisy image at noise level $t$, where $t \in [0, T]$; the training objective for a diffusion model $f_{\theta}(x^{(t)}, t)$ can be written as
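the following sketch, assuming the standard clean-data prediction parameterization:

$$\mathcal{L}_{\text{diffusion}}(\theta) = \mathbb{E}_{x^{(0)},\, \epsilon,\, t}\Big[ \big\| f_{\theta}\big(x^{(t)}, t\big) - x^{(0)} \big\|^{2} \Big],$$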
where $\epsilon \sim \mathcal{N}(0, I)$ is the added noise, and $x^{(t)} = \sqrt{\alpha_t}\, x^{(0)} + \sqrt{1 - \alpha_t}\, \epsilon$, where $\alpha_t$ are time-dependent noise levels. Although diffusion models have achieved high-quality image/video generation, they require hundreds or thousands of denoising steps during inference, which induces tremendous computational cost. To overcome the slow sampling speed of diffusion models, consistency models (Song et al., 2023; Song & Dhariwal, 2023) were initially proposed by enforcing a consistency loss across different noise levels, i.e.,
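roughly of the following form (the precise weighting and stop-gradient conventions are assumptions here):

$$\mathcal{L}_{\text{consistency}}(\theta) = \mathbb{E}_{x^{(0)},\, t,\, t'}\Big[ \big\| f_{\theta}\big(x^{(t)}, t\big) - f_{\theta}\big(x^{(t')}, t'\big) \big\|^{2} \Big],$$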
which encourages the output of the single-step map between different noise levels to be similar. In fact, both the diffusion loss in Equation 2 and the consistency loss in Equation 3 can be understood as exploiting the structure of the denoising procedure which corresponds to an ordinary differential equation (ODE). Specifically, as introduced in (Song et al., 2023; 2020a), the backward denoising procedure of a diffusion model can be characterized by an ODE, i.e.,
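plausibly of the following probability-flow form (the parameterization below, with noise scale $t$, is an assumption):

$$\frac{\mathrm{d} x^{(t)}}{\mathrm{d} t} = -t\, s\big(x^{(t)}, t\big),$$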
where $s(x^{(t)}, t)$ is some score function. Along the entire path $t \in (\epsilon, \infty]$, following this ODE should always map $x^{(t)}$ to $x^{(0)}$. If we parametrize the model $f(x^{(t)}, t)$ as the simulation following the ODE governed by $s(x^{(t)}, t)$, we obtain the diffusion loss (2). Meanwhile, for all $t, t' \in (\epsilon, \infty]$, we have $f(x^{(t)}, t) = f(x^{(t')}, t')$ along the simulation path, which induces the consistency loss (3). Therefore, we can combine the diffusion loss and the consistency loss for model training, i.e.,
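plausibly as the weighted sum (the exact combination is assumed here)

$$\mathcal{L}(\theta) = \mathcal{L}_{\text{diffusion}}(\theta) + \lambda\, \mathcal{L}_{\text{consistency}}(\theta),$$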
where $\lambda$ denotes the consistency regularization hyperparameter across different noise levels.
3 VIDEO GENERATION AS AGENT
In this section, we introduce VideoAgent to improve video plan generation. In Section 3.1, we establish a new notion of consistency for video diffusion models, which we call self-conditioning consistency. In Section 3.2, we discuss how the video diffusion model trained with self-conditioning consistency can be used to refine generated video plans during inference. In Section 3.3, we discuss how VideoAgent further closes the self-improvement loop by collecting additional online data to further train the video generation and refinement model.
$^{2}$We use $\Delta(\cdot)$ to denote a probability simplex function.
3.1 VIDEO REFINEMENT THROUGH SELF-CONDITIONING CONSISTENCY
We first consider first-frame-and-language conditioned video generation following (Du et al., 2023; Ko et al., 2023), which finds a sequence of image frames for completing the task described by the language starting from the initial image. When a sample is drawn from a video generation model, it is commonly found that a part of the generated video (e.g., the beginning) is realistic while another part of the video (e.g., the end) hallucinates (Yang et al., 2023b). In other words, while a generated video plan may not fully complete the task specified, it provides meaningful information that can be further improved to achieve the correct plan. To leverage such partial progress, we consider a video consistency model to condition on previous self-generated samples to diffuse for the ground truth video, so that the model can learn to preserve the realistic part of the video while refining the hallucinatory part. Specifically, let $\mathbf{x}^{(0)}$ be a ground truth video, and $\widehat{\mathbf{x}}$ be a generated video sample from the diffusion model. We define a self-conditioning consistency model as $\widehat{f}_{\theta}(\widehat{\mathbf{x}}, \mathbf{x}^{(t)}, t)$, which takes a generated video $\widehat{\mathbf{x}}$ and a noisy version of the ground truth video $\mathbf{x}^{(t)}$ as input and predicts the clean video.
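The sketch below illustrates how such a refinement model could be used iteratively at inference time; the `generate` and `refine` methods and the fixed refinement budget are hypothetical stand-ins for the actual interfaces.

```python
import torch

@torch.no_grad()
def iterative_refinement(model, x0_frame: torch.Tensor, task: str,
                         num_iterations: int = 3) -> torch.Tensor:
    """Generate an initial video plan, then repeatedly refine it by conditioning
    the denoiser on the previously generated sample."""
    # First guess: plain first-frame-and-language conditioned generation, f_theta.
    plan = model.generate(x0_frame, task)
    for _ in range(num_iterations):
        # Self-conditioning refinement, f_hat_theta(prev_plan, noisy video, t):
        # keep the realistic parts of `plan` while re-denoising the hallucinated parts.
        plan = model.refine(prev_sample=plan, x0_frame=x0_frame, task=task)
    return plan
```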
We make the observation that self-conditioning can be understood as a reparametrization of the implicit ODE solver for Equation 4 (Song et al., 2020a; Lu et al., 2022; Zhang & Chen, 2022; Chen et al., 2022). For example, Song et al. (2020a) considered the 1st order ODE solver for Equation 4 following
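a deterministic (DDIM-style) update, plausibly of the form (the exact expression is assumed here)

$$x^{(t')} = \sqrt{\alpha_{t'}}\, f\big(x^{(t)}, t\big) + \sqrt{1 - \alpha_{t'}}\;\frac{x^{(t)} - \sqrt{\alpha_{t}}\, f\big(x^{(t)}, t\big)}{\sqrt{1 - \alpha_{t}}},$$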
Figure 2: An illustration of the Self-Conditioning Consistency model. Top row: denoising steps. Bottom row: refinement iterations. $\widehat{\mathbf{x}}_{(i+1)}$ denotes the generated video plan at refinement iteration $(i+1)$. $\Delta$ denotes a fixed number of denoising steps. We condition the refinement on the generated video from the previous iteration, $\widehat{\mathbf{x}}_{(i)}$.
where $\alpha_t$ follows the noise scheduling in Equation 2. Higher-order ODE solvers have also been considered in (Lu et al., 2022), all of which depend on the previous sample $\widehat{x}$. Based on this observation, we introduce a previous video sample $\widehat{\mathbf{x}}$ into our parametrization of $\widehat{f}_{\theta}(\widehat{\mathbf{x}}, \mathbf{x}^{(t)}, t)$ to mimic these ODE solvers. We emphasize that (Song et al., 2020a; Lu et al., 2022; Zhang & Chen, 2022) still follow the same parametrization for the score function $s(x^{(t)}, t)$, which is learned by the vanilla diffusion loss. The self-conditioning ODE solver in Equation 6 is only exploited for accelerating generation. A visual illustration of self-conditioning is shown in Figure 2.
We can learn the ODE solver through self-conditioning consistency by directly predicting the clean video $\mathbf{x}^{(0)}$ with a self-conditioning consistency loss,
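a plausible form of which (assumed here, consistent with the description that follows) is

$$\widehat{\mathcal{L}}_{\text{consistency}}(\theta) = \mathbb{E}\Big[ \big\| \widehat{f}_{\theta}\big(\widehat{\mathbf{x}}_{1}, \mathbf{x}^{(t)}, t\big) - \mathbf{x}^{(0)} \big\|^{2} + \lambda\, \big\| \widehat{f}_{\theta}\big(\widehat{\mathbf{x}}_{1}, \mathbf{x}^{(t)}, t\big) - \widehat{f}_{\theta}\big(\widehat{\mathbf{x}}_{2}, \mathbf{x}^{(t)}, t\big) \big\|^{2} \Big],$$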
where $\widehat{\mathbf{x}}_1, \widehat{\mathbf{x}}_2$ are two independent samples from the first-frame-and-language conditioned video generation model, and $\lambda$ is a hyperparameter for regularizing the similarity between different samples. To enable the "first guess" for $\widehat{\mathbf{x}}$, we consider $f_{\theta}(\mathbf{x}^{(t)}, t)$, which is still learned by the vanilla objective for video diffusion as
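in Equation 2, now applied to full videos (the exact form is assumed here):

$$\mathcal{L}_{\text{diffusion}}(\theta) = \mathbb{E}_{\mathbf{x}^{(0)},\, \epsilon,\, t}\Big[ \big\| f_{\theta}\big(\mathbf{x}^{(t)}, t\big) - \mathbf{x}^{(0)} \big\|^{2} \Big].$$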
The overall objective for training a self-conditioning-consistent video diffusion model boils down to
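a combination of the two losses above, plausibly (the exact combination is an assumption)

$$\mathcal{L}_{\text{total}}(\theta) = \mathcal{L}_{\text{diffusion}}(\theta) + \widehat{\mathcal{L}}_{\text{consistency}}(\theta).$$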
Note that despite the video generation model $f_{\theta}$ and the video refinement model $\widehat{f}_{\theta}$ having different input arguments, we can nevertheless share parameters between $f_{\theta}$ and $\widehat{f}_{\theta}$ to train a single model for video generation and video refinement. We describe the training process for $f_{\theta}$ and $\widehat{f}_{\theta}$ in Algorithm 1.
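For concreteness, the sketch below shows one possible shape of such a joint training step with a single parameter-shared denoiser. The `denoiser` and `sample_video` interfaces, the placeholder noise schedule, and the way the two losses are combined are assumptions, not the paper's Algorithm 1 verbatim.

```python
import torch
import torch.nn.functional as F

def joint_training_step(denoiser, sample_video, batch, optimizer, lam: float = 1.0):
    """One parameter-shared update for generation (f_theta) and refinement (f_hat_theta).

    `denoiser(x_t, t, cond, prev)` is a single network: prev=None recovers the vanilla
    video diffusion model f_theta, while a previously generated sample as `prev` gives
    the self-conditioning refinement model f_hat_theta. `sample_video` draws a full
    video plan from the current model.
    """
    x0, first_frame, task = batch                          # ground-truth video + conditioning
    t = torch.rand(x0.shape[0], device=x0.device)          # random noise levels in [0, 1]
    alpha = (1.0 - t).clamp(min=1e-3)                      # placeholder noise schedule alpha_t
    alpha = alpha.view(-1, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = alpha.sqrt() * x0 + (1.0 - alpha).sqrt() * eps   # noisy video x^(t)

    # Vanilla video diffusion loss for the "first guess" model f_theta.
    pred_plain = denoiser(x_t, t, cond=(first_frame, task), prev=None)
    loss_diffusion = F.mse_loss(pred_plain, x0)

    # Two independent generated samples to condition the refinement model on.
    with torch.no_grad():
        x_hat_1 = sample_video(denoiser, first_frame, task)
        x_hat_2 = sample_video(denoiser, first_frame, task)

    # Self-conditioning consistency loss for the refinement model f_hat_theta.
    pred_1 = denoiser(x_t, t, cond=(first_frame, task), prev=x_hat_1)
    pred_2 = denoiser(x_t, t, cond=(first_frame, task), prev=x_hat_2)
    loss_refine = F.mse_loss(pred_1, x0) + lam * F.mse_loss(pred_1, pred_2)

    loss = loss_diffusion + loss_refine
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```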