Published: arXiv, Oct 2024
Paper link: ReadPaper
Affiliation: Tsinghua University
Motivation: Bimanual manipulation is essential in robotics, yet developing foundation models for it is extremely challenging due to the inherent complexity of coordinating two robot arms (which leads to multi-modal action distributions) and the scarcity of training data. (This work, too, approaches the problem from the angle of data scarcity.)
Approach: RDT builds on diffusion models, which effectively represent multi-modality, with an innovatively designed scalable Transformer that handles the heterogeneity of multi-modal inputs and captures the nonlinearity and high frequency of robot data. To address data scarcity, the authors further introduce a physically interpretable unified action space, which unifies the action representations of various robots while preserving the physical meaning of the original actions, facilitating the learning of transferable physical knowledge.
How is the unified action space defined, and how does it facilitate learning transferable physical knowledge?
It is a 256-dimensional vector. (To enable training RDT on heterogeneous data, the authors propose the physically interpretable unified action space, a unified action format for various robots with gripper arms. This format mitigates potential conflicts between different robots while preserving the physical meaning of the original actions, which helps the model learn generalizable physical knowledge across diverse robot datasets.)
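To illustrate how a fixed-layout unified vector avoids conflicts across robots, here is a minimal sketch. The slot names, slot indices, and the `embed_action` helper are hypothetical illustrations, not RDT's actual specification:

```python
import numpy as np

# Hypothetical slot layout: the idea is that each physical quantity has a
# fixed, physically meaningful position in the 256-d unified vector.
UNIFIED_DIM = 256
SLOTS = {
    "right_arm_joint_pos": slice(0, 7),      # assumed 7-DoF arm
    "right_gripper_width": slice(7, 8),
    "left_arm_joint_pos":  slice(128, 135),
    "left_gripper_width":  slice(135, 136),
}

def embed_action(robot_action: dict) -> tuple[np.ndarray, np.ndarray]:
    """Map a robot-specific action dict into the unified vector.

    Returns the padded vector and a mask marking which entries are valid,
    so slots a robot does not have never conflict with other robots' data.
    """
    vec = np.zeros(UNIFIED_DIM)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    for name, values in robot_action.items():
        s = SLOTS[name]
        vec[s] = values
        mask[s] = True
    return vec, mask
```

A single-arm robot would fill only its own slots and leave the rest masked out, while a bimanual robot fills both halves; either way the same entry always carries the same physical meaning.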
At the time of publication, it was the largest diffusion-based foundation model for robotic manipulation (1.2B parameters).
Highlights:
- zero-shot generalization to unseen objects and scenes
- understands and follows language instructions
- learns new skills with just 1~5 demonstrations (few-shot)
Implementation: RDT employs Diffusion Transformers (DiTs).
For expressiveness, RDT leverages diffusion models' capacity to model complex distributions, excelling at capturing the full modes of bimanual actions from massive data.
For scalability, it uses a Transformer backbone with carefully designed multi-modal encodings that eliminate the heterogeneity of the various modalities.
Model inputs:
- Low-dimensional inputs are low-dimensional vectors that represent physical quantities of the robot, including proprioception, action chunks, and control frequency. They are encoded with an MLP with Fourier features, which effectively captures high-frequency changes in low-dimensional spaces.
- Image inputs are high-dimensional and contain rich spatial and semantic information. To extract compact representations, an image-text-aligned pre-trained vision encoder, SigLIP, is used; its weights are fixed during training to save GPU memory.
- Language inputs are of varying length and highly abstract, posing integration challenges due to their complexity and ambiguity. They are encoded with a pre-trained Transformer-based language model, T5-XXL, whose weights are likewise fixed during training to save GPU memory.
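The Fourier-feature trick for the low-dimensional inputs can be sketched as follows. This is a minimal numpy illustration of the general idea (lift each scalar with sin/cos at multiple frequencies before the MLP), not RDT's exact encoder; `num_freqs` and the MLP shapes are assumptions:

```python
import numpy as np

def fourier_features(x: np.ndarray, num_freqs: int = 8) -> np.ndarray:
    """Lift a low-dim vector with sin/cos features at geometric frequencies.

    A plain MLP struggles to fit high-frequency variation of low-dim inputs;
    the lifted features make such variation linearly accessible.
    """
    freqs = 2.0 ** np.arange(num_freqs)            # 1, 2, 4, ...
    angles = x[..., None] * freqs                  # (..., d, num_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*x.shape[:-1], -1)        # (..., d * 2 * num_freqs)

def mlp_encode(x, w1, b1, w2, b2):
    """Two-layer ReLU MLP applied to the Fourier-lifted input."""
    h = np.maximum(fourier_features(x) @ w1 + b1, 0.0)
    return h @ w2 + b2
```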
Modifications to the DiT: QKNorm & RMSNorm (to stabilize computation) + MLP decoder (to improve the capacity to approximate nonlinear robot actions, the final linear decoder is replaced with a nonlinear MLP decoder, serving as the projection from the latent space back to the physical space) + alternating condition injection (in this model, the image and language inputs serve as conditions that are high-dimensional and of variable length, in contrast to the class-label conditions of the conventional DiT).
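A minimal sketch of the normalization changes: RMSNorm in place of LayerNorm, and normalized queries/keys before the dot product so attention logits stay bounded (the QKNorm idea). Learned scale parameters and multi-head splitting are omitted for brevity, so this is an illustration rather than RDT's exact layers:

```python
import numpy as np

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm: rescale by root-mean-square without mean-centering,
    a cheaper and often more numerically stable variant of LayerNorm."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def qk_norm_attention(q, k, v):
    """Single-head attention with normalized queries and keys, which keeps
    the pre-softmax logits bounded even as model scale grows."""
    q, k = rms_norm(q), rms_norm(k)
    logits = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```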
Diffusion process:
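Training follows the standard denoising-diffusion recipe: corrupt a ground-truth action chunk with scheduled noise, then train the network to recover it conditioned on the observations. A minimal sketch, where the step count, the linear noise schedule, and the x0-prediction loss are generic DDPM choices and not necessarily RDT's exact settings:

```python
import numpy as np

T = 100                                   # assumed number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule (assumption)
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(a0, t, noise):
    """Forward process: corrupt a clean action chunk a0 to noise level t."""
    return np.sqrt(alphas_bar[t]) * a0 + np.sqrt(1.0 - alphas_bar[t]) * noise

def training_loss(model, a0, cond, rng):
    """One training step: the network predicts the clean actions from the
    noised chunk, the timestep, and the multi-modal condition."""
    t = int(rng.integers(T))
    noise = rng.standard_normal(a0.shape)
    a_t = q_sample(a0, t, noise)
    return float(np.mean((model(a_t, t, cond) - a0) ** 2))
```

At inference, the model starts from pure noise and iteratively denoises to produce an action chunk; because the reverse process is stochastic, it can represent the multi-modal action distributions motivated above.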
Experiments: training on heterogeneous multi-robot data. Specifically, the pre-training collection includes 46 datasets of various robots, with a total size of 1M+ trajectories and 21TB.
Collecting a Comprehensive Multi-Task Bimanual Dataset.
Ablation studies show that diffusion modeling, larger model size, and larger data size all contribute to performance.
Conclusion: RDT not only demonstrates significant improvements in dexterous bimanual capability and instruction following, but also achieves remarkable performance in few-shot learning and zero-shot generalization to unseen objects and scenes.