【A Key Piece of the AGI Puzzle (with Implementation Code): The Team Behind UniAD Strikes Again, as UniVLA Builds a General Action Guide for Robots】

Article link: https://blog.youkuaiyun.com/soaring_casia/article/details/148421741

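How an agent can decide and act efficiently in complex, changing environments has long been a core question for researchers. UniAD, the widely noted autonomous-driving work, unified perception, prediction, and planning in a single network and demonstrated strong end-to-end capability. Following it, OpenDriveLab, led by Hongyang Li, recently proposed a "general action guide" for robotics: UniVLA.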

©️【深蓝具身智能】

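UniVLA is a unified vision-language-action framework that supports policy learning across different environments. It derives task-centric latent actions in an unsupervised manner and can use data from arbitrary embodiments and viewpoints without action labels. After large-scale video pretraining, it yields a cross-embodiment generalist policy; combined with a low-cost action decoder, UniVLA can be readily deployed on a variety of robots. Compared with OpenVLA, UniVLA improves performance on both manipulation and navigation tasks.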

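In this article we analyze the three key stages of this pipeline in depth: latent action learning, generalist policy pretraining, and post-training for deployment, and we walk through the model alongside the reproduced code.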

Method

We first introduce the key stages of the UniVLA model in turn.

Task-Centric Latent Action Learning

▲Figure 1 | Two-stage training pipeline of the latent action model ©️ compiled by 【深蓝具身智能】

class DINO_LAM(LightningModule):
    """
    A latent action model operates at the DINO latent space
    """
    def __init__(
            self,
            ...
    ) -> None:
        super(DINO_LAM, self).__init__()
        assert stage in ['stage-1', 'stage-2']

        lam = UncontrolledDINOLatentActionModel if stage == 'stage-1' else ControllableDINOLatentActionModel

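Code 2:

Here attention_mask is the attention mask of the tokenized task instruction, extended with ones so that it also covers the num_codes action-code tokens and the image-patch tokens; lang_embed is the projected task-instruction embedding. The video frames, the instruction embedding, and the mask are passed to vq_encode for VQ-VAE quantized encoding, which yields the quantized latent actions (z_q). The decoder then reconstructs the frames from the latent actions together with the patch features of the earlier frames.

(The forward method of UncontrolledDINOLatentActionModel is used for illustration here; the forward method of ControllableDINOLatentActionModel is largely similar.)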

class UncontrolledDINOLatentActionModel(nn.Module):
    def forward(self, batch: Dict) -> Dict:
        ...
        lang_embed, attention_mask = self.encode_text(batch["task_instruction"])
        lang_embed = self.lang_proj(lang_embed)
        attention_mask = torch.cat([torch.ones((B, self.num_codes + (H // self.patch_size)**2)).to(self.device),
                                    attention_mask],
                                    dim = -1)

        outputs = self.vq_encode(batch["videos"], repeat(lang_embed, 'b l d -> b T l d', T=T), attention_mask.repeat(T, 1))

        video_patches = self.patch_up(outputs["patches"][:, :-1])
        action_patches = self.action_up(outputs["z_q"])
        video_action_patches = torch.cat([action_patches, video_patches], dim=2)

        # Decode
        video_recon = self.decoder(video_action_patches, outputs['lang_emb'], attention_mask)
        video_recon = video_recon[:, :, self.num_codes: self.num_codes + video_patches.shape[2]]
        ...
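The internals of vq_encode are not shown in the excerpt above. As a rough illustration of the quantization step, and assuming the standard VQ-VAE nearest-neighbour lookup, the sketch below maps a continuous vector to its closest codebook entry. The names `squared_distance` and `quantize` and the toy codebook are hypothetical and not taken from the UniVLA code.

```python
from typing import List, Tuple


def squared_distance(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z: List[float], codebook: List[List[float]]) -> Tuple[int, List[float]]:
    """Return (index, entry) of the codebook vector closest to z.

    Assumption: plain nearest-neighbour lookup as in a standard VQ-VAE;
    the real vq_encode may differ in its details.
    """
    index = min(range(len(codebook)), key=lambda i: squared_distance(z, codebook[i]))
    return index, codebook[index]


# Toy codebook with four 2-D entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z = [0.9, 0.2]  # continuous encoder output for one latent action slot
index, z_q = quantize(z, codebook)
print(index, z_q)
```

The discrete index is what makes a latent action usable as a token later on, while z_q is the vector the decoder consumes.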

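Code 3:

The call outputs = self.lam(batch) returns the decoder's reconstruction, which is compared with the ground truth. The total training loss is the sum of the reconstruction MSE, the quantization loss, and the commitment loss weighted by vq_beta.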

class DINO_LAM(LightningModule):
    def shared_step(self, batch: Dict) -> Tuple:
        # batch: keys['videos', 'task_instruction', 'action', 'dataset_names']

        outputs = self.lam(batch)
        gt_future_frames = outputs["target"]

        # Compute loss
        mse_loss = ((gt_future_frames - outputs["recon"]) ** 2).mean()
        q_loss = ((outputs["emb"].detach() - outputs["z"]) ** 2).mean()
        commit_loss = ((outputs["emb"] - outputs["z"].detach()) ** 2).mean()

        loss = mse_loss + q_loss + self.vq_beta * commit_loss
        ...
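To make the loss composition concrete, here is a small numeric sketch of the same three terms in plain Python. Without autograd, q_loss and commit_loss have the same value; in the PyTorch code they differ only in which tensor is detached, that is, in where the gradient flows. The helper names `mean_squared_error` and `lam_loss` and all input values are made up for illustration.

```python
from typing import List


def mean_squared_error(a: List[float], b: List[float]) -> float:
    """Mean of element-wise squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def lam_loss(target: List[float], recon: List[float],
             emb: List[float], z: List[float], vq_beta: float) -> float:
    """Total loss = reconstruction MSE + quantization term + vq_beta * commitment term."""
    mse_loss = mean_squared_error(target, recon)
    # Numerically identical here; in PyTorch one term detaches emb and the
    # other detaches z, so each term updates a different part of the model.
    q_loss = mean_squared_error(emb, z)
    commit_loss = mean_squared_error(emb, z)
    return mse_loss + q_loss + vq_beta * commit_loss


# Illustrative values only.
loss = lam_loss(target=[1.0, 2.0], recon=[1.5, 2.0],
                emb=[0.0, 1.0], z=[0.5, 1.0], vq_beta=0.25)
print(loss)
```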

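Generalist Policy Pretraining

▲Figure 2 | Framework of the generalist policy ©️ compiled by 【深蓝具身智能】

Generalist policy backbone:

The backbone is Prismatic-7B, an architecture that integrates a visual encoder fusing SigLIP and DINOv2, a projection layer that aligns visual embeddings with the language modality, and the LLaMA-2 large language model.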

class PrismaticVLM(VLM):
    def forward(
        self,
        ...
    ) -> CausalLMOutputWithPast:
        """Run a forward pass through the VLM, returning a CausalLMOutputWithPast instance (contains loss)."""
        ...
        # Run Visual Feature Extraction
        with torch.set_grad_enabled(self.vision_backbone_requires_grad):
            if isinstance(pixel_values, dict):
                patch_features = self.vision_backbone({k: pixel_values[k][multimodal_indices] for k in pixel_values})
            else:
                patch_features = self.vision_backbone(pixel_values[multimodal_indices])

        # Projection Logic :: [bsz, num_patches, llm_embed_dim] =>> num_patches = (2 *) (256 + 1) for ViT-L + CLS
        projected_patch_embeddings = self.projector(patch_features)
        projected_patch_attention_mask = None
        if attention_mask is not None:
            projected_patch_attention_mask = torch.full(
                (projected_patch_embeddings.shape[0], projected_patch_embeddings.shape[1]),
                True,
                dtype=attention_mask.dtype,
                device=attention_mask.device,
            )

        # Get Input Embeddings from LLM Backbone :: [bsz, input_seq_len, llm_embed_dim]
        input_embeddings = self.llm_backbone.embed_input_ids(input_ids)

        # Build Multimodal Embeddings (and build resulting attention mask)
        multimodal_embeddings = torch.cat(
            [
                input_embeddings[multimodal_indices, :1, :],
                projected_patch_embeddings,
                input_embeddings[multimodal_indices, 1:, :],
            ],
            dim=1,
        )
        multimodal_attention_mask = None
        if attention_mask is not None:
            multimodal_attention_mask = torch.cat(
                [
                    attention_mask[multimodal_indices, :1],
                    projected_patch_attention_mask,
                    attention_mask[multimodal_indices, 1:],
                ],
                dim=1,
            )

        # [Contract] We assume the first token of `labels` (associated with <BOS>) is already marked as "IGNORE"
        #   => We'll ignore the per-token outputs for each of the patch embeddings as well!
        multimodal_labels = None
        if labels is not None:
            projected_patch_labels = torch.full(
                (projected_patch_embeddings.shape[0], projected_patch_embeddings.shape[1]),
                IGNORE_INDEX,
                dtype=labels.dtype,
                device=labels.device,
            )
            multimodal_labels = torch.cat(
                [labels[multimodal_indices, :1], projected_patch_labels, labels[multimodal_indices, 1:]], dim=1
            )

        # === Add Unimodal Handling ===

        # Create Fused Embeddings, Attention Mask, and Labels by Merging with "unimodal" Inputs (if applicable)
        unimodal_indices = torch.tensor(
            [idx for idx in range(len(input_ids)) if idx not in multimodal_indices],
            dtype=torch.long,
            device=multimodal_indices.device,
        )

        # No "unimodal" data --> Fused == Multimodal
        if len(unimodal_indices) == 0:
            fused_embeddings = multimodal_embeddings
            fused_attention_mask = multimodal_attention_mask
            fused_labels = multimodal_labels

        else:
            # Otherwise --> Merge w/ unimodal data

            # This doesn't matter --> but in the "normal" case this is the embedding of the <PAD> token
            #   => NOTE :: Verified that `zeros/randn/empty/<PAD> embedding` all return the same result!
            unimodal_embeddings_pad = torch.zeros(
                (len(unimodal_indices), projected_patch_embeddings.shape[1], input_embeddings.shape[2]),
                dtype=input_embeddings.dtype,
                device=input_embeddings.device,
            )
            unimodal_attention_pad = torch.full(
                (len(unimodal_indices), projected_patch_embeddings.shape[1]),
                False,
                dtype=attention_mask.dtype,
                device=attention_mask.device,
            )
            unimodal_labels_pad = torch.full(
                (len(unimodal_indices), projected_patch_embeddings.shape[1]),
                IGNORE_INDEX,
                dtype=labels.dtype,
                device=labels.device,
            )

            unimodal_embeddings = torch.cat([input_embeddings[unimodal_indices], unimodal_embeddings_pad], dim=1)
            unimodal_attention_mask = torch.cat([attention_mask[unimodal_indices], unimodal_attention_pad], dim=1)
            unimodal_labels = torch.cat([labels[unimodal_indices], unimodal_labels_pad], dim=1)

            # Create "Fused" Tensors by Stacking Multimodal & Unimodal
            fused_embeddings = torch.vstack([multimodal_embeddings, unimodal_embeddings])
            fused_attention_mask = torch.vstack([multimodal_attention_mask, unimodal_attention_mask])
            fused_labels = torch.vstack([multimodal_labels, unimodal_labels])

        # Run LLM Forward --> returns CausalLMOutputWithPast!
        return self.llm_backbone(
            input_ids=None,
            attention_mask=fused_attention_mask,
            position_ids=None,
            past_key_values=past_key_values,
            inputs_embeds=fused_embeddings,
            labels=fused_labels,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
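The core of the forward pass above is the splice: patch embeddings are inserted right after the first (<BOS>) token, and their label positions are filled with IGNORE_INDEX so they contribute no loss. The toy sketch below reproduces that layout with plain lists. The function `splice_patches`, the placeholder patch names, and the label ids are hypothetical; IGNORE_INDEX is assumed to be -100, the default ignore_index of PyTorch's cross-entropy loss.

```python
from typing import List, Tuple

# Assumed value: -100 is the default ignore_index of PyTorch's cross-entropy loss.
IGNORE_INDEX = -100


def splice_patches(tokens: List[str], labels: List[int],
                   num_patches: int) -> Tuple[List[str], List[int], List[bool]]:
    """Insert num_patches visual slots after the first token of a text sequence.

    Mirrors the torch.cat([x[:, :1], patches, x[:, 1:]], dim=1) pattern
    for a single example, using placeholder strings instead of embeddings.
    """
    patch_slots = [f"<patch_{i}>" for i in range(num_patches)]
    fused_tokens = tokens[:1] + patch_slots + tokens[1:]
    # Patch positions never produce a training signal.
    fused_labels = labels[:1] + [IGNORE_INDEX] * num_patches + labels[1:]
    # Every position is attended to: text tokens and patches alike.
    fused_mask = [True] * len(fused_tokens)
    return fused_tokens, fused_labels, fused_mask


tokens = ["<BOS>", "pick", "up", "cup"]
labels = [IGNORE_INDEX, 11, 12, 13]  # illustrative token ids
fused_tokens, fused_labels, fused_mask = splice_patches(tokens, labels, num_patches=3)
print(fused_tokens)
print(fused_labels)
```

Keeping <BOS> in front of the patches preserves the position the language model expects for its start token.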

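Post-Training for Deployment

Latent action decoder:

When deploying to downstream tasks, the next latent action predicted by the pretrained generalist policy is embodiment-agnostic. To bridge the gap between latent actions and executable robot behavior, an additional action decoder is needed, namely the latent action decoder, described as follows:

First, multi-head attention pooling aggregates the visual embeddings into a single token, which then serves as the initial query for pooling the latent action tokens. This process can be expressed as:

Code note: implementation of the latent action decoder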

class ActionDecoder(torch.nn.Module):
    def __init__(self, window_size = 12, hidden_dim = 512):
        super().__init__()
        self.latent_action_pool = MAPBlock(n_latents = 1, vis_dim = 4096, embed_dim = hidden_dim, n_heads = hidden_dim // 64)
        self.visual_pool = MAPBlock(n_latents = 1, vis_dim = 4096, embed_dim = hidden_dim, n_heads = hidden_dim // 64)

        self.proj = nn.Sequential(
                                nn.Linear(hidden_dim, 7 * window_size),
                                nn.Tanh(),
                    )

    def forward(self, latent_action_tokens, visual_embed):
        visual_embed = self.visual_pool(visual_embed)
        latent_action_tokens = latent_action_tokens[:, -4:]
        action_token = self.latent_action_pool(latent_action_tokens, init_embed = visual_embed)

        action = self.proj(action_token)

        return action
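Two ideas in the decoder can be sketched without PyTorch: pooling a set of tokens into one vector with a single query, and reading the flat 7 * window_size output as a chunk of window_size actions. The sketch below is a simplification and not the MAPBlock implementation: it uses one head, no learned projections, and the tokens serve as both keys and values. Interpreting each row as a 7-dimensional action and reshaping to (window_size, 7) are assumptions; `attention_pool` and `to_action_chunk` are hypothetical names.

```python
import math
from typing import List


def attention_pool(query: List[float], tokens: List[List[float]]) -> List[float]:
    """Single-head, single-query attention pooling over a list of tokens."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(d)]


def to_action_chunk(flat: List[float], window_size: int,
                    action_dim: int = 7) -> List[List[float]]:
    """Apply tanh and split a flat vector into window_size rows of action_dim values."""
    if len(flat) != window_size * action_dim:
        raise ValueError("flat vector length must equal window_size * action_dim")
    bounded = [math.tanh(v) for v in flat]  # tanh keeps each value in [-1, 1]
    return [bounded[i * action_dim:(i + 1) * action_dim] for i in range(window_size)]


# A zero query gives uniform weights, so the pooled vector is the token mean.
pooled = attention_pool([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
chunk = to_action_chunk([0.0] * 7 + [100.0] * 7, window_size=2)
print(pooled, len(chunk), len(chunk[0]))
```

Because the output passes through tanh, the decoded actions are bounded, so any mapping back to a robot's real action ranges has to happen outside this module.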

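Learning from historical outputs:

UniVLA proposes using historical latent actions to aid decision-making in robot control: past actions are added to the input prompt at every time step. This establishes a feedback loop for the robot policy, allowing it to learn from its own decisions and adapt to dynamic environments.

To implement this, the latent action model is used to annotate the historical actions. The annotated actions are then mapped into the LLaMA tokenizer vocabulary and appended to the task instruction.

During post-training, the historical action inputs are integrated into model training, which gives the model in-context learning ability.

At inference, every step except the first includes the previous step's latent action (encoded as N=4 tokens) in its input prompt.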
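As a minimal sketch of the prompt construction just described: the previous step's latent action, represented as four codebook indices, is turned into tokens and appended to the instruction, and the first step has no history. The token format `<ACT_k>`, the separator, and the function names are hypothetical; the actual special tokens and prompt template used by UniVLA are not shown in this article.

```python
from typing import List, Optional

NUM_LATENT_TOKENS = 4  # N=4 tokens per latent action, as stated in the text


def latent_action_to_text(codes: List[int]) -> str:
    """Render codebook indices as special tokens (hypothetical <ACT_k> format)."""
    return "".join(f"<ACT_{c}>" for c in codes)


def build_prompt(instruction: str, history: Optional[List[int]] = None) -> str:
    """Append the previous latent action to the instruction, if one exists."""
    if history is None:
        # First step of an episode: no previous action to condition on.
        return instruction
    if len(history) != NUM_LATENT_TOKENS:
        raise ValueError(f"expected {NUM_LATENT_TOKENS} codes, got {len(history)}")
    return f"{instruction} {latent_action_to_text(history)}"


first_prompt = build_prompt("pick up the cup")
later_prompt = build_prompt("pick up the cup", history=[3, 0, 7, 1])
print(first_prompt)
print(later_prompt)
```

At each later step the history would be the latent action the policy itself predicted one step earlier, which is what closes the feedback loop.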

Code note: to feed in or predict multi-step actions, adjust window_size.

vla_dataset = RLDSDataset(
    cfg.data_root_dir,
    cfg.dataset_name,
    batch_transform,
    resize_resolution=tuple(wrapped_model.module.vla.config.image_sizes),
    shuffle_buffer_size=cfg.shuffle_buffer_size,
    image_aug=cfg.image_aug,
    window_size=cfg.window_size + 1,        # for constructing history latent actions
    training_phase='post-training',
)