Beyond VLA: A Roundup of Embodied Vision + Action (VA) Work

For more in-depth material, you are welcome to join the Embodied AI Heart Knowledge Planet (具身智能之心知识星球), the first full-stack embodied intelligence learning community in China, which covers everything you are looking for.

We have previously focused on VLA-related work; here we compile embodied Vision + Action (VA) work for you, covering robotic manipulation, diffusion policies (DP), whole-body control, one-shot learning, sim2real, end-to-end learning, and more. All content is sourced from the Embodied AI Heart Knowledge Planet.
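Since diffusion policies (DP) account for a large share of the entries below, a minimal sketch of the inference loop they share may help when skimming the list: an action trajectory is initialized as Gaussian noise and iteratively denoised conditioned on the current observation. The `dummy_denoiser` stub, the linear noise schedule, and the horizon/action dimensions below are illustrative placeholders only, not taken from any particular paper.

```python
import numpy as np

def diffusion_policy_inference(denoiser, obs_embedding, horizon=16, action_dim=7,
                               n_steps=50, seed=0):
    """DDPM-style ancestral sampling of an action trajectory.

    `denoiser(noisy_actions, t, obs_embedding)` is assumed to predict the noise
    (epsilon) added at diffusion step t, conditioned on the observation."""
    rng = np.random.default_rng(seed)
    actions = rng.standard_normal((horizon, action_dim))  # start from pure Gaussian noise
    betas = np.linspace(1e-4, 2e-2, n_steps)              # toy linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    for t in reversed(range(n_steps)):
        eps = denoiser(actions, t, obs_embedding)
        # Standard DDPM mean update: subtract the predicted noise and rescale...
        actions = (actions - (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # ...then re-inject a small amount of noise on all but the final step.
            actions += np.sqrt(betas[t]) * rng.standard_normal(actions.shape)
    return actions  # (horizon, action_dim) action chunk for the robot to execute

# Dummy conditioning and denoiser, only to show the call pattern; a real diffusion
# policy would plug in a trained network (e.g., a U-Net or transformer) here.
dummy_denoiser = lambda noisy, t, obs: np.zeros_like(noisy)
trajectory = diffusion_policy_inference(dummy_denoiser, obs_embedding=np.zeros(128))
print(trajectory.shape)  # (16, 7)
```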

Work from 2025

  • [2025] Steering Your Diffusion Policy with Latent Space Reinforcement Learning

  • [2025] [ByteDance Seed] Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation

  • [2025] [RSS 25] Unified Video Action Model

  • [2025] Streaming Flow Policy: Simplifying diffusion/flow-matching policies by treating action trajectories as flow trajectories

  • [2025] Modality-Composable Diffusion Policy via Inference-Time Distribution-level Composition

  • [2025] Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning

  • [2025] BEHAVIOR Robot Suite: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities

  • [2025] [RSS 25] Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation

  • [2025] Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics

  • [2025] You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations

  • [2025] ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills

  • [2025] VILP: Imitation Learning with Latent Video Planning

  • [2025] Learning the RoPEs: Better 2D and 3D Position Encodings with STRING

  • [2025] When Pre-trained Visual Representations Fall Short: Limitations in Visuo-Motor Robot Learning

  • [2025] RoboGrasp: A Universal Grasping Policy for Robust Robotic Control

  • [2025] CordViP: Correspondence-based Visuomotor Policy for Dexterous Manipulation in Real-World

  • [2025] Learning to Group and Grasp Multiple Objects

  • [2025] Beyond Behavior Cloning: Robustness through Interactive Imitation and Contrastive Learning

  • [2025] COMBO-Grasp: Learning Constraint-Based Manipulation for Bimanual Occluded Grasping

  • [2025] DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References

  • [2025] S2-Diffusion: Generalizing from Instance-level to Category-level Skills in Robot Manipulation

  • [2025] MTDP: Modulated Transformer Diffusion Policy Model

  • [2025] FUNCTO: Function-Centric One-Shot Imitation Learning for Tool Manipulation

  • [2025] RHINO: Learning Real-Time Humanoid-Human-Object Interaction from Human Demonstrations

  • [2025] Responsive Noise-Relaying Diffusion Policy: Responsive and Efficient Visuomotor Control

  • [2025] Learning a High-quality Robotic Wiping Policy Using Systematic Reward Analysis and Visual-Language Model Based Curriculum

  • [2025] IMLE Policy: Fast and Sample Efficient Visuomotor Policy Learning via Implicit Maximum Likelihood Estimation

  • [2025] X-IL: Exploring the Design Space of Imitation Learning Policies

  • [2025] Towards Fusing Point Cloud and Visual Representations for Imitation Learning

  • [2025] Pick-and-place Manipulation Across Grippers Without Retraining: A Learning-optimization Diffusion Policy Approach

  • [2025] FACTR: Force-Attending Curriculum Training for Contact-Rich Policy Learning

  • [2025] DemoGen: Synthetic Demonstration Generation for Data-Efficient Visuomotor Policy Learning

  • [2025] Human2Robot: Learning Robot Actions from Paired Human-Robot Videos

  • [2025] AnyDexGrasp: General Dexterous Grasping for Different Hands with Human-level Learning Efficiency

  • [2025] COMPASS: Cross-embOdiment Mobility Policy via ResiduAl RL and Skill Synthesis

  • [2025] Retrieval Dexterity: Efficient Object Retrieval in Clutters with Dexterous Hand

  • [2025] From planning to policy: distilling Skill-RRT for long-horizon prehensile and non-prehensile manipulation

  • [2025] FetchBot: Object Fetching in Cluttered Shelves via Zero-Shot Sim2Real

  • [2025] Point Policy: Unifying Observations and Actions with Key Points for Robot Manipulation

  • [2025] FuseGrasp: Radar-Camera Fusion for Robotic Grasping of Transparent Objects

  • [2025] Sensor-Invariant Tactile Representation

  • [2025] Generalist World Model Pre-Training for Efficient Reinforcement Learning

  • [2025] ProDapt: Proprioceptive Adaptation using Long-term Memory Diffusion

  • [2025] Falcon: Fast Visuomotor Policies via Partial Denoising

  • [2025] HGDiffuser: Efficient Task-Oriented Grasp Generation via Human-Guided Grasp Diffusion Models

  • [2025] SHADOW: Leveraging Segmentation Masks for Cross-Embodiment Policy Transfer

  • [2025] Phantom: Training Robots Without Robots Using Only Human Videos

  • [2025] General Force Sensation for Tactile Robot

  • [2025] Action Tokenizer Matters in In-Context Imitation Learning

  • [2025] AVR: Active Vision-Driven Robotic Precision Manipulation with Viewpoint and Focal Length Optimization

  • [2025] FRMD: Fast Robot Motion Diffusion with Consistency-Distilled Movement Primitives for Smooth Action Generation

  • [2025] Variable-Friction In-Hand Manipulation for Arbitrary Objects via Diffusion-Based Imitation Learning

  • [2025] Learning Dexterous In-Hand Manipulation with Multifingered Hands via Visuomotor Diffusion

  • [2025] RGBSQGrasp: Inferring Local Superquadric Primitives from Single RGB Image for Graspability-Aware Bin Picking

  • [2025] ArticuBot: Learning Universal Articulated Object Manipulation Policy via Large Scale Simulation

  • [2025] SRSA: Skill Retrieval and Adaptation for Robotic Assembly Tasks

  • [2025] GAGrasp: Geometric Algebra Diffusion for Dexterous Grasping

  • [2025] OPG-Policy: Occluded Push-Grasp Policy Learning with Amodal Segmentation

  • [2025] RA-DP: Rapid Adaptive Diffusion Policy for Training-Free High-frequency Robotics Replanning

  • [2025] Robotic Compliant Object Prying Using Diffusion Policy Guided by Vision and Force Observations

  • [2025] CoinRobot: Generalized End-to-end Robotic Learning for Physical Intelligence

  • [2025] Persistent Object Gaussian Splat (POGS) for Tracking Human and Robot Manipulation of Irregularly Shaped Objects

  • [2025] How to Train Your Robots? The Impact of Demonstration Modality on Imitation Learning

  • [2025] One-Shot Dual-Arm Imitation Learning

  • [2025] GAT-Grasp: Gesture-Driven Affordance Transfer for Task-Aware Robotic Grasping

  • [2025] Enhanced View Planning for Robotic Harvesting: Tackling Occlusions with Imitation Learning

  • [2025] ES-Parkour: Advanced Robot Parkour with Bio-inspired Event Camera and Spiking Neural Network

  • [2025] NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models

  • [2025] World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning

  • [2025] RILe: Reinforced Imitation Learning

  • [2025] HumanoidPano: Hybrid Spherical Panoramic-LiDAR Cross-Modal Perception for Humanoid Robots

  • [2025] Distillation-PPO: A Novel Two-Stage Reinforcement Learning Framework for Humanoid Robot Perceptive Locomotion

  • [2025] Trinity: A Modular Humanoid Robot AI System

  • [2025] LiPS: Large-Scale Humanoid Robot Reinforcement Learning with Parallel-Series Structures

  • [2025] Elastic Motion Policy: An Adaptive Dynamical System for Robust and Efficient One-Shot Imitation Learning

  • [2025] Learning Gentle Grasping Using Vision, Sound, and Touch

  • [2025] RoboCopilot: Human-in-the-loop Interactive Imitation Learning for Robot Manipulation

  • [2025] Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework

  • [2025] MoE-Loco: Mixture of Experts for Multitask Locomotion

  • [2025] Humanoid Policy ~ Human Policy

  • [2025] Dense Policy: Bidirectional Autoregressive Learning of Actions

  • [2025] Learning to Play Piano in the Real World

  • [2025] CCDP: Composition of Conditional Diffusion Policies with Guided Sampling

  • [2025] DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation

  • [2025] AdaWorld: Learning Adaptable World Models with Latent Actions

  • [2025] Visuo-Tactile Object Pose Estimation for a Multi-Finger Robot Hand with Low-Resolution In-Hand Tactile Sensing

  • [2025] Empirical Analysis of Sim-and-Real Cotraining Of Diffusion Policies For Planar Pushing from Pixels

  • [2025] ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning

  • [2025] Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation

  • [2025] HACTS: a Human-As-Copilot Teleoperation System for Robot Learning

  • [2025] ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos

  • [2025] Learning Coordinated Bimanual Manipulation Policies using State Diffusion and Inverse Dynamics Models

  • [2025] Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

  • [2025] RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics

  • [2025] Slot-Level Robotic Placement via Visual Imitation from Single Human Video

  • [2025] Robust Dexterous Grasping of General Objects from Single-view Perception

  • [2025] Two by Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation

  • [2025] ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping

  • [2025] Novel Demonstration Generation with Gaussian Splatting Enables Robust One-Shot Manipulation

  • [2025] Grasping Deformable Objects via Reinforcement Learning with Cross-Modal Attention to Visuo-Tactile Inputs

  • [2025] Few-Shot Vision-Language Action-Incremental Policy Learning

  • [2025] Latent Diffusion Planning for Imitation Learning

  • [2025] Physically Consistent Humanoid Loco-Manipulation using Latent Diffusion Models

  • [2025] PRISM-DP: Spatial Pose-based Observations for Diffusion-Policies via Segmentation, Mesh Generation, and Pose Tracking

  • [2025] Rethinking Latent Representations in Behavior Cloning: An Information Bottleneck Approach for Robot Manipulation

  • [2025] Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation

  • [2025] Fast Flow-based Visuomotor Policies via Conditional Optimal Transport Couplings

  • [2025] KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation

  • [2025] CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations

  • [2025] H3DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning

  • [2025] UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations

  • [2025] Learning Long-Context Diffusion Policies via Past-Token Prediction

  • [2025] DataMIL: Selecting Data for Robot Imitation Learning with Datamodels

  • [2025] [ICLR 25] Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning

  • [2025] IN-RIL: Interleaved Reinforcement and Imitation Learning for Policy Fine-Tuning

  • [2025] NVSPolicy: Adaptive Novel-View Synthesis for Generalizable Language-Conditioned Policy Learning

  • [2025] EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation

  • [2025] FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation

  • [2025] Conditioning Matters: Training Diffusion Policies is Faster Than You Think

  • [2025] H2R: A Human-to-Robot Data Augmentation for Robot Pre-training from Videos

  • [2025] GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation

  • [2025] Zero-Shot Visual Generalization in Robot Manipulation

  • [2025] Object-Centric Representations Improve Policy Generalization in Robot Manipulation

  • [2025] LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation

  • [2025] GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation

  • [2025] A Practical Guide for Incorporating Symmetry in Diffusion Policy

  • [2025] Adaptive Visuo-Tactile Fusion with Predictive Force Attention for Dexterous Manipulation

  • [2025] EquAct: An SE(3)-Equivariant Multi-Task Transformer for Open-Loop Robotic Manipulation

  • [2025] Spatial RoboGrasp: Generalized Robotic Grasping Control Policy

  • [2025] Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt

  • [2025] [AAAI 25] FlowPolicy: Enabling Fast and Robust 3D Flow-Based Policy via Consistency Flow Matching for Robot Manipulation

  • [2025] Object-centric 3D Motion Field for Robot Learning from Human Videos

  • [2025] Evaluating Robot Policies in a World Model

  • [2025] 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model

  • [2025] SpikePingpong: High-Frequency Spike Vision-based Robot Learning for Precise Striking in Table Tennis Game

  • [2025] SAIL: Faster-than-Demonstration Execution of Imitation Learning Policies

  • [2025] Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation

  • [2025] Touch begins where vision ends: Generalizable policies for contact-rich manipulation

  • [2025] AMPLIFY: Actionless Motion Priors for Robot Learning from Videos

  • [2025] GAF: Gaussian Action Field as a Dynamic World Model for Robotic Manipulation

  • [2025] Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation

  • [2025] Latent Action Diffusion for Cross-Embodiment Manipulation

  • [2025] Vision in Action: Learning Active Perception from Human Demonstrations

  • [2025] [IROS 25] Robust Instant Policy: Leveraging Student’s t-Regression Model for Robust In-context Imitation Learning of Robot Manipulation

  • [2025] [RSS 25] Dex1B: Learning with 1B Demonstrations for Dexterous Manipulation

  • [2025] DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy

  • [2025] World4Omni: A Zero-Shot Framework from Image Generation World Model to Robotic Manipulation

  • [2025] ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation

  • [2025] [ICCV 25] Spatial-Temporal Aware Visuomotor Diffusion Policy Learning

Work from 2024

  • [2024] Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching

  • [2024] Point Cloud Matters: Rethinking the Impact of Different Observation Spaces on Robot Learning

  • [2024] [RSS 25] 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

  • [2024] Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning

  • [2024] ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation

  • [2024] [ICLR 25] Diffusion Policy Policy Optimization

  • [2024] Language-Guided Object-Centric Diffusion Policy for Collision-Aware Robotic Manipulation

  • [2024] EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning

  • [2024] Equivariant Diffusion Policy

  • [2024] [IROS 25] Mamba Policy: Towards Efficient 3D Diffusion Policy with Hybrid Selective State Models

  • [2024] Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies

  • [2024] Motion Before Action: Diffusing Object Motion as Manipulation Condition

  • [2024] One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation

  • [2024] Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation

  • [2024] SPOT: SE(3) Pose Trajectory Diffusion for Object-Centric Manipulation

  • [2024] Few-Shot Task Learning through Inverse Generative Modeling

  • [2024] G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation

  • [2024] Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation

  • [2024] Diffusion Policy Attacker: Crafting Adversarial Attacks for Diffusion-based Policies

  • [2024] Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies

  • [2024] Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation

  • [2024] Data Scaling Laws in Imitation Learning for Robotic Manipulation

  • [2024] Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation

  • [2024] Learning Universal Policies via Text-Guided Video Generation

  • [2024] Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning

  • [2024] 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations

  • [2024] Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation

  • [2024] GenDP: 3D Semantic Fields for Category-Level Generalizable Diffusion Policy

  • [2024] Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation

  • [2024] Prediction with Action: Visual Policy Learning via Joint Denoising Process

  • [2024] Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

  • [2024] Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling

  • [2024] Streaming Diffusion Policy: Fast Policy Synthesis with Variable Noise Diffusion Models

  • [2024] CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction

  • [2024] In-Context Imitation Learning via Next-Token Prediction

  • [2024] Learning Diffusion Policies from Demonstrations For Compliant Contact-rich Manipulation

Work from 2023

  • [2023] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

  • [2023] Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods

### The Difference Between VLM and VLA in Robotics

In robotics, the Vision-Language Model (VLM) and the Vision-Language-Action model (VLA) are two related but distinct concepts and technical directions.

#### Vision-Language Models (VLM)

A vision-language model is a multimodal approach that combines natural language processing with computer vision. Such a model can understand the content of an image or video and relate it to text: given a picture, it can generate a matching description, or given a description, it can retrieve the matching image. VLMs are therefore mainly used for cross-modal understanding and generation tasks such as image-text retrieval, visual question answering, and scene interpretation.

```python
import requests
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load any RGB image; this COCO validation photo is only a convenient example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=text, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # softmax turns the scores into a probability distribution
print(probs)
```

The code above shows how a pre-trained CLIP model can score the similarity between an image and candidate texts, a typical VLM-style application.

#### Vision-Language-Action Models (VLA)

By contrast, a vision-language-action model targets robot control rather than pure understanding. A VLA extends a VLM with an action output: it takes the robot's visual observations together with a language instruction and directly produces executable actions (for example, end-effector poses, joint commands, or discretized action tokens), typically running in a closed loop as the scene changes. Where a VLM is judged by the quality of its semantic understanding and generation, a VLA is judged by whether the resulting actions actually accomplish the task on a physical robot.

In short, both fall under the umbrella of "vision," but their goals differ markedly: the former pursues human-level (or better) cross-modal understanding, while the latter aims to turn that understanding into the basic skill set a robot needs to act in complex physical environments.
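To make the contrast concrete, here is a minimal, hypothetical sketch of the interface a VLA-style policy exposes: one camera frame plus one instruction in, one low-level action out. The class name `ToyVLAPolicy`, the random projections standing in for a trained vision-language backbone, and the 7-dimensional action layout are all illustrative assumptions, not any specific model from the lists above.

```python
import numpy as np

class ToyVLAPolicy:
    """Illustrative stand-in for a vision-language-action (VLA) policy.

    A real VLA would replace the random projections below with a pre-trained
    vision-language backbone plus an action head; only the *interface*
    (image + instruction in, low-level action out) is the point here."""

    def __init__(self, embed_dim=64, action_dim=7, seed=0):
        rng = np.random.default_rng(seed)
        # Placeholder "weights": random projections standing in for trained networks.
        self.img_proj = rng.normal(size=(3 * 32 * 32, embed_dim))
        self.txt_proj = rng.normal(size=(256, embed_dim))
        self.action_head = rng.normal(size=(embed_dim, action_dim))

    def _encode_image(self, image):
        # Toy visual encoder: resample to a fixed 32x32 RGB grid and flatten.
        h, w, _ = image.shape
        ys = np.linspace(0, h - 1, 32).astype(int)
        xs = np.linspace(0, w - 1, 32).astype(int)
        patch = image[np.ix_(ys, xs)].astype(np.float32) / 255.0
        return patch.reshape(-1) @ self.img_proj

    def _encode_text(self, instruction):
        # Toy "tokenizer": bag of byte counts (a real model would use a language model).
        counts = np.zeros(256, dtype=np.float32)
        for b in instruction.lower().encode("utf-8"):
            counts[b] += 1.0
        return counts @ self.txt_proj

    def act(self, image, instruction):
        """Map (image, instruction) -> e.g. a 7-DoF [dx, dy, dz, droll, dpitch, dyaw, gripper] command."""
        fused = np.tanh(self._encode_image(image) + self._encode_text(instruction))
        return np.tanh(fused @ self.action_head)

# Usage: one observation in, one action out; closed-loop control would call act() every step.
policy = ToyVLAPolicy()
frame = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy camera frame
action = policy.act(frame, "pick up the red cube")
print(action.shape)  # (7,)
```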