We have been sharing VLA-related work for a while; here we also compile embodied Vision + Action (VA) work for you, covering robot manipulation, diffusion policies (DP), whole-body control, one-shot learning, sim2real, end-to-end approaches, and more. All of this content comes from the 具身智能之心 Knowledge Planet community. A minimal sketch of diffusion-policy-style action inference follows below for readers new to DP.
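The following is a minimal conceptual sketch of how a diffusion-policy-style controller turns an observation into an action chunk by iterative denoising. Everything in it (the stand-in noise_pred_net, the dimensions ACTION_DIM / HORIZON / OBS_DIM, the step count, and the simplified update rule) is an illustrative assumption, not code from any paper listed here.

import torch

# Illustrative dimensions (assumptions, not taken from any specific paper).
ACTION_DIM, HORIZON, OBS_DIM = 7, 16, 512

# Stand-in noise-prediction network: predicts the noise added to a flattened
# action chunk, conditioned on an observation embedding and the diffusion step.
noise_pred_net = torch.nn.Sequential(
    torch.nn.Linear(ACTION_DIM * HORIZON + OBS_DIM + 1, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, ACTION_DIM * HORIZON),
)

def sample_action_chunk(obs_emb: torch.Tensor, num_steps: int = 50) -> torch.Tensor:
    # Start from pure Gaussian noise and iteratively denoise it into a
    # HORIZON x ACTION_DIM action chunk. The update rule is deliberately
    # simplified; it is not a faithful DDPM/DDIM or flow-matching schedule.
    actions = torch.randn(ACTION_DIM * HORIZON)
    for k in reversed(range(num_steps)):
        step = torch.tensor([k / num_steps])
        predicted_noise = noise_pred_net(torch.cat([actions, obs_emb, step]))
        actions = actions - predicted_noise / num_steps
    return actions.view(HORIZON, ACTION_DIM)

chunk = sample_action_chunk(torch.zeros(OBS_DIM))
print(chunk.shape)  # torch.Size([16, 7])

In practice, the papers below replace this toy network with transformer or U-Net denoisers, condition on images, point clouds, or tactile signals, and use proper DDPM/DDIM or flow-matching solvers.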

Work from 2025
[2025] Steering Your Diffusion Policy with Latent Space Reinforcement Learning
[2025] [ByteDance Seed] Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation
[2025] [RSS 25] Unified Video Action Model
[2025] Streaming Flow Policy: Simplifying diffusion/flow-matching policies by treating action trajectories as flow trajectories
[2025] Modality-Composable Diffusion Policy via Inference-Time Distribution-level Composition
[2025] Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning
[2025] BEHAVIOR Robot Suite: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities
[2025] [RSS 25] Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation
[2025] Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics
[2025] You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations
[2025] ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills
[2025] VILP: Imitation Learning with Latent Video Planning
[2025] Learning the RoPEs: Better 2D and 3D Position Encodings with STRING
[2025] When Pre-trained Visual Representations Fall Short: Limitations in Visuo-Motor Robot Learning
[2025] RoboGrasp: A Universal Grasping Policy for Robust Robotic Control
[2025] CordViP: Correspondence-based Visuomotor Policy for Dexterous Manipulation in Real-World
[2025] Learning to Group and Grasp Multiple Objects
[2025] Beyond Behavior Cloning: Robustness through Interactive Imitation and Contrastive Learning
[2025] COMBO-Grasp: Learning Constraint-Based Manipulation for Bimanual Occluded Grasping
[2025] DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References
[2025] S2-Diffusion: Generalizing from Instance-level to Category-level Skills in Robot Manipulation
[2025] MTDP: Modulated Transformer Diffusion Policy Model
[2025] FUNCTO: Function-Centric One-Shot Imitation Learning for Tool Manipulation
[2025] RHINO: Learning Real-Time Humanoid-Human-Object Interaction from Human Demonstrations
[2025] Responsive Noise-Relaying Diffusion Policy: Responsive and Efficient Visuomotor Control
[2025] Learning a High-quality Robotic Wiping Policy Using Systematic Reward Analysis and Visual-Language Model Based Curriculum
[2025] IMLE Policy: Fast and Sample Efficient Visuomotor Policy Learning via Implicit Maximum Likelihood Estimation
[2025] X-IL: Exploring the Design Space of Imitation Learning Policies
[2025] Towards Fusing Point Cloud and Visual Representations for Imitation Learning
[2025] Pick-and-place Manipulation Across Grippers Without Retraining: A Learning-optimization Diffusion Policy Approach
[2025] FACTR: Force-Attending Curriculum Training for Contact-Rich Policy Learning
[2025] DemoGen: Synthetic Demonstration Generation for Data-Efficient Visuomotor Policy Learning
[2025] Human2Robot: Learning Robot Actions from Paired Human-Robot Videos
[2025] AnyDexGrasp: General Dexterous Grasping for Different Hands with Human-level Learning Efficiency
[2025] COMPASS: Cross-embOdiment Mobility Policy via ResiduAl RL and Skill Synthesis
[2025] Retrieval Dexterity: Efficient Object Retrieval in Clutters with Dexterous Hand
[2025] From Planning to Policy: Distilling Skill-RRT for Long-Horizon Prehensile and Non-Prehensile Manipulation
[2025] FetchBot: Object Fetching in Cluttered Shelves via Zero-Shot Sim2Real
[2025] Point Policy: Unifying Observations and Actions with Key Points for Robot Manipulation
[2025] FuseGrasp: Radar-Camera Fusion for Robotic Grasping of Transparent Objects
[2025] Sensor-Invariant Tactile Representation
[2025] Generalist World Model Pre-Training for Efficient Reinforcement Learning
[2025] ProDapt: Proprioceptive Adaptation using Long-term Memory Diffusion
[2025] Falcon: Fast Visuomotor Policies via Partial Denoising
[2025] HGDiffuser: Efficient Task-Oriented Grasp Generation via Human-Guided Grasp Diffusion Models
[2025] SHADOW: Leveraging Segmentation Masks for Cross-Embodiment Policy Transfer
[2025] Phantom: Training Robots Without Robots Using Only Human Videos
[2025] General Force Sensation for Tactile Robot
[2025] Action Tokenizer Matters in In-Context Imitation Learning
[2025] AVR: Active Vision-Driven Robotic Precision Manipulation with Viewpoint and Focal Length Optimization
[2025] FRMD: Fast Robot Motion Diffusion with Consistency-Distilled Movement Primitives for Smooth Action Generation
[2025] Variable-Friction In-Hand Manipulation for Arbitrary Objects via Diffusion-Based Imitation Learning
[2025] Learning Dexterous In-Hand Manipulation with Multifingered Hands via Visuomotor Diffusion
[2025] RGBSQGrasp: Inferring Local Superquadric Primitives from Single RGB Image for Graspability-Aware Bin Picking
[2025] ArticuBot: Learning Universal Articulated Object Manipulation Policy via Large Scale Simulation
[2025] SRSA: Skill Retrieval and Adaptation for Robotic Assembly Tasks
[2025] GAGrasp: Geometric Algebra Diffusion for Dexterous Grasping
[2025] OPG-Policy: Occluded Push-Grasp Policy Learning with Amodal Segmentation
[2025] RA-DP: Rapid Adaptive Diffusion Policy for Training-Free High-frequency Robotics Replanning
[2025] Robotic Compliant Object Prying Using Diffusion Policy Guided by Vision and Force Observations
[2025] CoinRobot: Generalized End-to-end Robotic Learning for Physical Intelligence
[2025] Persistent Object Gaussian Splat (POGS) for Tracking Human and Robot Manipulation of Irregularly Shaped Objects
[2025] How to Train Your Robots? The Impact of Demonstration Modality on Imitation Learning
[2025] One-Shot Dual-Arm Imitation Learning
[2025] GAT-Grasp: Gesture-Driven Affordance Transfer for Task-Aware Robotic Grasping
[2025] Enhanced View Planning for Robotic Harvesting: Tackling Occlusions with Imitation Learning
[2025] ES-Parkour: Advanced Robot Parkour with Bio-inspired Event Camera and Spiking Neural Network
[2025] NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models
[2025] World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning
[2025] RILe: Reinforced Imitation Learning
[2025] HumanoidPano: Hybrid Spherical Panoramic-LiDAR Cross-Modal Perception for Humanoid Robots
[2025] Distillation-PPO: A Novel Two-Stage Reinforcement Learning Framework for Humanoid Robot Perceptive Locomotion
[2025] Trinity: A Modular Humanoid Robot AI System
[2025] LiPS: Large-Scale Humanoid Robot Reinforcement Learning with Parallel-Series Structures
[2025] Elastic Motion Policy: An Adaptive Dynamical System for Robust and Efficient One-Shot Imitation Learning
[2025] Learning Gentle Grasping Using Vision, Sound, and Touch
[2025] RoboCopilot: Human-in-the-loop Interactive Imitation Learning for Robot Manipulation
[2025] Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework
[2025] MoE-Loco: Mixture of Experts for Multitask Locomotion
[2025] Humanoid Policy ~ Human Policy
[2025] Dense Policy: Bidirectional Autoregressive Learning of Actions
[2025] Learning to Play Piano in the Real World
[2025] CCDP: Composition of Conditional Diffusion Policies with Guided Sampling
[2025] DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation
[2025] AdaWorld: Learning Adaptable World Models with Latent Actions
[2025] Visuo-Tactile Object Pose Estimation for a Multi-Finger Robot Hand with Low-Resolution In-Hand Tactile Sensing
[2025] Empirical Analysis of Sim-and-Real Cotraining Of Diffusion Policies For Planar Pushing from Pixels
[2025] ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning
[2025] Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation
[2025] HACTS: a Human-As-Copilot Teleoperation System for Robot Learning
[2025] ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos
[2025] Learning Coordinated Bimanual Manipulation Policies using State Diffusion and Inverse Dynamics Models
[2025] Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
[2025] RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics
[2025] Slot-Level Robotic Placement via Visual Imitation from Single Human Video
[2025] Robust Dexterous Grasping of General Objects from Single-view Perception
[2025] Two by Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation
[2025] ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping
[2025] Novel Demonstration Generation with Gaussian Splatting Enables Robust One-Shot Manipulation
[2025] Grasping Deformable Objects via Reinforcement Learning with Cross-Modal Attention to Visuo-Tactile Inputs
[2025] Few-Shot Vision-Language Action-Incremental Policy Learning
[2025] Latent Diffusion Planning for Imitation Learning
[2025] Physically Consistent Humanoid Loco-Manipulation using Latent Diffusion Models
[2025] PRISM-DP: Spatial Pose-based Observations for Diffusion-Policies via Segmentation, Mesh Generation, and Pose Tracking
[2025] Rethinking Latent Representations in Behavior Cloning: An Information Bottleneck Approach for Robot Manipulation
[2025] Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation
[2025] Fast Flow-based Visuomotor Policies via Conditional Optimal Transport Couplings
[2025] KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation
[2025] CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations
[2025] H3DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning
[2025] UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations
[2025] Learning Long-Context Diffusion Policies via Past-Token Prediction
[2025] DataMIL: Selecting Data for Robot Imitation Learning with Datamodels
[2025] [ICLR 25] Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
[2025] IN-RIL: Interleaved Reinforcement and Imitation Learning for Policy Fine-Tuning
[2025] NVSPolicy: Adaptive Novel-View Synthesis for Generalizable Language-Conditioned Policy Learning
[2025] EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation
[2025] FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation
[2025] Conditioning Matters: Training Diffusion Policies is Faster Than You Think
[2025] H2R: A Human-to-Robot Data Augmentation for Robot Pre-training from Videos
[2025] GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation
[2025] Zero-Shot Visual Generalization in Robot Manipulation
[2025] Object-Centric Representations Improve Policy Generalization in Robot Manipulation
[2025] LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation
[2025] GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation
[2025] A Practical Guide for Incorporating Symmetry in Diffusion Policy
[2025] Adaptive Visuo-Tactile Fusion with Predictive Force Attention for Dexterous Manipulation
[2025] EquAct: An SE(3)-Equivariant Multi-Task Transformer for Open-Loop Robotic Manipulation
[2025] Spatial RoboGrasp: Generalized Robotic Grasping Control Policy
[2025] Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt
[2025] [AAAI 25] FlowPolicy: Enabling Fast and Robust 3D Flow-Based Policy via Consistency Flow Matching for Robot Manipulation
[2025] Object-centric 3D Motion Field for Robot Learning from Human Videos
[2025] Evaluating Robot Policies in a World Model
[2025] 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model
[2025] SpikePingpong: High-Frequency Spike Vision-based Robot Learning for Precise Striking in Table Tennis Game
[2025] SAIL: Faster-than-Demonstration Execution of Imitation Learning Policies
[2025] Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation
[2025] Touch begins where vision ends: Generalizable policies for contact-rich manipulation
[2025] AMPLIFY: Actionless Motion Priors for Robot Learning from Videos
[2025] GAF: Gaussian Action Field as a Dynamic World Model for Robotic Manipulation
[2025] Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation
[2025] Latent Action Diffusion for Cross-Embodiment Manipulation
[2025] Vision in Action: Learning Active Perception from Human Demonstrations
[2025] [IROS 25] Robust Instant Policy: Leveraging Student’s t-Regression Model for Robust In-context Imitation Learning of Robot Manipulation
[2025] [RSS 25] Dex1B: Learning with 1B Demonstrations for Dexterous Manipulation
[2025] DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy
[2025] World4Omni: A Zero-Shot Framework from Image Generation World Model to Robotic Manipulation
[2025] ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation
[2025] [ICCV 25] Spatial-Temporal Aware Visuomotor Diffusion Policy Learning
Work from 2024
[2024] Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching
[2024] Point Cloud Matters: Rethinking the Impact of Different Observation Spaces on Robot Learning
[2024] [RSS 24] 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
[2024] Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning
[2024] ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation
[2024] [ICLR 25] Diffusion Policy Policy Optimization
[2024] Language-Guided Object-Centric Diffusion Policy for Collision-Aware Robotic Manipulation
[2024] EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning
[2024] Equivariant Diffusion Policy
[2024] [IROS 25] Mamba Policy: Towards Efficient 3D Diffusion Policy with Hybrid Selective State Models
[2024] Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies
[2024] Motion Before Action: Diffusing Object Motion as Manipulation Condition
[2024] One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation
[2024] Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation
[2024] SPOT: SE(3) Pose Trajectory Diffusion for Object-Centric Manipulation
[2024] Few-Shot Task Learning through Inverse Generative Modeling
[2024] G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation
[2024] Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation
[2024] Diffusion Policy Attacker: Crafting Adversarial Attacks for Diffusion-based Policies
[2024] Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies
[2024] Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation
[2024] Data Scaling Laws in Imitation Learning for Robotic Manipulation
[2024] Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation
[2024] Learning Universal Policies via Text-Guided Video Generation
[2024] Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning
[2024] 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations
[2024] Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation
[2024] GenDP: 3D Semantic Fields for Category-Level Generalizable Diffusion Policy
[2024] Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation
[2024] Prediction with Action: Visual Policy Learning via Joint Denoising Process
[2024] Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
[2024] Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling
[2024] Streaming Diffusion Policy: Fast Policy Synthesis with Variable Noise Diffusion Models
[2024] CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction
[2024] In-Context Imitation Learning via Next-Token Prediction
[2024] Learning Diffusion Policies from Demonstrations For Compliant Contact-rich Manipulation
Work from 2023
[2023] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
[2023] Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods