Abstract: Robots' ability to follow language instructions and execute diverse 3D tasks is vital in robot learning. Traditional imitation learning-based methods perform well on seen tasks but struggle with novel, unseen ones due to variability. Recent approaches leverage large foundation models to assist in understanding novel tasks, thereby mitigating this issue. However, these methods lack a task-specific learning process, which is essential for an accurate understanding of 3D environments, often leading to execution failures. In this paper, we introduce GravMAD, a sub-goal-driven, language-conditioned action diffusion framework that combines the strengths of imitation learning and foundation models. Our approach breaks tasks into sub-goals based on language instructions, allowing auxiliary guidance during both training and inference. During training, we introduce Sub-goal Keypose Discovery to identify key sub-goals from demonstrations. Inference differs from training, as there are no demonstrations available, so we use pre-trained foundation models to bridge the gap and identify sub-goals for the current task. In both phases, GravMaps are generated from sub-goals, providing flexible 3D spatial guidance compared to fixed 3D positions. Empirical evaluations on RLBench show that GravMAD significantly outperforms state-of-the-art methods, with a 28.63% improvement on novel tasks and a 13.36% gain on tasks encountered during training. These results demonstrate GravMAD's strong multi-task learning and generalization in 3D manipulation. Video demonstrations are available at: https://gravmad.github.io

What problem does this paper attempt to address?

This paper addresses how to enable robots to perform diverse tasks in 3D environments from natural-language instructions while maintaining accuracy and generalization on novel, previously unseen tasks. Traditional imitation learning methods perform well on known tasks but degrade on new ones because of the variability of environments and tasks. Recent studies use large pre-trained foundation models to help understand new tasks, but these methods lack a task-specific learning process, which can lead to failures in understanding and executing 3D tasks. The paper therefore proposes a method that combines the advantages of imitation learning and foundation models to improve multi-task learning and generalization in 3D manipulation.

To this end, the authors introduce GravMAD (Grounded Spatial Value Maps Guided Action Diffusion), a sub-goal-driven, language-conditioned action diffusion framework. Its main features are:

1. **Sub-goal discovery**: During training, the "Sub-goal Keypose Discovery" method identifies key sub-goals from demonstration data; during inference, where no demonstrations are available, pre-trained foundation models identify the sub-goals instead.
2. **GravMap generation**: Grounded Spatial Value Maps (GravMaps) are generated from the sub-goals. These maps translate language instructions into sub-goals in 3D space and reflect the spatial relationships in the environment.
3. **Action diffusion guidance**: GravMaps guide the action diffusion process, in which the robot gradually denoises random noise into precise end-effector poses, conditioned on 3D visual observations, language instructions, and the GravMaps.

With this design, GravMAD performs well on tasks encountered during training and generalizes to new tasks significantly better than existing methods. On the RLBench benchmark, GravMAD improves on state-of-the-art methods by 28.63% on novel tasks and by 13.36% on tasks encountered during training. In summary, the paper aims to balance precise control and generalization in 3D manipulation by combining the advantages of imitation learning and foundation models.
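
To make the idea of a spatial value map more concrete, the following is a minimal sketch of a voxel grid whose cost falls to zero at a sub-goal position. It is not the paper's GravMap construction, which this summary does not describe. The distance-based cost, the function name `make_value_map`, the workspace bounds, and the grid resolution are all assumptions chosen for illustration.

```python
import numpy as np


def make_value_map(subgoal_xyz, bounds_min, bounds_max, resolution=20):
    """Hypothetical helper: build a (R, R, R) cost grid in [0, 1].

    The cost is the normalized Euclidean distance from each grid point
    to the sub-goal, so it is lowest at the grid point nearest the sub-goal.
    This is an illustrative stand-in, not the GravMap definition.
    """
    bounds_min = np.asarray(bounds_min, dtype=float)
    bounds_max = np.asarray(bounds_max, dtype=float)
    # Evenly spaced grid points along each axis, including both bounds.
    axes = [np.linspace(lo, hi, resolution) for lo, hi in zip(bounds_min, bounds_max)]
    # Grid of 3D coordinates with shape (R, R, R, 3).
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    # Distance from every grid point to the sub-goal.
    dist = np.linalg.norm(grid - np.asarray(subgoal_xyz, dtype=float), axis=-1)
    # Normalize so the farthest grid point has cost 1.
    return dist / dist.max()


# Assumed workspace: a 1 m cube; assumed sub-goal: a point inside it.
value_map = make_value_map(
    subgoal_xyz=[0.3, 0.0, 0.9],
    bounds_min=[-0.5, -0.5, 0.5],
    bounds_max=[0.5, 0.5, 1.5],
    resolution=11,
)
# Index of the lowest-cost grid point, i.e. the one closest to the sub-goal.
idx = np.unravel_index(np.argmin(value_map), value_map.shape)
```

A dense map like this gives a policy a graded signal over the whole workspace rather than a single target coordinate, which is one way to read the abstract's contrast between "flexible 3D spatial guidance" and "fixed 3D positions".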

GravMAD: Grounded Spatial Value Maps Guided Action Diffusion for Generalized 3D Manipulation

A Novel Robotic Grasping Method for Moving Objects Based on Multi-Agent Deep Reinforcement Learning

GR-MG: Leveraging Partially Annotated Data via Multi-Modal Goal Conditioned Policy

Pave the Way to Grasp Anything: Transferring Foundation Models for Universal Pick-Place Robots

Learning Generalizable 3D Manipulation With 10 Demonstrations

Ground4Act: Leveraging Visual-Language Model for Collaborative Pushing and Grasping in Clutter

Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations

Transferring Foundation Models for Generalizable Robotic Manipulation

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning

ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation

GaussianGrasper: 3D Language Gaussian Splatting for Open-vocabulary Robotic Grasping

A Joint Modeling of Vision-Language-Action for Target-oriented Grasping in Clutter

Planning-Guided Diffusion Policy Learning for Generalizable Contact-Rich Bimanual Manipulation

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

HACMan++: Spatially-Grounded Motion Primitives for Manipulation

Dynamic Open-Vocabulary 3D Scene Graphs for Long-term Language-Guided Mobile Manipulation

A Deep Learning Approach to Grasping the Invisible