GravMAD: Grounded Spatial Value Maps Guided Action Diffusion for Generalized 3D Manipulation

Yangtao Chen, Zixuan Chen, Junhui Yin, Jing Huo, Pinzhuo Tian, Jieqi Shi, Yang Gao
2024-10-06
Abstract: Robots' ability to follow language instructions and execute diverse 3D tasks is vital in robot learning. Traditional imitation learning-based methods perform well on seen tasks but struggle with novel, unseen ones due to variability. Recent approaches leverage large foundation models to assist in understanding novel tasks, thereby mitigating this issue. However, these methods lack a task-specific learning process, which is essential for an accurate understanding of 3D environments, often leading to execution failures. In this paper, we introduce GravMAD, a sub-goal-driven, language-conditioned action diffusion framework that combines the strengths of imitation learning and foundation models. Our approach breaks tasks into sub-goals based on language instructions, allowing auxiliary guidance during both training and inference. During training, we introduce Sub-goal Keypose Discovery to identify key sub-goals from demonstrations. Inference differs from training, as there are no demonstrations available, so we use pre-trained foundation models to bridge the gap and identify sub-goals for the current task. In both phases, GravMaps are generated from sub-goals, providing flexible 3D spatial guidance compared to fixed 3D positions. Empirical evaluations on RLBench show that GravMAD significantly outperforms state-of-the-art methods, with a 28.63% improvement on novel tasks and a 13.36% gain on tasks encountered during training. These results demonstrate GravMAD's strong multi-task learning and generalization in 3D manipulation. Video demonstrations are available at: https://gravmad.github.io
Robotics
What problem does this paper attempt to address?
This paper asks how a robot can carry out diverse manipulation tasks in 3D environments from natural-language instructions while staying accurate and generalizing to tasks it has never seen. Traditional imitation-learning methods perform well on seen tasks but degrade on novel ones because environments and tasks vary. Recent work uses large pre-trained foundation models to help interpret novel tasks, but these methods lack a task-specific learning process, so they often misjudge the 3D environment and fail during execution. The paper therefore combines imitation learning with foundation models to improve multi-task learning and generalization in 3D manipulation.

To this end, the authors introduce GravMAD (Grounded Spatial Value Maps Guided Action Diffusion), a sub-goal-driven, language-conditioned action diffusion framework. Its main components are:

1. **Sub-goal discovery**: During training, Sub-goal Keypose Discovery identifies key sub-goals from demonstrations. During inference, where no demonstrations are available, pre-trained foundation models identify the sub-goals for the current task.
2. **GravMap generation**: In both phases, Grounded Spatial Value Maps (GravMaps) are generated from the sub-goals. These maps ground the sub-goals derived from the language instruction in 3D space and reflect spatial relationships in the environment, giving more flexible guidance than fixed 3D positions.
3. **Action diffusion guidance**: GravMaps guide the action diffusion process, in which random noise is iteratively denoised into precise end-effector poses, conditioned on 3D visual observations, the language instruction, and the GravMap.

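
To make the contrast between a spatial value map and a single fixed 3D position more concrete, the following is a minimal toy sketch in plain Python. It is not the paper's GravMap construction, which this summary does not specify. The voxel grid, the linear distance-based cost, the `radius` parameter, and the function names `build_value_map`, `voxel_index`, and `cost_at` are all assumptions made for illustration.

```python
import math

def build_value_map(sub_goal, grid_size=10, workspace=1.0, radius=0.2):
    """Toy cost grid over a cubic workspace (an assumed form, not the paper's):
    0.0 inside the sub-goal region, rising linearly with distance outside it
    and clamped to 1.0."""
    cell = workspace / grid_size
    grid = []
    for i in range(grid_size):
        plane = []
        for j in range(grid_size):
            row = []
            for k in range(grid_size):
                # Cost is evaluated at the center of each voxel.
                center = ((i + 0.5) * cell, (j + 0.5) * cell, (k + 0.5) * cell)
                dist = math.dist(center, sub_goal)
                cost = 0.0 if dist <= radius else min(1.0, (dist - radius) / workspace)
                row.append(cost)
            plane.append(row)
        grid.append(plane)
    return grid

def voxel_index(point, grid_size=10, workspace=1.0):
    """Map a 3D point to the (i, j, k) index of the voxel containing it."""
    cell = workspace / grid_size
    return tuple(min(grid_size - 1, max(0, int(c / cell))) for c in point)

def cost_at(grid, point, workspace=1.0):
    """Look up the cost of the voxel that contains `point`."""
    i, j, k = voxel_index(point, len(grid), workspace)
    return grid[i][j][k]

# Hypothetical sub-goal position in a unit-cube workspace.
sub_goal = (0.55, 0.55, 0.55)
toy_map = build_value_map(sub_goal)

print(cost_at(toy_map, sub_goal))            # cost at the sub-goal itself
print(cost_at(toy_map, (0.65, 0.55, 0.55)))  # nearby pose, still inside the region
print(cost_at(toy_map, (0.05, 0.05, 0.05)))  # far corner of the workspace
```

In this toy version, every voxel within `radius` of the sub-goal has the same zero cost, so a downstream policy is steered toward a region rather than one exact coordinate. That is one way to read the "flexible 3D spatial guidance compared to fixed 3D positions" described in the abstract.
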
With this design, GravMAD performs well on tasks seen during training and generalizes to novel tasks significantly better than existing methods. On the RLBench benchmark, it outperforms state-of-the-art methods with a 28.63% improvement on novel tasks and a 13.36% gain on tasks encountered during training. In short, the paper targets the trade-off between precise control and generalization in 3D manipulation by combining the strengths of imitation learning and foundation models.