DNAct: Diffusion Guided Multi-Task 3D Policy Learning

Ge Yan, Yueh-Hua Wu, Xiaolong Wang
2024-03-08
Abstract: This paper presents DNAct, a language-conditioned multi-task policy framework that integrates neural rendering pre-training and diffusion training to enforce multi-modality learning in action sequence spaces. To learn a generalizable multi-task policy with few demonstrations, the pre-training phase of DNAct leverages neural rendering to distill 2D semantic features from foundation models such as Stable Diffusion to a 3D space, which provides a comprehensive semantic understanding regarding the scene. Consequently, it allows various applications to challenging robotic tasks requiring rich 3D semantics and accurate geometry. Furthermore, we introduce a novel approach utilizing diffusion training to learn a vision and language feature that encapsulates the inherent multi-modality in the multi-task demonstrations. By reconstructing the action sequences from different tasks via the diffusion process, the model is capable of distinguishing different modalities and thus improving the robustness and the generalizability of the learned representation. DNAct significantly surpasses SOTA NeRF-based multi-task manipulation approaches with over 30% improvement in success rate. Project website:
Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problems this paper attempts to solve fall into four areas:

1. **Generalization in multi-task robot manipulation**
   - Learning a multi-task manipulation policy that generalizes from a small number of demonstrations in complex environments is a major challenge. Traditional methods usually require large-scale datasets to obtain a comprehensive 3D semantic understanding, which is difficult to collect in the real world.
   - DNAct aims to overcome these limitations through neural rendering pre-training and diffusion training, enabling the model to learn from few demonstrations and perform well on unseen tasks and scenarios.
2. **Handling multi-modal trajectories**
   - In multi-task demonstrations, trajectories from different tasks may look similar, yet they are also multi-modal. For example, when avoiding obstacles in a kitchen to pick up a knife, several different paths may be equally valid.
   - DNAct uses diffusion training to capture these multi-modal trajectories, improving the robustness and generalization of the model so that it is not biased towards a single mode.
3. **3D semantic and geometric understanding**
   - To perform fine-grained manipulation in a complex environment, the robot needs a comprehensive geometric and semantic understanding of the scene. Although existing NeRF methods perform well in single-task settings, they lack semantic understanding in complex environments.
   - DNAct builds a 3D semantic representation that carries common-sense priors by distilling semantic features from 2D foundation models into 3D space, improving its understanding of complex environments.
4. **Reducing inference time and the number of parameters**
   - Although diffusion models are widely used for generative modeling, their long inference time limits their use in real-time robot applications.
   - DNAct reduces inference time by combining diffusion training with a policy network, and achieves better performance with fewer parameters, making it more suitable for practical robot tasks.

In summary, this paper addresses generalization, multi-modal trajectory handling, 3D semantic understanding, and efficient inference in multi-task robot manipulation by combining neural rendering pre-training with diffusion training, thereby enhancing the robot's manipulation ability in complex environments.
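To make the diffusion-training idea in items 2 and 4 more concrete, here is a minimal sketch of a generic DDPM-style noise-prediction objective over an action sequence. It is not DNAct's implementation: the function names, the linear noise schedule, the array shapes, and the `predict_noise` callable (standing in for a network conditioned on a fused vision-language feature) are all assumptions made for illustration.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule; the paper's schedule may differ."""
    return np.linspace(beta_start, beta_end, num_steps)


def q_sample(actions, t, alpha_bar, noise):
    """Forward diffusion: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * actions + np.sqrt(1.0 - alpha_bar[t]) * noise


def noise_prediction_loss(predict_noise, actions, feature, t, alpha_bar, rng):
    """Mean squared error between the true and the predicted noise.

    predict_noise(noisy_actions, t, feature) is an assumed interface for a
    denoising network conditioned on a scene/language feature.
    """
    noise = rng.standard_normal(actions.shape)
    noisy_actions = q_sample(actions, t, alpha_bar, noise)
    predicted = predict_noise(noisy_actions, t, feature)
    return float(np.mean((predicted - noise) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = np.cumprod(1.0 - linear_beta_schedule(100))
    # Assumed shape: 8 future steps of a 7-dimensional action.
    actions = rng.standard_normal((8, 7))
    # Placeholder predictor that always outputs zeros.
    dummy_predictor = lambda noisy, t, feature: np.zeros_like(noisy)
    print(noise_prediction_loss(dummy_predictor, actions, None, 50, alpha_bar, rng))
```

In a setup of this kind, the denoising objective would only shape the shared feature during training, while a separate policy head predicts actions in a single forward pass at inference. That is one plausible reading of how combining diffusion training with a policy network avoids the iterative sampling cost the summary mentions.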