Abstract: This paper presents DNAct, a language-conditioned multi-task policy framework that integrates neural rendering pre-training and diffusion training to enforce multi-modality learning in action sequence spaces. To learn a generalizable multi-task policy with few demonstrations, the pre-training phase of DNAct leverages neural rendering to distill 2D semantic features from foundation models such as Stable Diffusion to a 3D space, which provides a comprehensive semantic understanding regarding the scene. Consequently, it allows various applications to challenging robotic tasks requiring rich 3D semantics and accurate geometry. Furthermore, we introduce a novel approach utilizing diffusion training to learn a vision and language feature that encapsulates the inherent multi-modality in the multi-task demonstrations. By reconstructing the action sequences from different tasks via the diffusion process, the model is capable of distinguishing different modalities and thus improving the robustness and the generalizability of the learned representation. DNAct significantly surpasses SOTA NeRF-based multi-task manipulation approaches with over 30% improvement in success rate. Project website:
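
The pre-training idea in the abstract, distilling 2D foundation-model features into 3D via neural rendering, can be illustrated with a small sketch. This is not DNAct's implementation: it assumes standard NeRF-style alpha compositing along a single ray and a plain mean-squared-error distillation loss, and the names `render_features`, `distillation_loss`, and `teacher_feature` are hypothetical. Random arrays stand in for the network's predicted densities and features and for the 2D teacher feature.

```python
import numpy as np

def render_features(densities, features, deltas):
    """Alpha-composite per-sample features along one ray (NeRF-style quadrature).

    densities: (S,) non-negative volume densities at S samples along the ray
    features:  (S, D) feature vector predicted at each sample
    deltas:    (S,) length of the ray segment belonging to each sample
    """
    alpha = 1.0 - np.exp(-densities * deltas)  # opacity of each segment
    # Transmittance: fraction of the ray that reaches sample i unabsorbed.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha                    # (S,) compositing weights
    return weights @ features, weights         # (D,) rendered feature, (S,) weights

def distillation_loss(rendered, target):
    """Mean squared error between the rendered feature and the 2D teacher feature."""
    return float(np.mean((rendered - target) ** 2))

# Toy data (assumption: stands in for model outputs and a foundation-model feature).
rng = np.random.default_rng(0)
S, D = 16, 8
densities = rng.uniform(0.0, 5.0, size=S)
features = rng.normal(size=(S, D))
deltas = np.full(S, 0.1)
teacher_feature = rng.normal(size=D)

rendered, weights = render_features(densities, features, deltas)
loss = distillation_loss(rendered, teacher_feature)
```

In this sketch the compositing weights are non-negative and sum to at most one, so the rendered feature is dominated by the samples near the first opaque surface the ray hits. Minimizing the loss over many rays would push the 3D feature field to agree with the 2D features from every viewpoint.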

What problem does this paper attempt to address?

The problems this paper attempts to solve fall into four areas:

1. **Generalization in multi-task robot manipulation**
   - In complex environments, learning a multi-task policy that generalizes from a small number of demonstrations is a major challenge. Traditional methods usually require large-scale datasets to obtain comprehensive 3D semantic understanding, which is difficult to collect in the real world.
   - DNAct aims to overcome these limitations through neural rendering pre-training and diffusion training, enabling the model to learn from a small number of demonstrations and perform well in unseen tasks and scenarios.
2. **Handling multi-modal trajectories**
   - In multi-task demonstrations, trajectories from different tasks may share similarities, but they are also multi-modal. For example, when avoiding obstacles in a kitchen to pick up a knife, several different paths may be valid.
   - DNAct uses diffusion training to identify these multi-modal trajectories, improving the robustness and generalization of the model and keeping it from being biased towards a single mode.
3. **3D semantic and geometric understanding**
   - To perform fine-grained manipulation in a complex environment, the robot needs a comprehensive geometric and semantic understanding of the scene. Although existing NeRF methods perform well in single-task settings, they lack semantic understanding in complex environments.
   - DNAct builds a 3D semantic representation containing common-sense priors by distilling semantic features from 2D foundation models into 3D space, improving its understanding of complex environments.
4. **Reducing inference time and the number of parameters**
   - Although diffusion models are widely used for generative modeling, their long inference time limits their use in real-time robot tasks.
   - DNAct reduces inference time by combining diffusion training with a policy network, and achieves better performance with fewer parameters, making it more suitable for practical robot tasks.

In summary, this paper aims to solve the problems of generalization, multi-modal trajectory handling, 3D semantic understanding, and efficient inference in multi-task robot manipulation by combining neural rendering pre-training with diffusion training, thereby enhancing the robot's manipulation ability in complex environments.
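
The diffusion training described above can be sketched as a training-time loss on action sequences. This is a minimal illustration and not the paper's method: it assumes a standard DDPM-style noise-prediction objective with a linear beta schedule, which the summary does not specify. The names `make_alpha_bars`, `q_sample`, `noise_prediction_loss`, and `zero_predictor`, as well as the array sizes, are hypothetical, and the placeholder predictor stands in for a learned network conditioned on the vision and language feature.

```python
import numpy as np

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(actions, t, alpha_bars, noise):
    """Forward process: noise a clean action sequence to diffusion step t."""
    a = alpha_bars[t]
    return np.sqrt(a) * actions + np.sqrt(1.0 - a) * noise

def noise_prediction_loss(predict_noise, actions, feature, t, alpha_bars, rng):
    """MSE between the true injected noise and the predictor's estimate."""
    noise = rng.normal(size=actions.shape)
    noisy = q_sample(actions, t, alpha_bars, noise)
    pred = predict_noise(noisy, t, feature)
    return float(np.mean((pred - noise) ** 2))

def zero_predictor(noisy_actions, t, feature):
    """Placeholder for a learned denoiser conditioned on the scene feature."""
    return np.zeros_like(noisy_actions)

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars(100)
actions = rng.normal(size=(8, 7))   # hypothetical: 8 timesteps of 7-D actions
feature = rng.normal(size=32)       # hypothetical vision-language feature
diffusion_loss = noise_prediction_loss(
    zero_predictor, actions, feature, t=50, alpha_bars=alpha_bars, rng=rng
)
```

Under this reading, the loss only shapes the shared feature during training. At deployment a separate policy network would map the feature to actions in one forward pass, which is presumably how the iterative denoising cost is avoided.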

DNAct: Diffusion Guided Multi-Task 3D Policy Learning

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

3D Diffuser Actor: Policy Diffusion with 3D Scene Representations

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Language-Guided Object-Centric Diffusion Policy for Collision-Aware Robotic Manipulation

GenDP: 3D Semantic Fields for Category-Level Generalizable Diffusion Policy

Diffusion Transformer Policy

Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation

Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training

Prediction with Action: Visual Policy Learning via Joint Denoising Process

Multi-task Manipulation Policy Modeling with Visuomotor Latent Diffusion

Diffusion Co-Policy for Synergistic Human-Robot Collaborative Tasks

NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration

Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals

One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation

Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning

Enhancing Exploration with Diffusion Policies in Hybrid Off-Policy RL: Application to Non-Prehensile Manipulation

Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient

Task-agnostic Pre-training and Task-guided Fine-tuning for Versatile Diffusion Planner