Abstract: Learning visuomotor policy for multi-task robotic manipulation has been a long-standing challenge for the robotics community. The difficulty lies in the diversity of action space: typically, a goal can be accomplished in multiple ways, resulting in a multimodal action distribution for a single task. The complexity of action distribution escalates as the number of tasks increases. In this work, we propose Discrete Policy, a robot learning method for training universal agents capable of multi-task manipulation skills. Discrete Policy employs vector quantization to map action sequences into a discrete latent space, facilitating the learning of task-specific codes. These codes are then reconstructed into the action space conditioned on observations and language instruction. We evaluate our method on both simulation and multiple real-world embodiments, including both single-arm and bimanual robot settings. We demonstrate that our proposed Discrete Policy outperforms a well-established Diffusion Policy baseline and many state-of-the-art approaches, including ACT, Octo, and OpenVLA. For example, in a real-world multi-task training setting with five tasks, Discrete Policy achieves an average success rate that is 26% higher than Diffusion Policy and 15% higher than OpenVLA. As the number of tasks increases to 12, the performance gap between Discrete Policy and Diffusion Policy widens to 32.5%, further showcasing the advantages of our approach. Our work empirically demonstrates that learning multi-task policies within the latent space is a vital step toward achieving general-purpose agents.

What problem does this paper attempt to address?

This paper addresses the difficulty that a complex and diverse action space poses for multi-task robot manipulation. Traditional robot systems usually focus on a specific task, but in modern dynamic environments robots need the versatility to adapt to varied situations. The action distributions of individual tasks are often multimodal, and they become more complex and entangled as the number of tasks increases, which makes it hard to learn and execute many tasks with a single policy.

To address this challenge, the authors propose a method named "Discrete Policy", which aims to disentangle the action space in multi-task robot manipulation through discrete-policy learning. Discrete Policy uses vector quantization to map action sequences to a discrete latent space, which facilitates the learning of task-specific codes. These codes are then reconstructed into the action space, conditioned on observations and language instructions. In this way, Discrete Policy handles complex, multimodal action distributions more effectively and performs well in multi-task environments.

### Key questions

1. **Can it be effectively deployed in real-world scenarios?**
2. **Can it be extended to multiple complex tasks?**
3. **Can it effectively distinguish behavioral patterns in different tasks?**

### Method overview

Discrete Policy is trained in two phases:

1. **Training phase 1**: A Vector-Quantized Variational Auto-Encoder (VQ-VAE) encodes complex actions into a discrete latent space, and a decoder reconstructs the actions from it.
2. **Training phase 2**: A conditional diffusion model generates task-specific latent embeddings that guide the decoder to execute the appropriate action pattern.

### Experimental results

Experiments show that Discrete Policy significantly outperforms strong existing baselines, such as Diffusion Policy and OpenVLA, on multiple tasks. The advantage grows as the number of tasks increases. For example, in a real-world multi-task training setting with 5 tasks, the average success rate of Discrete Policy is 26% higher than that of Diffusion Policy and 15% higher than that of OpenVLA. When the number of tasks increases to 12, the gap between Discrete Policy and Diffusion Policy widens to 32.5%.

### Conclusion

Discrete Policy provides a new method for learning multi-task robot control policies and achieves better disentanglement of feature representations in complex multi-task environments. Extensive simulation and real-world experiments demonstrate its superior performance in multi-task settings.
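The vector-quantization step named in the method overview can be illustrated with a minimal sketch. It shows only the nearest-neighbor codebook lookup that a VQ-VAE performs on an encoder output. It is not the authors' implementation: the function names, the 2-D latent, and the three-entry codebook are hypothetical values chosen for illustration, and a real codebook would be learned jointly with the encoder and decoder.

```python
from typing import Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent: Vector, codebook: Sequence[Vector]) -> Tuple[int, Vector]:
    """Return the index and value of the codebook entry nearest to `latent`."""
    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(latent, codebook[i]),
    )
    return index, codebook[index]


# Hypothetical codebook with three 2-D codes (assumption: learned in practice).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]

# Hypothetical encoder output for one action sequence.
latent = (0.9, 1.2)

index, code = quantize(latent, codebook)
print(index, code)  # prints: 1 (1.0, 1.0)
```

The returned index is the discrete code. In the two-phase scheme summarized above, a decoder would reconstruct actions from the selected code, and the second-phase model would be trained to produce such latents from observations and language instructions.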

Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation

RoLD: Robot Latent Diffusion for Multi-task Policy Modeling

Multi-task Manipulation Policy Modeling with Visuomotor Latent Diffusion

Leveraging the Efficiency of Multi-Task Robot Manipulation Via Task-Evoked Planner and Reinforcement Learning

Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training

Robust and High-Precision End-to-End Control Policy for Multi-stage Manipulation Task with Behavioral Cloning

Diffusion Co-Policy for Synergistic Human-Robot Collaborative Tasks

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

PoCo: Policy Composition from and for Heterogeneous Robot Learning

A Comparative Study on State-Action Spaces for Learning Viewpoint Selection and Manipulation with Diffusion Policy

Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Enhancing Exploration with Diffusion Policies in Hybrid Off-Policy RL: Application to Non-Prehensile Manipulation

Polybot: Training One Policy Across Robots While Embracing Variability

Hierarchical Visual Policy Learning for Long-Horizon Robot Manipulation in Densely Cluttered Scenes

Diff-DAgger: Uncertainty Estimation with Diffusion Policy for Robotic Manipulation

Multi-task Learning with Gradient Guided Policy Specialization

Learning to Look: Seeking Information for Decision Making via Policy Factorization