Abstract: Diffusion Policy is a powerful technique for learning end-to-end visuomotor robot control. Diffusion Policy is expected to possess scalability, a key attribute of deep neural networks, meaning that increasing model size should lead to better performance. However, our observations indicate that Diffusion Policy in the transformer architecture (DP-T) struggles to scale effectively; even minor additions of layers can deteriorate training outcomes. To address this issue, we introduce Scalable Diffusion Transformer Policy for visuomotor learning. Our proposed method, namely ScaleDP, introduces two modules that improve the training dynamics of Diffusion Policy and allow the network to better handle multimodal action distributions. First, we identify that DP-T suffers from large gradient issues, making the optimization of Diffusion Policy unstable. To resolve this issue, we factorize the feature embedding of the observation into multiple affine layers and integrate it into the transformer blocks. Additionally, we utilize non-causal attention, which allows the policy network to "see" future actions during prediction, helping to reduce compounding errors. We demonstrate that our proposed method successfully scales the Diffusion Policy from 10 million to 1 billion parameters, with performance and generalization improving as model size grows. We benchmark ScaleDP across 50 different tasks from MetaWorld and find that our largest ScaleDP outperforms DP-T with an average improvement of 21.6%. Across 7 real-world robot tasks, ScaleDP demonstrates an average improvement of 36.25% over DP-T on four single-arm tasks and 75% on three bimanual tasks. We believe our work paves the way for scaling up models for visuomotor learning.
The project page is available at http://scaling-diffusion-policy.github.io

What problem does this paper attempt to address?

The paper addresses the poor scalability of Diffusion Policy in the Transformer architecture (DP-T). Specifically, the study found that the performance of DP-T decreases rather than improves as the number of model layers or heads increases, which indicates that the existing design is difficult to scale effectively. To solve this problem, the researchers propose the Scalable Diffusion Transformer Policy (ScaleDP), which improves training dynamics and enables the network to better handle multimodal action distributions through two key modules:

1. **Gradient Problem Resolution**: DP-T suffers from large gradients, which make optimization unstable. To address this, ScaleDP factorizes the observation feature embedding into multiple affine layers and integrates them into the Transformer blocks.
2. **Non-Causal Attention Mechanism**: Non-causal attention allows the policy network to "see" future actions during prediction, thereby reducing compounding errors.

Experimental results show that ScaleDP scales from 10 million to 1 billion parameters and outperforms the baseline DP-T: by 21.6% on average across 50 MetaWorld tasks, and by 36.25% on four single-arm tasks and 75% on three bimanual tasks in real-world experiments. Additionally, as the model size increases, the model exhibits better data adaptability and generalization ability. ScaleDP also shows stronger visual generalization, maintaining a high success rate under different backgrounds, lighting conditions, and various disturbances.
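The two modules can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The summary does not say how the affine layers enter the Transformer blocks, so the sketch assumes an adaptive-layer-norm-style scale and shift computed from the observation embedding. All function and parameter names (`modulated_norm`, `attention_mask`, `w_scale`, and so on) are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(w, b, c):
    """Affine map y = W c + b, with W given as a list of rows."""
    return [sum(wi * ci for wi, ci in zip(row, c)) + bi for row, bi in zip(w, b)]

def modulated_norm(x, cond, w_scale, b_scale, w_shift, b_shift):
    """Condition an action token x on the observation embedding cond.

    Assumption (not confirmed by the source): scale and shift each come from
    their own affine layer applied to cond, as in adaptive layer norm.
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    return [n * (1.0 + s) + t for n, s, t in zip(layer_norm(x), scale, shift)]

def attention_mask(num_tokens, causal):
    """mask[i][j] is True when action token i may attend to action token j."""
    return [[(j <= i) or not causal for j in range(num_tokens)]
            for i in range(num_tokens)]

def masked_softmax(scores, mask_row):
    """Softmax over allowed positions only; masked positions get weight 0."""
    peak = max(s for s, m in zip(scores, mask_row) if m)
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

# With a causal mask the first action token attends only to itself;
# with a non-causal mask it also attends to the later (future) action tokens.
causal_weights = masked_softmax([0.0, 0.0, 0.0], attention_mask(3, causal=True)[0])
full_weights = masked_softmax([0.0, 0.0, 0.0], attention_mask(3, causal=False)[0])
```

In this sketch, `causal_weights` puts all attention mass on the first token, while `full_weights` spreads it evenly over the whole action sequence. That is the sense in which a non-causal policy can "see" future actions. Setting all scale and shift parameters to zero reduces `modulated_norm` to a plain layer norm, so the conditioning acts as a learned perturbation on top of the normalized features.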

Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation

The Ingredients for Robotic Diffusion Transformers

One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation

Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping

Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

Scaling Diffusion Transformers to 16 Billion Parameters

Brain-inspired Action Generation with Spiking Transformer Diffusion Policy Model

Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation

Modular Deep Q Networks for Sim-to-real Transfer of Visuo-motor Policies

EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning

Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers

Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation

AffordDP: Generalizable Diffusion Policy with Transferable Affordance

Unifying Diffusion Models with Action Detection Transformers for Multi-task Robotic Manipulation

Data Scaling Laws in Imitation Learning for Robotic Manipulation