Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation

Minjie Zhu,Yichen Zhu,Jinming Li,Junjie Wen,Zhiyuan Xu,Ning Liu,Ran Cheng,Chaomin Shen,Yaxin Peng,Feifei Feng,Jian Tang
2024-09-22
Abstract: Diffusion Policy is a powerful technique for learning end-to-end visuomotor robot control. Diffusion Policy is expected to be scalable, a key attribute of deep neural networks, meaning that increasing model size should lead to better performance. However, our observations indicate that Diffusion Policy in the transformer architecture (DP-T) struggles to scale effectively; even minor additions of layers can deteriorate training outcomes. To address this issue, we introduce the Scalable Diffusion Transformer Policy for visuomotor learning. Our proposed method, ScaleDP, introduces two modules that improve the training dynamics of Diffusion Policy and allow the network to better handle multimodal action distributions. First, we identify that DP-T suffers from large gradient issues, which make the optimization of Diffusion Policy unstable. To resolve this issue, we factorize the feature embedding of the observation into multiple affine layers and integrate it into the transformer blocks. Additionally, we utilize non-causal attention, which allows the policy network to "see" future actions during prediction, helping to reduce compounding errors. We demonstrate that our proposed method successfully scales Diffusion Policy from 10 million to 1 billion parameters, with performance and generalization improving as model size grows. We benchmark ScaleDP across 50 different tasks from MetaWorld and find that our largest ScaleDP outperforms DP-T with an average improvement of 21.6%. Across 7 real-world robot tasks, ScaleDP demonstrates an average improvement of 36.25% over DP-T on four single-arm tasks and 75% on three bimanual tasks. We believe our work paves the way for scaling up models for visuomotor learning.
The project page is available at http://scaling-diffusion-policy.github.io
Robotics
What problem does this paper attempt to address?
The paper addresses the poor scalability of Diffusion Policy in the Transformer architecture. Specifically, the study found that the performance of the standard transformer-based Diffusion Policy (DP-T) decreases rather than improves as the number of layers or attention heads increases, which indicates that the existing design is difficult to scale effectively. To solve this problem, the researchers propose the Scalable Diffusion Transformer Policy (ScaleDP), which improves training dynamics and enables the network to better handle multimodal action distributions by introducing two key modules:

1. **Gradient Problem Resolution**: The study found that DP-T suffers from large gradients, leading to unstable optimization. To address this, the proposed method factorizes the observation feature embedding into multiple affine layers and integrates them into the Transformer blocks.
2. **Non-Causal Attention Mechanism**: By using non-causal attention, the policy network is allowed to "see" future actions during prediction, thereby reducing compounding errors.

Experimental results show that ScaleDP successfully scales from 10 million to 1 billion parameters and outperforms the DP-T baseline across the evaluated tasks: an average improvement of 21.6% on 50 MetaWorld tasks, and on 7 real-world robot tasks an average improvement of 36.25% on four single-arm tasks and 75% on three bimanual tasks. Performance and generalization also improve as model size increases. The summary further reports stronger visual generalization, with ScaleDP maintaining a high success rate under different backgrounds, lighting conditions, and various disturbances.
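The two modules can be illustrated with a small NumPy sketch. It is not the authors' implementation: the text above does not specify the layer design, so the scale-and-shift modulation of normalized action tokens (in the style of adaptive layer normalization), the names `affine_condition`, `w_scale`, `w_shift`, and all dimensions are assumptions made for illustration. The attention function uses identity projections only to make the effect of the causal mask visible.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def affine_condition(x, obs_emb, w_scale, w_shift):
    """Assumed conditioning: the observation embedding is mapped by two
    linear layers to a per-feature scale and shift, which modulate the
    normalized action tokens inside a block."""
    scale = obs_emb @ w_scale  # shape (dim,)
    shift = obs_emb @ w_shift  # shape (dim,)
    return layer_norm(x) * (1.0 + scale) + shift

def attention(x, causal):
    """Single-head self-attention with identity projections (illustration only)."""
    t, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    if causal:
        # Causal mask: token i may not attend to tokens j > i.
        mask = np.triu(np.ones((t, t), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

# Hypothetical sizes: 4 action tokens of width 8, observation embedding of width 6.
rng = np.random.default_rng(0)
horizon, dim, obs_dim = 4, 8, 6
actions = rng.normal(size=(horizon, dim))
obs_emb = rng.normal(size=(obs_dim,))
w_scale = rng.normal(size=(obs_dim, dim)) * 0.1
w_shift = rng.normal(size=(obs_dim, dim)) * 0.1

conditioned = affine_condition(actions, obs_emb, w_scale, w_shift)
_, causal_w = attention(conditioned, causal=True)
_, full_w = attention(conditioned, causal=False)

# With the causal mask, attention to later action tokens is exactly zero;
# without it, the first token also attends to future actions.
print(np.round(causal_w[0], 3))
print(np.round(full_w[0], 3))
```

In the non-causal case every action token in the predicted sequence attends to every other token, which is the sense in which the policy "sees" future actions during prediction.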