The Ingredients for Robotic Diffusion Transformers

Sudeep Dasari, Oier Mees, Sebastian Zhao, Mohan Kumar Srirama, Sergey Levine
2024-10-14
Abstract: In recent years roboticists have achieved remarkable progress in solving increasingly general tasks on dexterous robotic hardware by leveraging high-capacity Transformer network architectures and generative diffusion models. Unfortunately, combining these two orthogonal improvements has proven surprisingly difficult, since there is no clear and well-understood process for making important design choices. In this paper, we identify, study and improve key architectural design decisions for high-capacity diffusion transformer policies. The resulting models can efficiently solve diverse tasks on multiple robot embodiments, without the excruciating pain of per-setup hyper-parameter tuning. By combining the results of our investigation with our improved model components, we are able to present a novel architecture, named DiT-Block Policy, that significantly outperforms the state of the art in solving long-horizon (1500+ time-steps) dexterous tasks on a bi-manual ALOHA robot. In addition, we find that our policies show improved scaling performance when trained on 10 hours of highly multi-modal, language-annotated ALOHA demonstration data. We hope this work will open the door for future robot learning techniques that combine the efficiency of generative diffusion modeling with the scalability of large-scale transformer architectures. Code, robot dataset, and videos are available at: https://dit-policy.github.io
Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses the lack of a clear, well-understood process for making design choices when combining high-capacity Transformer network architectures with generative diffusion models to solve increasingly general tasks on dexterous robotic hardware. Specifically, it identifies, studies, and improves key architectural design decisions for high-capacity diffusion Transformer policies, so that the resulting models can efficiently solve diverse tasks on multiple robot embodiments without painful per-setup hyperparameter tuning. By combining these findings with improved model components, the paper proposes a new architecture called DiT-Block Policy, which significantly outperforms the state of the art on long-horizon (1500+ time steps) dexterous tasks on the dual-arm ALOHA robot. The study also finds that the policy scales better when trained on 10 hours of highly multimodal, language-annotated ALOHA demonstration data. The main contributions of the paper include:
1. **Scalable Attention Blocks**: Adaptive layer normalization (adaLN) blocks are added to the diffusion Transformer policy layers to stabilize training. This simple change improves performance by over 30% on long-horizon, dexterous real-world manipulation tasks.
2. **Efficient Observation Tokenization**: Several methods for tokenizing multiple camera observations are compared, including Vision Transformer and ResNet encoders. A relatively simple implementation (ResNet image tokenizer + Transformer policy) provides over 40% performance improvement compared to other strategies.
3. **DiT-Block Policy**: The best-performing components are integrated into a unified framework named DiT-Block Policy. This model achieves state-of-the-art performance on both the low-cost dual-arm ALOHA robot and the single-arm DROID Franka setup.
4. **Open Source Models and Data**: All data, code, and models are made available to the community, including a new language-annotated dataset, BiPlay, which contains 7023 clips of dexterous dual-arm manipulation tasks.
Through these contributions, the paper aims to pave the way for future robot learning techniques that combine the efficiency of generative diffusion modeling with the scalability of large-scale Transformer architectures.
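To make the adaLN idea from the first contribution concrete, here is a minimal NumPy sketch of adaptive layer normalization in its generic form: the tokens are normalized, then scaled and shifted by values regressed from a conditioning vector. This is an illustration under stated assumptions, not the paper's implementation. The class name `AdaLN`, the dimensions, the zero initialization, and the use of a single linear map are all hypothetical choices; in a diffusion policy the conditioning vector would typically encode something like the diffusion timestep, which is also an assumption here.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize the last axis to zero mean and unit variance (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class AdaLN:
    """Hypothetical adaptive LayerNorm: scale and shift come from a conditioning vector."""

    def __init__(self, dim, cond_dim):
        self.dim = dim
        # Assumption: zero initialization, so the layer starts as a plain LayerNorm.
        self.weight = np.zeros((cond_dim, 2 * dim))
        self.bias = np.zeros(2 * dim)

    def __call__(self, x, cond):
        # x: (num_tokens, dim), cond: (cond_dim,)
        modulation = cond @ self.weight + self.bias  # shape (2 * dim,)
        scale = modulation[: self.dim]
        shift = modulation[self.dim:]
        # Broadcast the same scale and shift across every token.
        return layer_norm(x) * (1.0 + scale) + shift


rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))   # 4 tokens with 8 features each (illustrative sizes)
cond = rng.normal(size=16)         # stand-in for a conditioning embedding
ada_ln = AdaLN(dim=8, cond_dim=16)
out = ada_ln(tokens, cond)
```

With the zero initialization assumed above, `out` equals `layer_norm(tokens)`; once `weight` and `bias` are trained, the conditioning vector modulates every token's normalized features.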