Abstract: In recent years, roboticists have achieved remarkable progress in solving increasingly general tasks on dexterous robotic hardware by leveraging high-capacity Transformer network architectures and generative diffusion models. Unfortunately, combining these two orthogonal improvements has proven surprisingly difficult, since there is no clear and well-understood process for making important design choices. In this paper, we identify, study and improve key architectural design decisions for high-capacity diffusion transformer policies. The resulting models can efficiently solve diverse tasks on multiple robot embodiments, without the excruciating pain of per-setup hyper-parameter tuning. By combining the results of our investigation with our improved model components, we are able to present a novel architecture, named DiT-Block Policy, that significantly outperforms the state of the art in solving long-horizon (1500+ time-steps) dexterous tasks on a bi-manual ALOHA robot. In addition, we find that our policies show improved scaling performance when trained on 10 hours of highly multi-modal, language-annotated ALOHA demonstration data. We hope this work will open the door for future robot learning techniques that leverage the efficiency of generative diffusion modeling with the scalability of large-scale transformer architectures. Code, robot dataset, and videos are available at: https://dit-policy.github.io

What problem does this paper attempt to address?

The paper addresses the lack of a clear, well-understood process for making design choices when combining high-capacity Transformer network architectures with generative diffusion models to solve increasingly general tasks on dexterous robotic hardware. Specifically, it identifies, studies, and improves key architectural design decisions for high-capacity diffusion Transformer policies, so that the resulting models can efficiently solve diverse tasks on multiple robot embodiments without painful per-setup hyperparameter tuning. By combining these findings with improved model components, the paper proposes a new architecture called DiT-Block Policy, which significantly outperforms the state of the art on long-horizon (1500+ time steps) dexterous tasks on the bimanual ALOHA robot. The study also finds that the policy scales better when trained on 10 hours of highly multimodal, language-annotated ALOHA demonstration data.

The main contributions of the paper include:

1. **Scalable Attention Blocks**: Adaptive layer normalization (adaLN) blocks are added to the diffusion Transformer policy layers to stabilize training. This simple change improves performance by over 30% on long-horizon, dexterous real-world manipulation tasks.
2. **Efficient Observation Tokenization**: Several methods for tokenizing multiple camera observations are compared, including Vision Transformer and ResNet encoders. A relatively simple implementation (ResNet image tokenizer + Transformer policy) provides over 40% performance improvement compared to the alternatives.
3. **DiT-Block Policy**: The best-performing components are integrated into a unified framework named DiT-Block Policy. This model achieves state-of-the-art performance on both the low-cost bimanual ALOHA robot and the single-arm DROID Franka setup.
4. **Open Source Models and Data**: All data, code, and models are made available to the community, including a new language-annotated dataset, BiPlay, which contains 7023 clips of dexterous bimanual manipulation tasks.

Through these contributions, the paper aims to pave the way for future robot learning techniques that combine the efficiency of generative diffusion modeling with the scalability of large-scale Transformer architectures.
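
To make the adaptive layer normalization (adaLN) idea from the first contribution more concrete, here is a minimal, standard-library-only sketch of the general technique: a token is normalized without a learned affine, and the scale and shift are instead predicted from a conditioning vector. The function names (`layer_norm`, `linear`, `ada_ln`), the `(1 + scale)` parameterization, and the toy dimensions are illustrative assumptions; the text above does not specify the exact block layout used in DiT-Block Policy, so this should not be read as the authors' implementation.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one token (list of floats) to zero mean and unit variance.

    No learned affine parameters: in adaLN they come from the conditioning.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, c):
    """Dense layer; `weights` holds one row per output feature."""
    return [sum(w * ci for w, ci in zip(row, c)) + b
            for row, b in zip(weights, bias)]


def ada_ln(x, cond, w_scale, b_scale, w_shift, b_shift):
    """Adaptive layer norm (assumed form): LN(x) * (1 + scale(cond)) + shift(cond).

    `cond` stands in for whatever conditions the policy (for example an
    embedding of the diffusion time-step and observations); that choice
    is an assumption of this sketch.
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    normed = layer_norm(x)
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]


# Toy usage: a 4-dimensional token and a 2-dimensional conditioning vector.
token = [1.0, 2.0, 3.0, 4.0]
cond = [0.5, -1.0]
zero_w = [[0.0, 0.0] for _ in range(4)]
zero_b = [0.0] * 4

# With zero-initialized modulation, adaLN reduces to plain layer norm.
identity_out = ada_ln(token, cond, zero_w, zero_b, zero_w, zero_b)
```

Zero-initializing the modulation layers, as in the usage lines, makes the block start out as an ordinary layer norm, which is one common way such conditioning is introduced without disturbing early training.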

The Ingredients for Robotic Diffusion Transformers
