Abstract: In this paper, we present DiffusionVLA, a novel framework that seamlessly combines an autoregressive model with a diffusion model for learning visuomotor policy. Central to our approach is a next-token prediction objective, enabling the model to reason effectively over the user's query in the context of current observations. Subsequently, a diffusion model is attached to generate robust action outputs. To enhance policy learning through self-reasoning, we introduce a novel reasoning injection module that integrates reasoning phrases directly into the policy learning process. The whole framework is simple and flexible, making it easy to deploy and upgrade. We conduct extensive experiments using multiple real robots to validate the effectiveness of DiffusionVLA. Our tests include a challenging factory sorting task, where DiffusionVLA successfully categorizes objects, including those not seen during training. We observe that the reasoning module makes the model interpretable. It allows observers to understand the model's thought process and identify potential causes of policy failures. Additionally, we test DiffusionVLA on a zero-shot bin-picking task, achieving 63.7% accuracy on 102 previously unseen objects. Our method demonstrates robustness to visual changes, such as distractors and new backgrounds, and easily adapts to new embodiments. Furthermore, DiffusionVLA can follow novel instructions and retain conversational ability. Notably, DiffusionVLA is data-efficient and fast at inference; our smallest model, DiffusionVLA-2B, runs at 82 Hz on a single A6000 GPU and can train from scratch on fewer than 50 demonstrations for a complex task. Finally, we scale the model from 2B to 72B parameters, showcasing improved generalization capabilities with increased model size.

What problem does this paper attempt to address?

This paper aims to combine the advantages of autoregressive models and diffusion models in robotic manipulation, in order to achieve more efficient and more general vision-language-action (VLA) policy learning. Specifically, it proposes a new framework named DiVLA (Diffusion-VLA), which unifies autoregressive and diffusion modeling to strengthen the robot's self-reasoning and action-generation abilities. The main problems the paper addresses are:

1. **Continuity and Precision of Action Data**:
   - Autoregressive models usually discretize continuous action data into fixed-size tokens, which disrupts the coherence and precision of actions.
   - Diffusion models handle multimodal action distributions well and can generate action sequences more quickly, but lack the reasoning ability of autoregressive models.
2. **Real-Time Performance**:
   - Existing VLA models generate actions inefficiently, which is a problem in robotic applications that require real-time responses.
   - DiVLA achieves high-speed inference by combining the reasoning ability of autoregressive models with the efficient action generation of diffusion models (for example, DiVLA-2B reaches an inference speed of 82 Hz on a single A6000 GPU).
3. **Visual Generalization Ability**:
   - Existing models perform poorly when facing new backgrounds or distractors.
   - DiVLA is robust to such visual changes.
4. **Zero-Shot Task Adaptability**:
   - Existing models perform poorly when handling unseen objects or tasks.
   - DiVLA achieves 63.7% accuracy on a zero-shot bin-picking task with 102 previously unseen objects, showing that it generalizes to unseen objects.
5. **Adaptability to New Embodiments**:
   - Existing models require a large amount of adjustment when adapting to different robot embodiments.
   - DiVLA adapts to different robot embodiments and achieves high performance on dual-arm robots with minimal adjustment.
6. **Enhancement of Reasoning Ability**:
   - Existing models lack a transparent reasoning process when performing complex tasks.
   - DiVLA improves interpretability and robustness by introducing a reasoning injection module that embeds the reasoning signal directly into the policy model.

With these improvements, the paper reports that DiVLA outperforms existing VLA models and shows strong generalization and adaptability across multiple practical tasks.
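
The precision cost of discretizing continuous actions into fixed-size tokens can be made concrete with a small sketch. This is not DiVLA's method or any specific model's tokenizer; it is a generic uniform binning scheme, and the action range `[-1, 1]`, the 256-bin vocabulary, and the names `make_tokenizer`, `encode`, and `decode` are all assumptions chosen for illustration.

```python
def make_tokenizer(low, high, n_bins):
    """Uniform binning of a scalar action in [low, high] into n_bins tokens.

    Hypothetical helper for illustration; not taken from the paper.
    """
    width = (high - low) / n_bins

    def encode(action):
        # Clamp to the valid range, then map to a bin index in [0, n_bins - 1].
        clamped = min(max(action, low), high)
        return min(int((clamped - low) / width), n_bins - 1)

    def decode(token):
        # Reconstruct the action as the center of its bin.
        return low + (token + 0.5) * width

    return encode, decode, width


# Assumed setup: normalized actions in [-1, 1], 256 tokens per dimension.
encode, decode, width = make_tokenizer(-1.0, 1.0, 256)

# A smooth one-dimensional trajectory sweeping the whole action range.
trajectory = [-1.0 + 2.0 * i / 999 for i in range(1000)]

# Round-trip every action through the tokenizer and measure the worst error.
max_err = max(abs(decode(encode(a)) - a) for a in trajectory)
print(f"bin width = {width}, worst round-trip error = {max_err}")
```

Under these assumptions the round-trip error is bounded by half a bin width, so every decoded action can be off by a small but nonzero amount, and a smooth trajectory becomes a staircase. Predicting continuous actions directly, as a diffusion head does, avoids this particular quantization step.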

Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation

Diffusion Transformer Policy

Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning

OpenVLA: An Open-Source Vision-Language-Action Model

LVDiffusor: Distilling Functional Rearrangement Priors from Large Models into Diffusor

DiffuseLoco: Real-Time Legged Locomotion Control with Diffusion from Offline Datasets

One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation

Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation

DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Language Control Diffusion: Efficiently Scaling through Space, Time, and Tasks

Modular Deep Q Networks for Sim-to-real Transfer of Visuo-motor Policies

LaVin-DiT: Large Vision Diffusion Transformer

Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training

Diffusion Augmented Agents: A Framework for Efficient Exploration and Transfer Learning

Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning