Abstract: Large-scale generative models have achieved remarkable success in a number of domains. However, for sequential decision-making problems, such as robotics, action-labelled data is often scarce and therefore scaling-up foundation models for decision-making remains a challenge. A potential solution lies in leveraging widely-available unlabelled videos to train world models that simulate the consequences of actions. If the world model is accurate, it can be used to optimize decision-making in downstream tasks. Image-to-video diffusion models are already capable of generating highly realistic synthetic videos. However, these models are not action-conditioned, and the most powerful models are closed-source which means they cannot be finetuned. In this work, we propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model. Our approach, AVID, trains an adapter on a small domain-specific dataset of action-labelled videos. AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos. We evaluate AVID on video game and real-world robotics data, and show that it outperforms existing baselines for diffusion model adaptation. Our results demonstrate that if utilized correctly, pretrained video models have the potential to be powerful tools for embodied AI.

What problem does this paper attempt to address?

This paper attempts to address the problem of how to adapt pre-trained video diffusion models into action-conditioned world models without accessing the pre-trained model parameters. Specifically, the authors propose a method called AVID, which modifies the intermediate outputs of the pre-trained model by training an adapter on a small amount of labeled video data from a specific domain, thereby generating accurate action-conditioned videos.

### Background and Challenges

1. **Success of Large-Scale Generative Models**:
   - Large-scale generative models have achieved significant success in multiple domains, but extending foundation models to sequential decision problems (such as robotics) remains a challenge because action-labeled data is scarce.
2. **Utilization of Unlabeled Videos**:
   - Widely available unlabeled videos can be used to train world models that simulate the consequences of actions and can then be used to optimize decision-making in downstream tasks.
3. **Limitations of Existing Models**:
   - Current image-to-video diffusion models can generate highly realistic synthetic videos, but they are not action-conditioned, and the most powerful models are typically closed-source and cannot be fine-tuned.

### Solution

1. **AVID Method**:
   - AVID modifies the intermediate outputs of the pre-trained model by training an adapter on a small amount of labeled video data from a specific domain, thereby generating accurate action-conditioned videos.
   - The adapter uses a learned mask to modify the intermediate outputs of the pre-trained model and generate accurate action-conditioned videos.
2. **Main Contributions**:
   - Proposes a method to adapt pre-trained video diffusion models into action-conditioned world models without accessing the pre-trained model parameters.
   - Analyzes the limitations of the adaptation method proposed by Yang et al. (2024b).
   - Introduces AVID, a novel method for adapting pre-trained diffusion models by applying a learned mask to combine the outputs of the pre-trained model with the conditional outputs of the domain-specific adapter.

### Experimental Results

1. **Evaluation Datasets**:
   - Evaluated AVID on video game data and real-world robotics data.
   - Used a model with 140 million parameters as the pre-trained model.
2. **Performance Comparison**:
   - AVID outperformed existing baseline methods on all evaluation metrics.
   - Particularly, AVID showed significantly better performance than other methods with smaller model sizes.

### Conclusion

This paper demonstrates that the AVID method can effectively adapt pre-trained video diffusion models into action-conditioned world models without accessing the pre-trained model parameters, thereby generating accurate action-conditioned videos. This provides new possibilities for leveraging large-scale pre-trained models in resource-constrained environments.
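To make the mask-based combination more concrete, here is a minimal NumPy sketch of blending a frozen model's output with an adapter's output. It is an illustration only, not the authors' implementation: the function name `combine_noise_predictions`, the sigmoid-gated convex blend, and the tensor layout are all assumptions, since the summary only states that a learned mask combines the two outputs. In a real system the mask logits and adapter output would come from a trained, action-conditioned network; here they are random placeholders.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function, mapping logits to values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def combine_noise_predictions(eps_pretrained, eps_adapter, mask_logits):
    """Blend two noise predictions with a per-element mask.

    ASSUMPTION: a convex blend gated by a sigmoid mask. The summary does not
    give the exact formula, so this is one plausible form, not the paper's.
    Only the *outputs* of the pretrained model are used; its parameters are
    never touched, matching the black-box setting described above.
    """
    mask = sigmoid(mask_logits)
    return mask * eps_pretrained + (1.0 - mask) * eps_adapter


# Hypothetical shapes: (batch, frames, channels, height, width).
rng = np.random.default_rng(0)
shape = (2, 4, 3, 8, 8)
eps_pre = rng.standard_normal(shape)      # stand-in for the frozen model's output
eps_ada = rng.standard_normal(shape)      # stand-in for the adapter's output
mask_logits = rng.standard_normal(shape)  # stand-in for the adapter's mask head

eps_combined = combine_noise_predictions(eps_pre, eps_ada, mask_logits)
```

Under this assumed form, a mask value near 1 keeps the pretrained prediction for that element, while a value near 0 defers to the adapter, so the adapter can override the pretrained model only where the action matters.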

AVID: Adapting Video Diffusion Models to World Models

Probabilistic Adaptation of Text-to-Video Models

AIM: Adapting Image Models for Efficient Video Action Recognition

VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation

Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training

Diffusion Models for Video Prediction and Infilling

AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction

Video Diffusion Models

Diffusion Transformer Policy

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Grounding Video Models to Actions through Goal Conditioned Exploration

Animate3D: Animating Any 3D Model with Multi-view Video Diffusion

Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Prediction with Action: Visual Policy Learning via Joint Denoising Process

AICL: Action In-Context Learning for Video Diffusion Model

VideoAgent: Self-Improving Video Generation

Masked Diffusion with Task-awareness for Procedure Planning in Instructional Videos

Video Diffusion Alignment via Reward Gradients