AVID: Adapting Video Diffusion Models to World Models

Marc Rigter, Tarun Gupta, Agrin Hilmkil, Chao Ma
2024-10-01
Abstract: Large-scale generative models have achieved remarkable success in a number of domains. However, for sequential decision-making problems, such as robotics, action-labelled data is often scarce and therefore scaling-up foundation models for decision-making remains a challenge. A potential solution lies in leveraging widely-available unlabelled videos to train world models that simulate the consequences of actions. If the world model is accurate, it can be used to optimize decision-making in downstream tasks. Image-to-video diffusion models are already capable of generating highly realistic synthetic videos. However, these models are not action-conditioned, and the most powerful models are closed-source which means they cannot be finetuned. In this work, we propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model. Our approach, AVID, trains an adapter on a small domain-specific dataset of action-labelled videos. AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos. We evaluate AVID on video game and real-world robotics data, and show that it outperforms existing baselines for diffusion model adaptation. Our results demonstrate that if utilized correctly, pretrained video models have the potential to be powerful tools for embodied AI.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of adapting pre-trained video diffusion models into action-conditioned world models without access to the pre-trained model's parameters. Specifically, the authors propose a method called AVID, which trains an adapter on a small amount of action-labeled video data from a specific domain and uses it to modify the intermediate outputs of the pre-trained model, thereby generating accurate action-conditioned videos.

### Background and Challenges

1. **Success of Large-Scale Generative Models**:
   - Large-scale generative models have achieved significant success in multiple domains, but scaling up foundation models for sequential decision-making problems (such as robotics) remains a challenge because action-labeled data is scarce.
2. **Utilization of Unlabeled Videos**:
   - Widely available unlabeled videos can be used to train world models that simulate the consequences of actions. An accurate world model can then be used to optimize decision-making in downstream tasks.
3. **Limitations of Existing Models**:
   - Current image-to-video diffusion models can generate highly realistic synthetic videos, but they are not action-conditioned, and the most powerful models are typically closed-source and cannot be fine-tuned.

### Solution

1. **AVID Method**:
   - AVID trains an adapter on a small amount of action-labeled video data from a specific domain.
   - The adapter uses a learned mask to modify the intermediate outputs of the pre-trained model and generate accurate action-conditioned videos.
2. **Main Contributions**:
   - Proposes a method to adapt pre-trained video diffusion models into action-conditioned world models without accessing the pre-trained model parameters.
   - Analyzes the limitations of the adaptation method proposed by Yang et al. (2024b).
   - Introduces AVID, a novel method for adapting pre-trained diffusion models by applying a learned mask to combine the outputs of the pre-trained model with the conditional outputs of the domain-specific adapter.

### Experimental Results

1. **Evaluation Datasets**:
   - Evaluated AVID on video game data and real-world robotics data.
   - Used a model with 140 million parameters as the pre-trained model.
2. **Performance Comparison**:
   - AVID outperformed existing baseline methods on all evaluation metrics.
   - In particular, AVID showed significantly better performance than other methods with smaller model sizes.

### Conclusion

This paper demonstrates that AVID can effectively adapt pre-trained video diffusion models into action-conditioned world models without accessing the pre-trained model parameters, thereby generating accurate action-conditioned videos. This provides new possibilities for leveraging large-scale pre-trained models in resource-constrained environments.
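
The mask-based combination described in the summary can be illustrated with a small sketch. This is not the authors' implementation: the summary does not state the exact combination rule, so the code below assumes an elementwise blend in which a mask value near 1 keeps the pre-trained model's output and a value near 0 defers to the adapter. The function name `combine_noise_predictions`, the flat-list representation of a noise prediction, and the use of a sigmoid to produce the mask are all hypothetical choices made for illustration. The point it shows is that only the *outputs* of the pre-trained model are needed, never its parameters.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Squash a real-valued logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def combine_noise_predictions(
    eps_pretrained: List[float],
    eps_adapter: List[float],
    mask_logits: List[float],
) -> List[float]:
    """Blend two noise predictions elementwise using a learned mask.

    eps_pretrained: output of the frozen, possibly closed-source model
                    (action-unconditioned); treated as a black box.
    eps_adapter:    output of the small adapter, conditioned on actions.
    mask_logits:    per-element logits emitted by the adapter (assumed);
                    sigmoid turns them into mask values in (0, 1).

    Assumed rule (not taken from the paper):
        eps = m * eps_pretrained + (1 - m) * eps_adapter
    """
    if not (len(eps_pretrained) == len(eps_adapter) == len(mask_logits)):
        raise ValueError("all inputs must have the same length")

    combined = []
    for e_pre, e_ad, logit in zip(eps_pretrained, eps_adapter, mask_logits):
        m = sigmoid(logit)  # mask value for this element
        combined.append(m * e_pre + (1.0 - m) * e_ad)
    return combined


if __name__ == "__main__":
    # Toy values: first element trusts the pre-trained model,
    # second element trusts the adapter, third is an even blend.
    eps_pre = [1.0, 1.0, 1.0]
    eps_ad = [3.0, 3.0, 3.0]
    logits = [50.0, -50.0, 0.0]
    print(combine_noise_predictions(eps_pre, eps_ad, logits))
```

In a real setting the inputs would be video-shaped tensors and the mask would be trained jointly with the adapter on the action-labeled dataset. The flat lists here only keep the sketch dependency-free.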