AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction

Zhen Xing, Qi Dai, Zejia Weng, Zuxuan Wu, Yu-Gang Jiang
2024-06-11
Abstract: Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction, which has wide applications in virtual reality, robotics, and content creation. Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task. However, they struggle with frame consistency and temporal stability primarily due to the limited scale of video datasets. We observe that pretrained Image2Video diffusion models possess good priors for video dynamics but they lack textual control. Hence, transferring Image2Video models to leverage their video dynamic priors while injecting instruction control to generate controllable videos is both a meaningful and challenging task. To achieve this, we introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions. More specifically, we design a dual query transformer (DQFormer) architecture, which integrates the instructions and frames into the conditional embeddings for future frame prediction. Additionally, we develop Long-Short Term Temporal Adapters and Spatial Adapters that can quickly transfer general video diffusion models to specific scenarios with minimal training costs. Experimental results show that our method significantly outperforms state-of-the-art techniques on four datasets: Something Something V2, Epic Kitchen-100, Bridge Data, and UCF-101. Notably, AID achieves 91.2% and 55.5% FVD improvements on Bridge and SSv2 respectively, demonstrating its effectiveness in various domains. More examples can be found at our website: https://chenhsing.github.io/AID
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning, Multimedia
What problem does this paper attempt to address?
The paper addresses text-guided video prediction (TVP): predicting the future frames of a video given its initial frames and a text instruction. The task has a wide range of applications in virtual reality, robotics, and content creation. Existing TVP methods struggle with inter-frame consistency and temporal stability, mainly because the available video datasets are limited in scale. The paper therefore proposes transferring pre-trained image-to-video diffusion models and injecting text control, so that the generated videos are both controllable and high-quality.

The main challenges mentioned in the paper include:
- **Understanding the initial frames**: How to extract useful information from the initial frames.
- **Aligning the initial frames with the text instructions**: How to effectively combine the initial frames with the text instructions.
- **Generating consistent future frames**: How to ensure the visual consistency and temporal coherence of the generated future frames.

To solve these problems, the authors introduce a Multimodal Large Language Model (MLLM) to predict future video states, design the Dual Query Transformer (DQFormer) architecture to integrate instructions and frames into the conditional embeddings, and develop Long-Short Term Temporal Adapters and Spatial Adapters. Together, these components let the model transfer efficiently to specific datasets while maintaining the quality and stability of the generated videos. The experimental results show that the method significantly outperforms existing state-of-the-art methods on four datasets: Something Something V2, Epic Kitchen-100, Bridge Data, and UCF-101. For example, it improves the FVD metric by 91.2% on Bridge Data and by 55.5% on Something Something V2.
In conclusion, by transferring pre-trained image-to-video diffusion models and injecting text control, the paper addresses the problems of inter-frame consistency and temporal stability in the TVP task, and its results on four datasets suggest potential across a range of application scenarios.
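The FVD improvements quoted above (91.2% and 55.5%) can be read as relative reductions against a baseline, since a lower FVD is better. The text does not state the exact formula, so the following is only a minimal sketch under that assumption; the function name and the numeric values are hypothetical and are not results from the paper.

```python
def relative_fvd_improvement(baseline_fvd: float, new_fvd: float) -> float:
    """Percentage reduction in FVD relative to a baseline (lower FVD is better).

    Assumption: "improvement" means (baseline - new) / baseline; the summary
    above does not spell out how the percentages were computed.
    """
    if baseline_fvd <= 0:
        raise ValueError("baseline_fvd must be positive")
    return 100.0 * (baseline_fvd - new_fvd) / baseline_fvd


# Hypothetical values chosen only for illustration; not taken from the paper.
baseline = 200.0
ours = 50.0
print(f"{relative_fvd_improvement(baseline, ours):.1f}% lower FVD")  # 75.0% lower FVD
```

Under this reading, a 91.2% improvement would mean the new FVD is about 8.8% of the baseline value.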