Abstract: Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction, which has wide applications in virtual reality, robotics, and content creation. Previous TVP methods have made significant progress by adapting Stable Diffusion for this task. However, they struggle with frame consistency and temporal stability, primarily because of the limited scale of video datasets. We observe that pretrained Image2Video diffusion models possess good priors for video dynamics but lack textual control. Hence, transferring Image2Video models to leverage their video dynamic priors while injecting instruction control to generate controllable videos is both a meaningful and challenging task. To achieve this, we introduce a Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions. More specifically, we design a dual query transformer (DQFormer) architecture, which integrates the instructions and frames into the conditional embeddings for future frame prediction. Additionally, we develop Long-Short Term Temporal Adapters and Spatial Adapters that can quickly transfer general video diffusion models to specific scenarios with minimal training costs. Experimental results show that our method significantly outperforms state-of-the-art techniques on four datasets: Something Something V2, Epic Kitchen-100, Bridge Data, and UCF-101. Notably, our method, AID, achieves 91.2% and 55.5% FVD improvements on Bridge and SSv2 respectively, demonstrating its effectiveness in various domains. More examples can be found at our website: https://chenhsing.github.io/AID

What problem does this paper attempt to address?

This paper addresses Text-guided Video Prediction (TVP): predicting the future frames of a video from its initial frames and a text instruction. The task has a wide range of applications in fields such as virtual reality, robotics, and content creation. However, existing TVP methods struggle with inter-frame consistency and temporal stability, mainly because of the limited scale of available video datasets. The paper therefore proposes transferring pre-trained image-to-video diffusion models and injecting text control, so that the generated videos are both controllable and high in quality.

The main challenges mentioned in the paper include:

- **Understanding the initial frames**: How to extract useful information from the initial frames.
- **Aligning the initial frames with the text instructions**: How to effectively combine the initial frames with the text instructions.
- **Generating consistent future frames**: How to ensure the visual consistency and temporal coherence of the generated future frames.

To solve these problems, the authors introduce a Multi-Modal Large Language Model (MLLM), design the Dual Query Transformer (DQFormer) architecture, and develop Long-Short Term Temporal Adapters and Spatial Adapters. Together, these components let the model perform efficient transfer learning on specific datasets while maintaining the quality and stability of the generated videos.

The experimental results show that this method significantly outperforms existing state-of-the-art methods on four datasets: Something Something V2, Epic Kitchen-100, Bridge Data, and UCF-101. For example, the method improves FVD by 91.2% on Bridge Data and by 55.5% on Something Something V2.
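To make the reported percentages concrete, the sketch below shows how a relative FVD improvement is commonly computed. It assumes the 91.2% and 55.5% figures are relative reductions of FVD against a baseline (lower FVD is better); the summary does not state the formula. The function name and the numbers in the example are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline_fvd: float, new_fvd: float) -> float:
    """Return the percentage reduction in FVD relative to a baseline.

    Assumes lower FVD is better, so a positive result means the new
    method improved on the baseline.
    """
    if baseline_fvd <= 0:
        raise ValueError("baseline_fvd must be positive")
    return 100.0 * (baseline_fvd - new_fvd) / baseline_fvd


# Hypothetical scores for illustration only (not results from the paper).
baseline_score = 200.0
new_score = 50.0
improvement = relative_improvement(baseline_score, new_score)
print(f"{improvement:.1f}%")  # 75.0%
```

Under this reading, a 91.2% improvement would mean the new FVD is about 8.8% of the baseline value.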
In conclusion, this paper proposes an effective method for the TVP task. By transferring pre-trained image-to-video diffusion models and injecting text control, it addresses the problems of inter-frame consistency and temporal stability, showing potential in a range of application scenarios.
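The idea of adapting a pretrained model at low training cost can be illustrated with a generic residual bottleneck adapter. This is a minimal sketch of the general adapter technique, not the paper's Long-Short Term Temporal Adapter or Spatial Adapter, whose internals are not described here. The class name, the dimensions, and the NumPy stand-in for a frozen pretrained layer are all assumptions made for illustration, and the sketch does not model any mixing across frames.

```python
import numpy as np


class BottleneckAdapter:
    """Generic residual adapter: down-project, ReLU, up-project, add input.

    Hypothetical illustration only; not the architecture from the paper.
    """

    def __init__(self, dim: int, bottleneck: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        # Zero-initialised up-projection: the adapter starts as an identity
        # mapping, so the pretrained behaviour is unchanged before training.
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, x: np.ndarray) -> np.ndarray:
        hidden = np.maximum(x @ self.w_down, 0.0)  # ReLU
        return x + hidden @ self.w_up  # residual connection

    def num_params(self) -> int:
        return self.w_down.size + self.w_up.size


dim, bottleneck = 64, 8
rng = np.random.default_rng(1)

# Stands in for one frozen pretrained layer; it would never be updated.
frozen_weight = rng.normal(size=(dim, dim))
adapter = BottleneckAdapter(dim, bottleneck)

# Toy features shaped (frames, tokens, channels).
features = rng.normal(size=(4, 16, dim))
base_out = features @ frozen_weight
adapted_out = adapter(base_out)

print(np.allclose(adapted_out, base_out))        # True
print(adapter.num_params(), frozen_weight.size)  # 1024 4096
```

In this toy setting only the adapter's 1024 parameters would be trained while the 4096-parameter layer stays frozen, which is the general reason adapter-based transfer is cheap.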

AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction
