LLM-grounded Video Diffusion Models

Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li

2024-05-05

Abstract: Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion. To address these limitations, we introduce LLM-grounded Video Diffusion (LVD). Instead of directly generating videos from the text inputs, LVD first leverages a large language model (LLM) to generate dynamic scene layouts based on the text inputs and subsequently uses the generated layouts to guide a diffusion model for video generation. We show that LLMs are able to understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with both the prompts and the object motion patterns typically observed in the real world. We then propose to guide video diffusion models with these layouts by adjusting the attention maps. Our approach is training-free and can be integrated into any video diffusion model that admits classifier guidance. Our results demonstrate that LVD significantly outperforms its base video diffusion model and several strong baseline methods in faithfully generating videos with the desired attributes and motion patterns.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

This paper addresses how text-conditioned video generation models can better capture complex spatiotemporal dynamics. Current models often produce restricted or incorrect motion when given intricate spatiotemporal prompts. To address this, the paper proposes LLM-grounded Video Diffusion (LVD). LVD first uses a large language model (LLM) to generate dynamic scene layouts from the text input, and then uses these layouts to guide a video diffusion model during generation. The study finds that LLMs can understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with both the prompts and the object motion patterns typically observed in the real world. LVD guides the video diffusion model by adjusting its attention maps, which requires no additional training and can be integrated into any video diffusion model that admits classifier guidance. Experimental results show that LVD significantly outperforms its base video diffusion model and several strong baseline methods in generating videos with the desired attributes and motion patterns. The paper also proposes a benchmark of five tasks to evaluate the alignment between input prompts and generated videos. LVD performs well on these tasks, demonstrating its ability to generate high-quality videos that are closely aligned with the text prompts. Evaluations on the common datasets UCF-101 and MSR-VTT also show consistent improvements for LVD. In short, the paper addresses the limitations of existing text-to-video models in understanding and generating complex spatiotemporal dynamics by having an LLM generate dynamic scene layouts, which improves both video quality and correspondence with the text prompt.
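To make the layout-guidance idea more concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the paper's implementation: the `layout` structure, the function names, and the simple "mean attention outside the box minus mean attention inside the box" energy are all hypothetical stand-ins. The hand-written layout plays the role of the LLM output, and random or synthetic arrays play the role of a diffusion model's attention maps.

```python
import numpy as np

# Hypothetical stand-in for an LLM-generated dynamic scene layout:
# one normalized (x0, y0, x1, y1) box per frame for an object moving left to right.
layout = [
    {"ball": (0.00, 0.25, 0.25, 0.75)},
    {"ball": (0.25, 0.25, 0.50, 0.75)},
    {"ball": (0.50, 0.25, 0.75, 0.75)},
    {"ball": (0.75, 0.25, 1.00, 0.75)},
]


def box_to_mask(box, height, width):
    """Rasterize a normalized box into a boolean (height, width) mask."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=bool)
    mask[int(round(y0 * height)):int(round(y1 * height)),
         int(round(x0 * width)):int(round(x1 * width))] = True
    return mask


def frame_energy(attn, mask):
    """Illustrative energy: low when attention concentrates inside the box.

    Assumes the box is neither empty nor the full frame.
    """
    return float(attn[~mask].mean() - attn[mask].mean())


def video_energy(attn_maps, layout, name):
    """Sum the per-frame energies for one object over all frames."""
    _, height, width = attn_maps.shape
    return sum(
        frame_energy(attn_maps[t], box_to_mask(frame[name], height, width))
        for t, frame in enumerate(layout)
    )


# Synthetic attention maps: one that follows the layout and one that avoids it.
aligned = np.stack([box_to_mask(f["ball"], 8, 8).astype(float) for f in layout])
misaligned = 1.0 - aligned

print(video_energy(aligned, layout, "ball"))     # negative: attention matches the boxes
print(video_energy(misaligned, layout, "ball"))  # positive: attention lies outside the boxes
```

In a guidance setting, an energy of this kind would be differentiated with respect to the noisy latent at each denoising step so that the update pushes attention toward the boxes. That gradient step needs an actual diffusion model and is omitted here.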
