LLM-grounded Video Diffusion Models

Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li
2024-05-05
Abstract: Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion. To address these limitations, we introduce LLM-grounded Video Diffusion (LVD). Instead of directly generating videos from the text inputs, LVD first leverages a large language model (LLM) to generate dynamic scene layouts based on the text inputs and subsequently uses the generated layouts to guide a diffusion model for video generation. We show that LLMs are able to understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with both the prompts and the object motion patterns typically observed in the real world. We then propose to guide video diffusion models with these layouts by adjusting the attention maps. Our approach is training-free and can be integrated into any video diffusion model that admits classifier guidance. Our results demonstrate that LVD significantly outperforms its base video diffusion model and several strong baseline methods in faithfully generating videos with the desired attributes and motion patterns.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses how text-conditioned video generation models can better capture complex spatiotemporal dynamics. Current models often produce restricted or incorrect motion when given intricate spatiotemporal prompts. To address this, the paper proposes LLM-grounded Video Diffusion (LVD). LVD first uses a large language model (LLM) to generate dynamic scene layouts from the text input, and then uses these layouts to guide a video diffusion model during generation. The study finds that LLMs can understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with both the prompts and the object motion patterns typically observed in the real world. LVD guides the video diffusion model by adjusting its attention maps, which requires no additional training and can be integrated into any video diffusion model that admits classifier guidance. Experimental results show that LVD significantly outperforms its base video diffusion model and several strong baseline methods in generating videos with the desired attributes and motion patterns. The paper also proposes a benchmark of five tasks to evaluate the alignment between input prompts and generated videos. LVD performs well on these tasks, demonstrating that it can generate high-quality videos that align closely with the text prompts. LVD also shows consistent improvements in evaluations on common datasets such as UCF-101 and MSR-VTT. In conclusion, the paper addresses the limitations of existing text-to-video models in understanding and generating complex spatiotemporal dynamics by having an LLM generate dynamic scene layouts, which improves both the quality of the generated videos and their correspondence with the text prompts.
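To make the layout-guidance idea concrete, here is a toy sketch in Python with NumPy. It assumes a dynamic scene layout can be represented as one normalized bounding box per frame, and it scores how much of an object's attention falls inside its box. The names `box_mask` and `layout_energy`, the box format, and the energy definition are illustrative assumptions, not the paper's formulation. In actual guidance, a score like this would be differentiated with respect to the diffusion latents at each denoising step, which this sketch omits.

```python
import numpy as np


def box_mask(box, height, width):
    """Binary mask for a normalized (x0, y0, x1, y1) box on a height x width grid."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float32)
    r0, r1 = int(round(y0 * height)), int(round(y1 * height))
    c0, c1 = int(round(x0 * width)), int(round(x1 * width))
    mask[r0:r1, c0:c1] = 1.0
    return mask


def layout_energy(attn, boxes):
    """Mean fraction of attention that falls OUTSIDE the per-frame boxes.

    attn:  array of shape (frames, H, W), non-negative attention for one object.
    boxes: one normalized box per frame (hypothetical layout format).
    Returns a value in [0, 1]; lower means attention follows the layout better.
    """
    frames, height, width = attn.shape
    energy = 0.0
    for t in range(frames):
        mask = box_mask(boxes[t], height, width)
        inside = float((attn[t] * mask).sum())
        total = float(attn[t].sum()) + 1e-8  # avoid division by zero
        energy += 1.0 - inside / total
    return energy / frames


# Hypothetical layout: an object moving left to right over 4 frames.
boxes = [(0.25 * t, 0.25, 0.25 * (t + 1), 0.75) for t in range(4)]

# Attention that exactly follows the layout, and attention that moves the wrong way.
aligned = np.stack([box_mask(b, 8, 8) for b in boxes])
misaligned = aligned[::-1]

energy_aligned = layout_energy(aligned, boxes)
energy_misaligned = layout_energy(misaligned, boxes)
```

Here `energy_aligned` is close to 0 and `energy_misaligned` is close to 1, so pushing this score down during sampling would pull the object toward the motion the layout describes.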