Abstract: Recent innovations and breakthroughs in diffusion models have significantly expanded the possibilities of generating high-quality videos from given prompts. Most existing works tackle the single-scene scenario, with only one video event occurring in a single background. Extending to multi-scene video generation is nevertheless not trivial: it requires carefully managing the logic between scenes while preserving the consistent visual appearance of key content across them. In this paper, we propose a novel framework, namely VideoStudio, for consistent-content and multi-scene video generation. Technically, VideoStudio leverages Large Language Models (LLM) to convert the input prompt into a comprehensive multi-scene script that benefits from the logical knowledge learnt by the LLM. The script for each scene includes a prompt describing the event, the foreground/background entities, and the camera movement. VideoStudio identifies the common entities throughout the script and asks the LLM to detail each entity. The resultant entity description is then fed into a text-to-image model to generate a reference image for each entity. Finally, VideoStudio outputs a multi-scene video by generating each scene video via a diffusion process that takes the reference images, the descriptive prompt of the event, and the camera movement into account. The diffusion model incorporates the reference images as the condition and alignment to strengthen the content consistency of multi-scene videos. Extensive experiments demonstrate that VideoStudio outperforms the SOTA video generation models in terms of visual quality, content consistency, and user preference. Source code is available at https://github.com/FuchenUSTC/VideoStudio
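
The script structure and the common-entity step described in the abstract can be illustrated with a minimal sketch. This is not the VideoStudio implementation: the `Scene` class, its field names, the `common_entities` helper, the rule that an entity is "common" when it appears in at least two scenes, and the example script are all assumptions made here for illustration. The LLM, text-to-image, and video diffusion stages are left out.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Scene:
    """One scene of a multi-scene script (hypothetical field names)."""
    prompt: str                                   # text describing the event
    foreground: list[str] = field(default_factory=list)
    background: list[str] = field(default_factory=list)
    camera: str = "static"                        # camera movement label


def common_entities(scenes, min_scenes=2):
    """Return entities seen in at least `min_scenes` scenes, in first-seen order.

    The threshold is an assumption; the abstract does not define "common".
    """
    counts = Counter()
    order = []
    for scene in scenes:
        # Count each entity once per scene, whether foreground or background.
        for name in dict.fromkeys(scene.foreground + scene.background):
            if name not in counts:
                order.append(name)
            counts[name] += 1
    return [name for name in order if counts[name] >= min_scenes]


# Hypothetical three-scene script, standing in for LLM output.
script = [
    Scene("A man walks into a kitchen", ["man"], ["kitchen"], "zoom in"),
    Scene("The man cooks pasta", ["man", "pasta"], ["kitchen"], "static"),
    Scene("The man eats on a balcony", ["man", "pasta"], ["balcony"], "pan left"),
]
shared = common_entities(script)
```

In the pipeline the abstract outlines, each entity in `shared` would next be described in detail by the LLM and rendered as a reference image, which then conditions the generation of every scene that contains it.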

Contrastive Sequential-Diffusion Learning: Non-linear and Multi-Scene Instructional Video Synthesis

VideoStudio: Generating Consistent-Content and Multi-Scene Videos

InstructVid2Vid: Controllable Video Editing with Natural Language Instructions

Video ControlNet: Towards Temporally Consistent Synthetic-to-Real Video Translation Using Conditional Image Diffusion Models

Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion

Strumming to the Beat: Audio-Conditioned Contrastive Video Textures

SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

CoNo: Consistency Noise Injection for Tuning-free Long Video Diffusion

Video Creation by Demonstration

VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models

Video In-context Learning

InstructVideo: Instructing Video Diffusion Models with Human Feedback

Animate Your Motion: Turning Still Images into Dynamic Videos

MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video Infilling

Video Diffusion Models with Local-Global Context Guidance

DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models

DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis

Structure and Content-Guided Video Synthesis with Diffusion Models

Semantically Consistent Video Inpainting with Conditional Diffusion Models

Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control