FlexiFilm: Long Video Generation with Flexible Conditions

Yichen Ouyang, Jianhao Yuan, Hao Zhao, Gaoang Wang, Bo Zhao

2024-04-29

Abstract: Generating long and consistent videos has emerged as a significant yet challenging problem. While most existing diffusion-based video generation models, derived from image generation models, demonstrate promising performance in generating short videos, their simple conditioning mechanism and sampling strategy, originally designed for image generation, cause severe performance degradation when adapted to long video generation. This results in prominent temporal inconsistency and overexposure. Thus, in this work, we introduce FlexiFilm, a new diffusion model tailored for long video generation. Our framework incorporates a temporal conditioner to establish a more consistent relationship between generation and multi-modal conditions, and a resampling strategy to tackle overexposure. Empirical results demonstrate that FlexiFilm generates long and consistent videos, each over 30 seconds in length, outperforming competitors in qualitative and quantitative analyses. Project page:

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper aims to address a series of challenges that arise when generating long videos. Specifically, existing video generation methods based on diffusion models perform well for short videos but face the following two major issues when generating long videos:

1. **Temporal Consistency Issue**: Due to insufficient conditioning mechanisms, existing models struggle to handle complex dynamic changes and semantic information in long videos, resulting in poor temporal consistency in the generated videos.
2. **Overexposure Issue**: The noise scheduling strategy fails to ensure that the final signal-to-noise ratio (SNR) is zero during the denoising process, leading to overexposure or structural collapse in the generated videos.

To address these issues, the paper proposes the FlexiFilm model, a diffusion model specifically designed for long video generation. The model includes two main components:

- **Temporal Conditioner**: Used to establish a more consistent relationship between the generated frames and multimodal conditions.
- **Resampling Strategy**: Used to address the non-zero SNR issue in multiple rounds of inference, thereby improving the quality and consistency of the generated videos.

Experimental validation shows that FlexiFilm outperforms existing baseline methods in generating high-quality, consistent long videos of over 30 seconds.
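
The non-zero terminal SNR behind the overexposure issue can be illustrated with a small numerical sketch. It is not FlexiFilm's noise schedule or its resampling strategy. It assumes a generic DDPM-style linear beta schedule, and the helper names and default values (`linear_betas`, `snr_per_step`, 1000 steps, betas from 1e-4 to 0.02) are hypothetical choices made for illustration. With the cumulative signal fraction $\bar{\alpha}_t = \prod_{s \le t} (1 - \beta_s)$, the SNR at step $t$ is $\bar{\alpha}_t / (1 - \bar{\alpha}_t)$, and under such a schedule it gets small at the last step but never reaches exactly zero.

```python
def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced per-step noise variances (assumed defaults, not from the paper)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def snr_per_step(betas):
    """Return SNR(t) = alpha_bar_t / (1 - alpha_bar_t) for every step t."""
    snrs = []
    alpha_bar = 1.0  # running product of (1 - beta_s)
    for beta in betas:
        alpha_bar *= 1.0 - beta
        snrs.append(alpha_bar / (1.0 - alpha_bar))
    return snrs


snrs = snr_per_step(linear_betas())
terminal_snr = snrs[-1]

# The terminal SNR is small but strictly positive, so the last training
# step still carries a little signal, while sampling starts from pure noise.
print(f"terminal SNR: {terminal_snr:.3e}")
```

Under these assumptions the training process never sees a fully noised input, whereas inference begins from pure Gaussian noise. When generation is chained over multiple rounds, as in long-video inference, this mismatch is the kind of discrepancy the summary associates with overexposure.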

Flexible Diffusion Modeling of Long Videos

FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention

Video Diffusion Models

ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising

Progressive Autoregressive Video Diffusion Models

Controllable Longer Image Animation with Diffusion Models

Frame Interpolation with Consecutive Brownian Bridge Diffusion

Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

Efficient and consistent zero-shot video generation with diffusion models

ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

FIFO-Diffusion: Generating Infinite Videos from Text without Training

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

Towards Chunk-Wise Generation for Long Videos

Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion

FlexGen: Flexible Multi-View Generation from Text and Image Inputs

NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation

VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation

Efficient Video Segmentation Models with Per-frame Inference