Abstract: Diffusion-based video generation has received extensive attention and achieved considerable success within both the academic and industrial communities. However, current efforts are mainly concentrated on single-objective or single-task video generation, such as generation driven by text, by image, or by a combination of text and image. This cannot fully meet the needs of real-world application scenarios, as users are likely to input image and text conditions in a flexible manner, either individually or in combination. To address this, we propose a Unified-modal Video Generation system that is capable of handling multiple video generation tasks across text and image modalities. To this end, we revisit the various video generation tasks within our system from the perspective of generative freedom, and classify them into high-freedom and low-freedom video generation categories. For high-freedom video generation, we employ Multi-condition Cross Attention to generate videos that align with the semantics of the input images or text. For low-freedom video generation, we introduce Biased Gaussian Noise to replace the pure random Gaussian Noise, which helps to better preserve the content of the input conditions. Our method achieves the lowest Fréchet Video Distance (FVD) on the public academic benchmark MSR-VTT, surpasses the current open-source methods in human evaluations, and is on par with the current closed-source method Gen2. For more samples, visit
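The abstract names Multi-condition Cross Attention but does not spell out its mechanics. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes the text and image conditions are already embedded as token sequences of the same width, that they are concatenated along the token axis, and that the video latents attend over the combined context. The function name `multi_condition_cross_attention`, the random projection weights (stand-ins for learned parameters), and all shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_condition_cross_attention(latents, text_tokens=None, image_tokens=None, seed=0):
    """Attend from video latents to whichever conditions are present.

    latents:      (n_latent, d) array of video latent tokens
    text_tokens:  (n_text, d) array or None
    image_tokens: (n_image, d) array or None
    Returns (output, attention_weights).
    """
    # Assumption: conditions are concatenated along the token axis,
    # so text-only, image-only, and text+image inputs share one code path.
    conditions = [c for c in (text_tokens, image_tokens) if c is not None]
    if not conditions:
        raise ValueError("at least one condition is required")
    context = np.concatenate(conditions, axis=0)

    d = latents.shape[-1]
    rng = np.random.default_rng(seed)
    # Random matrices stand in for learned query/key/value projections.
    w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    q = latents @ w_q
    k = context @ w_k
    v = context @ w_v
    weights = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (n_latent, n_context)
    return weights @ v, weights

# Usage: the same function handles text-only and text+image conditioning.
rng = np.random.default_rng(1)
latents = rng.standard_normal((6, 8))
text = rng.standard_normal((4, 8))
image = rng.standard_normal((3, 8))
out_text, w_text = multi_condition_cross_attention(latents, text_tokens=text)
out_both, w_both = multi_condition_cross_attention(latents, text, image)
```

Concatenating conditions is only one design option; separate attention branches per modality would be another, and the abstract does not say which the paper uses.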

What problem does this paper attempt to address?

The main goal of this paper is to propose a unified video generation framework (Unified-modal Video Generation, abbreviated as UniVG) that can handle various video generation tasks, including video generation based on text, images, or a combination of text and images. Specifically, the paper aims to address the following key issues:

1. **Flexibility Issue**: Existing video generation methods mainly focus on a single objective or task, such as generating videos based solely on text, images, or a combination of both. This approach lacks the flexibility needed to handle scenarios where users may input text and image conditions in different ways.
2. **Adaptability Issue**: In practical applications, users may not be able to provide the necessary text or image conditions, or the provided text-image pairs may conflict, resulting in poor video quality.
3. **Uniformity Issue**: The paper attempts to build a unified system that can handle both text-based and image-based video generation, thereby meeting the needs of different application scenarios.

To achieve these goals, the paper makes the following contributions:

- **Proposing the UniVG System**: This is a unified framework capable of handling various video generation tasks, including video generation based on text, images, or a combination of both. The system categorizes video generation tasks into high-freedom and low-freedom types and adopts different strategies for each type.
- **Introducing Biased Gaussian Noise**: For low-freedom video generation tasks (such as image animation and video super-resolution), the paper proposes an improved noise model to better preserve the content of the input conditions while maintaining the quality of the generated video.
- **Experimental Validation**: Multiple experiments validate the superior performance of UniVG in terms of objective metrics (such as Fréchet Video Distance, FVD) and subjective evaluations. Notably, compared to current state-of-the-art methods, UniVG excels in video quality and consistency with input conditions.

In summary, this paper aims to develop a more flexible, general, and efficient video generation technology that can adapt to different input conditions and produce high-quality output videos.
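The summary says Biased Gaussian Noise replaces pure Gaussian noise so that low-freedom tasks keep the content of the input condition, but it gives no formula. The sketch below illustrates the general idea under an assumed form: the noise mean is shifted by `bias_weight * (condition - z0)`, so the noised sample drifts toward the condition instead of toward an uninformative Gaussian. The function `forward_diffuse`, the parameter `bias_weight`, and this particular shift are hypothetical choices for illustration and should not be read as the paper's exact definition.

```python
import numpy as np

def forward_diffuse(z0, alpha_bar_t, condition=None, bias_weight=0.0, rng=None):
    """One forward-diffusion sample z_t from a clean latent z0.

    z0:          clean latent (any shape)
    alpha_bar_t: cumulative signal-retention coefficient in (0, 1]
    condition:   latent of the input condition (same shape as z0), or None
    bias_weight: strength of the assumed mean shift; 0 gives standard noise
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(z0.shape)  # pure Gaussian noise, mean 0
    if condition is not None:
        # Assumed bias: shift the noise mean toward the condition.
        eps = eps + bias_weight * (condition - z0)
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Usage: compare standard and biased noising with the same random draw.
z0 = np.zeros((4, 4))
cond = np.ones((4, 4))
standard = forward_diffuse(z0, 0.5, rng=np.random.default_rng(0))
biased = forward_diffuse(z0, 0.5, condition=cond, bias_weight=2.0,
                         rng=np.random.default_rng(0))
```

With `bias_weight=0` the function reduces to the standard forward process, which makes it easy to compare the two behaviors on the same random draw.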

UniVG: Towards UNIfied-modal Video Generation

UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving

Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising

COMUNI: Decomposing Common and Unique Video Signals for Diffusion-based Video Generation

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval

GenRec: Unifying Video Generation and Recognition with Diffusion Models

VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation

Unified Discrete Diffusion for Simultaneous Vision-Language Generation

ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation

VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation

ControlVideo: Training-free Controllable Text-to-Video Generation

TVG: A Training-free Transition Video Generation Method with Diffusion Models

Video Generation Beyond a Single Clip

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Diverse Video Generation from a Single Video

Video Diffusion Models

OmniVid: A Generative Framework for Universal Video Understanding

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Imagen Video: High Definition Video Generation with Diffusion Models