UniVG: Towards UNIfied-modal Video Generation

Ludan Ruan, Lei Tian, Chuanwei Huang, Xu Zhang, Xinyan Xiao
2024-01-17
Abstract: Diffusion-based video generation has received extensive attention and achieved considerable success within both the academic and industrial communities. However, current efforts are mainly concentrated on single-objective or single-task video generation, such as generation driven by text, by image, or by a combination of text and image. This cannot fully meet the needs of real-world application scenarios, as users are likely to input images and text conditions in a flexible manner, either individually or in combination. To address this, we propose a Unified-modal Video Generation system that is capable of handling multiple video generation tasks across text and image modalities. To this end, we revisit the various video generation tasks within our system from the perspective of generative freedom, and classify them into high-freedom and low-freedom video generation categories. For high-freedom video generation, we employ Multi-condition Cross Attention to generate videos that align with the semantics of the input images or text. For low-freedom video generation, we introduce Biased Gaussian Noise to replace the pure random Gaussian Noise, which helps to better preserve the content of the input conditions. Our method achieves the lowest Fréchet Video Distance (FVD) on the public academic benchmark MSR-VTT, surpasses the current open-source methods in human evaluations, and is on par with the current closed-source method Gen2. For more samples, visit
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main goal of this paper is to propose a unified video generation framework (Unified-modal Video Generation, abbreviated as UniVG) that can handle various video generation tasks, including video generation based on text, images, or a combination of text and images. Specifically, the paper aims to address the following key issues:

1. **Flexibility Issue**: Existing video generation methods mainly focus on a single objective or task, such as generating videos based solely on text, images, or a combination of both. This approach lacks the flexibility needed to handle scenarios where users may input text and image conditions in different ways.
2. **Adaptability Issue**: In practical applications, users may not be able to provide the necessary text or image conditions, or the provided text-image pairs may conflict, resulting in poor video quality.
3. **Uniformity Issue**: The paper attempts to build a unified system that can handle both text-based and image-based video generation, thereby meeting the needs of different application scenarios.

To achieve these goals, the paper makes the following contributions:

- **Proposing the UniVG System**: This is a unified framework capable of handling various video generation tasks, including video generation based on text, images, or a combination of both. The system categorizes video generation tasks into high-freedom and low-freedom types and adopts different strategies for different types of tasks.
- **Introducing Biased Gaussian Noise**: For low-freedom video generation tasks (such as image animation and video super-resolution), the paper proposes an improved noise model to better preserve the content of the input conditions while maintaining the quality of the generated video.
- **Experimental Validation**: Multiple experiments validate the superior performance of UniVG in terms of objective metrics (such as Fréchet Video Distance, FVD) and subjective evaluations. Notably, compared to current state-of-the-art methods, it excels in video quality and consistency with input conditions.

In summary, this paper aims to develop a more flexible, general, and efficient video generation technology that can adapt to different input conditions and produce high-quality output videos.
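
The idea of replacing pure Gaussian noise with Biased Gaussian Noise for low-freedom tasks can be illustrated with a small sketch. This is not the paper's formulation, which the summary here does not give. It assumes, purely for illustration, that the bias is a scaled copy of the condition added to the noise mean. The names `forward_noise`, `bias_scale`, and `alpha_bar` are hypothetical.

```python
import math
import random


def forward_noise(z0, cond, alpha_bar, bias_scale, seed):
    """Noise a latent z0, shifting the noise mean toward the condition.

    Assumption (illustrative only): biased noise = eps + bias_scale * cond.
    With bias_scale = 0 this reduces to the usual pure-Gaussian forward step.
    """
    rng = random.Random(seed)
    noised = []
    for z, c in zip(z0, cond):
        eps = rng.gauss(0.0, 1.0)          # pure Gaussian noise
        biased_eps = eps + bias_scale * c  # mean shifted toward the condition
        noised.append(
            math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * biased_eps
        )
    return noised


# Toy flattened latents: the target z0 and the input condition (e.g. an image).
z0 = [1.0] * 6
cond = [2.0] * 6

# Same seed, so the only difference between the two runs is the bias term.
pure = forward_noise(z0, cond, alpha_bar=0.1, bias_scale=0.0, seed=0)
biased = forward_noise(z0, cond, alpha_bar=0.1, bias_scale=0.5, seed=0)
shift = [b - p for b, p in zip(biased, pure)]
```

Under this assumption, every noised value is offset by `sqrt(1 - alpha_bar) * bias_scale * cond`, so even a heavily noised latent still carries a trace of the input condition. That is one way to read the claim that biased noise helps preserve the content of the input.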