Abstract: Although diffusion models have shown a powerful ability to generate photorealistic images, generating realistic and diverse videos remains in its infancy. One key reason is that current methods intertwine spatial content and temporal dynamics, which notably increases the complexity of text-to-video (T2V) generation. In this work, we propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives, i.e., the structure level and the content level. At the structure level, we decompose the T2V task into two steps, spatial reasoning and temporal reasoning, using a unified denoiser. Specifically, we generate spatially coherent priors from text during spatial reasoning and then generate temporally coherent motions from these priors during temporal reasoning. At the content level, we extract two subtle cues from the content of the input video that express motion and appearance changes, respectively. These two cues then guide the model's training for generating videos, enabling flexible content variations and enhancing temporal stability. Through this decoupled paradigm, HiGen can effectively reduce the complexity of the task and generate realistic videos with semantic accuracy and motion stability. Extensive experiments demonstrate the superior performance of HiGen over state-of-the-art T2V methods.

What problem does this paper attempt to address?

The problem this paper attempts to address is: how to generate videos that are both realistic and diverse, particularly in the context of text-to-video generation (T2V). Although existing diffusion models demonstrate strong capabilities in generating realistic images, they still face challenges in producing videos that are both realistic and diverse. A key reason is that current methods intertwine spatial content and temporal dynamics, significantly increasing the complexity of the text-to-video generation task. Therefore, the paper proposes the HiGen method, which aims to improve performance by decoupling the spatial and temporal factors of videos from both structural and content perspectives.

Specifically, the challenges mentioned in the paper include:

1. **Intertwining of spatial and temporal factors**: Existing methods typically mix spatial content and temporal dynamics, increasing the task's complexity.
2. **Quality issues in video generation**: Current T2V methods either perform well in dynamics but have low spatial quality (e.g., ModelScopeT2V) or excel in spatial quality but lack dynamic changes (e.g., Gen-2).
3. **Complex distribution of high-dimensional data**: The high dimensionality and complex distribution of video data make it very difficult to directly generate high-quality videos.

To address these challenges, the HiGen method improves the T2V task in the following ways:

- **Structural decoupling**: The T2V task is decomposed into two steps: spatial reasoning and temporal reasoning, using a unified denoiser. First, spatially consistent priors are generated based on the text, and then temporally consistent motions are generated from these priors.
- **Content decoupling**: Two subtle cues are extracted from the content of the input video, expressing changes in motion and appearance, respectively. These cues are used to guide the model's training, enabling flexible content variation and enhanced temporal stability.

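The two-step structure described under structural decoupling can be illustrated with a toy control-flow sketch. This is not HiGen's implementation: `toy_denoiser`, `denoise_loop`, `generate_video`, and the array `text_cond` (standing in for a text embedding) are hypothetical names, and the "denoiser" is a simple interpolation rather than a learned network. The only point is that the same denoiser function is called in both steps, first to produce a spatial prior from the text condition and then to produce frames from that prior.

```python
import numpy as np

def toy_denoiser(x, cond, step, total_steps):
    """Hypothetical stand-in for a learned denoiser: pulls x toward cond."""
    weight = 1.0 / (total_steps - step)  # reaches 1.0 on the final step
    return x + weight * (cond - x)

def denoise_loop(shape, cond, total_steps, rng):
    """Start from Gaussian noise and apply the shared denoiser repeatedly."""
    x = rng.standard_normal(shape)
    for step in range(total_steps):
        x = toy_denoiser(x, cond, step, total_steps)
    return x

def generate_video(text_cond, num_frames, total_steps=10, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 (spatial reasoning): text condition -> single-image spatial prior.
    spatial_prior = denoise_loop(text_cond.shape, text_cond, total_steps, rng)
    # Step 2 (temporal reasoning): spatial prior -> multi-frame video.
    # The prior broadcasts across the leading frame axis.
    video_shape = (num_frames,) + text_cond.shape
    video = denoise_loop(video_shape, spatial_prior, total_steps, rng)
    return spatial_prior, video

# Assumed toy input: an 8x8 RGB array in place of a real text embedding.
text_cond = np.full((8, 8, 3), 0.5)
prior, video = generate_video(text_cond, num_frames=4)
print(prior.shape, video.shape)
```

Because the toy denoiser only interpolates toward its condition, every generated frame ends up equal to the prior. A real temporal-reasoning step would add motion on top of the prior.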
Through this hierarchical decoupling approach, HiGen effectively reduces the task's complexity and generates realistic videos with semantic accuracy and motion stability. Experimental results show that HiGen outperforms existing T2V methods on multiple metrics.
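As a rough illustration of the content-level cues, the sketch below computes one scalar for motion and one for appearance change from a video array. The summary does not say how HiGen computes its cues, so both definitions are assumptions chosen for simplicity: `motion_cue` is the mean absolute difference between adjacent frames, and `appearance_cue` is the mean cosine similarity between a crude per-frame feature (per-channel means) of the first frame and of each later frame. A real system would likely use a learned feature extractor instead.

```python
import numpy as np

def motion_cue(video):
    """Assumed motion proxy: mean absolute difference of adjacent frames."""
    video = np.asarray(video, dtype=np.float64)
    return float(np.abs(video[1:] - video[:-1]).mean())

def appearance_cue(video):
    """Assumed appearance proxy: mean cosine similarity between the first
    frame's features and each later frame's features (1.0 = unchanged)."""
    video = np.asarray(video, dtype=np.float64)
    num_frames, channels = video.shape[0], video.shape[-1]
    # Crude per-frame feature: the mean value of each channel, shape (T, C).
    feats = video.reshape(num_frames, -1, channels).mean(axis=1)
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return float((feats[1:] @ feats[0]).mean())

# Toy data with layout (frames, height, width, channels).
rng = np.random.default_rng(0)
frame = rng.random((8, 8, 3))
static = np.repeat(frame[None], 4, axis=0)              # no change at all
shifted = static + 0.1 * np.arange(4).reshape(4, 1, 1, 1)  # brightness drift

print(motion_cue(static), appearance_cue(static))
print(motion_cue(shifted), appearance_cue(shifted))
```

A static clip gives a motion cue of zero and an appearance cue of about one, while the drifting clip gives a positive motion cue. Values like these could then be supplied as conditioning signals during training, which is the role the summary assigns to the two cues.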

Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation

Decoupled Video Generation with Chain of Training-free Diffusion Model Experts

Latent Video Diffusion Models for High-Fidelity Long Video Generation

VideoTetris: Towards Compositional Text-to-Video Generation

ED-T2V: An Efficient Training Framework for Diffusion-based Text-to-Video Generation

VEnhancer: Generative Space-Time Enhancement for Video Generation

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Compositional Video Generation as Flow Equalization

I4VGen: Image as Free Stepping Stone for Text-to-Video Generation

Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model

Dual-Stream Diffusion Net for Text-to-Video Generation

LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation

Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

TIV-Diffusion: Towards Object-Centric Movement for Text-driven Image to Video Generation

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

GVDIFF: Grounded Text-to-Video Generation with Diffusion Models

Imagen Video: High Definition Video Generation with Diffusion Models