Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang, Changxin Gao, Nong Sang
2023-12-08
Abstract: Despite diffusion models having shown powerful abilities to generate photorealistic images, generating videos that are realistic and diverse still remains in its infancy. One of the key reasons is that current methods intertwine spatial content and temporal dynamics together, leading to a notably increased complexity of text-to-video generation (T2V). In this work, we propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives, i.e., structure level and content level. At the structure level, we decompose the T2V task into two steps, including spatial reasoning and temporal reasoning, using a unified denoiser. Specifically, we generate spatially coherent priors using text during spatial reasoning and then generate temporally coherent motions from these priors during temporal reasoning. At the content level, we extract two subtle cues from the content of the input video that can express motion and appearance changes, respectively. These two cues then guide the model's training for generating videos, enabling flexible content variations and enhancing temporal stability. Through the decoupled paradigm, HiGen can effectively reduce the complexity of this task and generate realistic videos with semantic accuracy and motion stability. Extensive experiments demonstrate the superior performance of HiGen over the state-of-the-art T2V methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to address is: how to generate videos that are both realistic and diverse, particularly in the context of text-to-video generation (T2V). Although existing diffusion models demonstrate strong capabilities in generating realistic images, they still face challenges in producing videos that are both realistic and diverse. A key reason is that current methods intertwine spatial content and temporal dynamics, significantly increasing the complexity of the text-to-video generation task. Therefore, the paper proposes the HiGen method, which aims to improve performance by decoupling the spatial and temporal factors of videos from both structural and content perspectives.

Specifically, the challenges mentioned in the paper include:

1. **Intertwining of spatial and temporal factors**: Existing methods typically mix spatial content and temporal dynamics, increasing the task's complexity.
2. **Quality issues in video generation**: Current T2V methods either perform well in dynamics but have low spatial quality (e.g., ModelScopeT2V) or excel in spatial quality but lack dynamic changes (e.g., Gen-2).
3. **Complex distribution of high-dimensional data**: The high dimensionality and complex distribution of video data make it very difficult to directly generate high-quality videos.

To address these challenges, the HiGen method improves the T2V task in the following ways:

- **Structural decoupling**: The T2V task is decomposed into two steps: spatial reasoning and temporal reasoning, using a unified denoiser. First, spatially consistent priors are generated based on the text, and then temporally consistent motions are generated from these priors.
- **Content decoupling**: Two subtle cues are extracted from the content of the input video, expressing changes in motion and appearance, respectively. These cues are used to guide the model's training, enabling flexible content variation and enhanced temporal stability.

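
To make the two-step structure concrete, the following is a minimal toy sketch of spatial reasoning followed by temporal reasoning with a single shared denoiser. It is not HiGen's implementation: `toy_denoise_step`, the list-of-floats "latents", and the halfway-averaging update are all hypothetical stand-ins chosen only to show the control flow, in which one function is called first on a single frame conditioned on text and then on all frames conditioned on the resulting spatial prior.

```python
import random
from typing import List, Optional

Latent = List[float]  # toy flattened latent for one frame (assumption)


def toy_denoise_step(frames: List[Latent], text: Latent,
                     prior: Optional[Latent] = None) -> List[Latent]:
    """Hypothetical stand-in for one call to a shared (unified) denoiser.

    Pulls every frame halfway toward a target: the text embedding when no
    prior is given (spatial reasoning), or the spatial prior otherwise
    (temporal reasoning). A real denoiser is a learned network.
    """
    target = text if prior is None else prior
    return [[0.5 * x + 0.5 * t for x, t in zip(frame, target)]
            for frame in frames]


def spatial_reasoning(text: Latent, steps: int, rng: random.Random) -> Latent:
    """Step 1: denoise a single noisy frame using text only."""
    frames = [[rng.gauss(0.0, 1.0) for _ in text]]
    for _ in range(steps):
        frames = toy_denoise_step(frames, text)
    return frames[0]  # the spatial prior


def temporal_reasoning(prior: Latent, text: Latent, num_frames: int,
                       steps: int, rng: random.Random) -> List[Latent]:
    """Step 2: denoise all frames, conditioned on the spatial prior."""
    frames = [[rng.gauss(0.0, 1.0) for _ in prior] for _ in range(num_frames)]
    for _ in range(steps):
        frames = toy_denoise_step(frames, text, prior=prior)
    return frames


def generate_video(text: Latent, num_frames: int = 4, steps: int = 20,
                   seed: int = 0) -> List[Latent]:
    """Run both steps with the same denoiser function."""
    rng = random.Random(seed)
    prior = spatial_reasoning(text, steps, rng)
    return temporal_reasoning(prior, text, num_frames, steps, rng)


video = generate_video([1.0, -1.0, 0.5])
print(len(video), "frames of size", len(video[0]))
```

The point of the sketch is that the second stage starts from a prior that already fixes the spatial content, so it only has to resolve how frames relate over time.
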
Through this hierarchical decoupling approach, HiGen effectively reduces the task's complexity and generates realistic videos with semantic accuracy and motion stability. Experimental results show that HiGen outperforms existing T2V methods on multiple metrics.
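
The summary above does not say how the motion and appearance cues are computed, so the sketch below uses assumed proxies purely for illustration: the mean absolute difference between consecutive frames as a motion cue, and the cosine similarity of each frame to the first frame as an appearance cue. The function names and the toy list-of-floats frames are hypothetical and are not taken from the paper.

```python
import math
from typing import List

Frame = List[float]  # toy flattened frame: pixels or features (assumption)


def motion_cue(frames: List[Frame]) -> float:
    """Assumed proxy: mean absolute difference between consecutive frames."""
    if len(frames) < 2:
        return 0.0  # a single frame carries no motion
    diffs = [
        sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        for prev, cur in zip(frames, frames[1:])
    ]
    return sum(diffs) / len(diffs)


def cosine(a: Frame, b: Frame) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def appearance_cue(frames: List[Frame]) -> List[float]:
    """Assumed proxy: similarity of every frame to the first frame."""
    return [cosine(frames[0], frame) for frame in frames]


static_clip = [[1.0, 2.0, 3.0]] * 3        # nothing changes
moving_clip = [[1.0, 0.0], [0.0, 1.0]]     # content shifts between frames
print(motion_cue(static_clip), motion_cue(moving_clip))
print(appearance_cue(moving_clip))
```

Under these assumptions, a low motion cue with high appearance similarity describes a near-static clip, while the opposite describes a clip whose content changes substantially. Scalars of this kind could serve as conditioning signals during training, which is the role the summary attributes to the two cues.
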