Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning

Penghui Ruan, Pichao Wang, Divya Saxena, Jiannong Cao, Yuhui Shi
2024-11-01
Abstract: Despite advancements in Text-to-Video (T2V) generation, producing videos with realistic motion remains challenging. Current models often yield static or minimally dynamic outputs, failing to capture complex motions described by text. This issue stems from the internal biases in text encoding, which overlooks motions, and inadequate conditioning mechanisms in T2V generation models. To address this, we propose a novel framework called DEcomposed MOtion (DEMO), which enhances motion synthesis in T2V generation by decomposing both text encoding and conditioning into content and motion components. Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, alongside separate content and motion conditioning mechanisms. Crucially, we introduce text-motion and video-motion supervision to improve the model's understanding and generation of motion. Evaluations on benchmarks such as MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench demonstrate DEMO's superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality. Our approach significantly advances T2V generation by integrating comprehensive motion understanding directly from textual descriptions. Project page: https://PR-Ryan.github.io/DEMO-project/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of generating videos with realistic, complex motion in the Text-to-Video (T2V) task. Current T2V models typically produce static or minimally dynamic outputs that fail to capture the complex motions described in the text. The issue stems mainly from two challenges:

1. **Insufficient motion representation in text encoding**: Existing T2V models use large-scale vision-language models (such as CLIP) as text encoders. These models capture static elements and spatial relationships well but encode dynamic motion poorly, mainly because their training emphasizes recognizing nouns and objects, so verbs and actions are represented less accurately.
2. **Relying solely on spatial text conditions**: Existing models usually inject text information frame by frame through a spatial cross-attention mechanism. This works well for generating high-quality static images, but motion in video spans both time and space, so spatial conditioning alone is insufficient to generate videos with realistic motion dynamics.

To address these issues, the authors propose a new framework named DEcomposed MOtion (DEMO), which enhances motion synthesis by decomposing both text encoding and conditioning into content and motion components. Specifically, DEMO includes a content encoder for static elements, a motion encoder for temporal dynamics, and separate content and motion conditioning mechanisms. In addition, DEMO introduces text-motion supervision and video-motion supervision to improve the model's understanding and generation of motion.
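
The second challenge above can be illustrated with a toy sketch. This is not the DEMO implementation or any particular model's code: it is a minimal, pure-Python single-head cross-attention with no learned projections, and the names `cross_attention` and `spatial_text_conditioning` as well as the toy data are assumptions made for illustration. It shows that when text is injected frame by frame, each frame's output depends only on that frame and the text, so the conditioning itself carries no information about how frames relate over time.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Scaled dot-product attention (hypothetical helper, no learned weights).
    # queries: spatial tokens of ONE frame; keys/values: text tokens.
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

def spatial_text_conditioning(video, text_tokens):
    # Frame-by-frame conditioning: the same text attends to each frame
    # independently, with no interaction across the time axis.
    return [cross_attention(frame, text_tokens, text_tokens) for frame in video]

# Toy data (assumed): 2 text tokens of dimension 2, 3 frames of 3 spatial tokens.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
frame_a = [[0.5, 0.1], [0.2, 0.9], [0.3, 0.3]]
frame_b = [[0.9, 0.0], [0.1, 0.4], [0.6, 0.6]]

video = [frame_a, frame_a, frame_b]
conditioned = spatial_text_conditioning(video, text_tokens)

# Replacing the last frame leaves the first frame's output untouched,
# because no information flows between frames in this conditioning step.
video_changed = [frame_a, frame_a, frame_a]
conditioned_changed = spatial_text_conditioning(video_changed, text_tokens)
```

In this sketch, `conditioned[0]` and `conditioned[1]` are identical because the two frames are identical, and `conditioned[0]` equals `conditioned_changed[0]` even though the last frame differs. A motion-aware conditioning mechanism would instead need the text signal to interact with features along the temporal axis, which is the gap the decomposed motion conditioning described above is meant to fill.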