Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning

Penghui Ruan, Pichao Wang, Divya Saxena, Jiannong Cao, Yuhui Shi

2024-11-01

Abstract: Despite advancements in Text-to-Video (T2V) generation, producing videos with realistic motion remains challenging. Current models often yield static or minimally dynamic outputs, failing to capture complex motions described by text. This issue stems from the internal biases in text encoding, which overlooks motions, and inadequate conditioning mechanisms in T2V generation models. To address this, we propose a novel framework called DEcomposed MOtion (DEMO), which enhances motion synthesis in T2V generation by decomposing both text encoding and conditioning into content and motion components. Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, alongside separate content and motion conditioning mechanisms. Crucially, we introduce text-motion and video-motion supervision to improve the model's understanding and generation of motion. Evaluations on benchmarks such as MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench demonstrate DEMO's superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality. Our approach significantly advances T2V generation by integrating comprehensive motion understanding directly from textual descriptions. Project page: https://PR-Ryan.github.io/DEMO-project/

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the problem of generating videos with realistic, complex motion in the Text-to-Video (T2V) task. Current T2V models typically produce static or minimally dynamic outputs and fail to capture the complex motions described in the text. This issue stems mainly from two challenges:

1. **Insufficient motion representation in text encoding**: Existing T2V models use large-scale vision-language models (such as CLIP) as text encoders. These models are very effective at capturing static elements and spatial relationships but perform poorly at encoding dynamic motion, mainly because their training focuses on recognizing nouns and objects, so verbs and actions are represented less accurately.
2. **Reliance on spatial-only text conditioning**: Existing models usually inject text information into video generation frame by frame through a spatial cross-attention mechanism. This works very well for generating high-quality static images, but in video, motion is a key component that spans both the temporal and spatial dimensions, so spatial conditioning alone is insufficient for generating realistic motion dynamics.

To address these issues, the authors propose a new framework named DEcomposed MOtion (DEMO), which enhances motion synthesis by decomposing both text encoding and conditioning into content and motion components. Specifically, DEMO includes a content encoder for static elements and a motion encoder for temporal dynamics, each paired with its own conditioning mechanism (content conditioning and motion conditioning). Additionally, DEMO introduces text-motion supervision and video-motion supervision to improve the model's understanding and generation of motion.
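To make the second challenge concrete, here is a minimal sketch of frame-by-frame spatial cross-attention. It is not DEMO's implementation or any particular model's code: the function names, the toy numbers, and the simplification of using the raw text tokens as both keys and values (no learned projections, single head) are all assumptions made for illustration. It only shows that when the same text tokens are applied to each frame in isolation, the text-conditioning path itself carries no information about how frames relate over time.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, text_tokens):
    """Single-head cross-attention: each query attends over the text tokens.

    queries:     list of N vectors (one per spatial location), each of dim d
    text_tokens: list of L vectors (text embeddings), each of dim d
    Assumption: keys and values are the raw text tokens (no learned projections).
    """
    d = len(text_tokens[0])
    out = []
    for q in queries:
        # Scaled dot-product score between this location and every text token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in text_tokens]
        weights = softmax(scores)
        # Output is the attention-weighted average of the text tokens.
        out.append([sum(w * v[j] for w, v in zip(weights, text_tokens))
                    for j in range(d)])
    return out

def framewise_spatial_conditioning(video, text_tokens):
    """Apply the same text cross-attention to every frame independently.

    video: list of T frames; each frame is a list of N spatial feature vectors.
    """
    return [cross_attention(frame, text_tokens) for frame in video]

# Toy inputs (made-up numbers): 2 text tokens, 3 frames, 2 locations per frame.
text = [[1.0, 0.0], [0.0, 1.0]]
frame = [[0.5, -0.2], [0.1, 0.9]]
static_video = [frame, frame, frame]

conditioned = framewise_spatial_conditioning(static_video, text)

# Each frame is processed in isolation, so identical input frames give
# identical outputs: the text condition adds no frame-to-frame variation.
frames_identical = all(f == conditioned[0] for f in conditioned)
```

In a real T2V model, frames differ through noise and through separate temporal layers, so outputs are not literally static. The point of the sketch is narrower: a condition injected this way cannot, on its own, tell frame 3 to differ from frame 2, which is consistent with the summary's argument that motion needs conditioning that spans time as well as space.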

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Motion Control for Enhanced Complex Action Video Generation

Motion Prompting: Controlling Video Generation with Motion Trajectories

Text-Animator: Controllable Visual Text Video Generation

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Animate Your Motion: Turning Still Images into Dynamic Videos

MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models

MotionBooth: Motion-Aware Customized Text-to-Video Generation

BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way

Enhancing Motion Variation in Text-to-Motion Models via Pose and Video Conditioned Editing

VideoTetris: Towards Compositional Text-to-Video Generation

T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design

Textual Decomposition Then Sub-motion-space Scattering for Open-Vocabulary Motion Generation

Searching Priors Makes Text-to-Video Synthesis Better

SAVE: Protagonist Diversification with Structure Agnostic Video Editing

Enhanced Fine-Grained Motion Diffusion for Text-Driven Human Motion Synthesis

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Text-driven Video Prediction