Abstract: Zero-shot Text-to-Video synthesis generates videos based on prompts without any videos. Without motion information from videos, motion priors implied in prompts are vital guidance. For example, the prompt "airplane landing on the runway" indicates motion priors that the "airplane" moves downwards while the "runway" stays static. However, previous approaches do not fully exploit these motion priors, which leads to two nontrivial issues: 1) the motion variation pattern remains unaltered and prompt-agnostic because motion priors are disregarded; 2) the motion control of different objects is inaccurate and entangled because the independent motion priors of different objects are not considered. To tackle the two issues, we propose a prompt-adaptive and disentangled motion control strategy coined as MotionZero, which uses Large Language Models to derive the motion priors of different objects from prompts and accordingly applies motion control of different objects to the corresponding regions in a disentangled manner. Furthermore, to facilitate videos with varying degrees of motion amplitude, we propose a Motion-Aware Attention scheme which adjusts attention among frames by motion amplitude. Extensive experiments demonstrate that our strategy can correctly control the motion of different objects and supports versatile applications including zero-shot video editing.

What problem does this paper attempt to address?

The paper addresses two problems in zero-shot text-to-video generation:

1. **Fixed, prompt-agnostic motion patterns**: Existing zero-shot text-to-video methods do not fully exploit the motion priors implied by the prompt, so the motion pattern in the generated video stays the same no matter what the prompt describes. For example, given the prompt "The plane lands on the runway", a method that ignores the motion prior may wrongly generate a plane taking off.
2. **Inaccurate and entangled motion control across objects**: Existing methods often apply a single global strategy to all objects and cannot distinguish dynamic parts from static ones, so the motion of one object is controlled inaccurately and bleeds into that of others. For example, in a video containing multiple characters, the motions of the individual characters may not be independently controllable, which makes the overall result look unnatural.

To address these two challenges, the authors propose a prompt-adaptive, disentangled motion control strategy named **MotionZero**. It has the following key components:

- **Motion prior extraction**: A large language model (LLM) extracts the motion direction of each object from the text prompt. For verbs with an explicit direction (such as "ski"), the direction is inferred directly through LLM reasoning; for verbs without an explicit direction (such as "walk"), the first frame is generated and its visual information is combined with the prompt to infer a more accurate direction.
- **Disentangled motion control**: Motion control is applied to each object separately in feature space, according to the extracted motion priors. A segmentation model locates the initial object positions, and in subsequent frames the features are warped according to each object's motion direction and position, so that the motions of different objects stay independent and natural.
- **Motion-Aware Attention**: To accommodate videos with different motion amplitudes, this mechanism adjusts the attention among frames according to the motion amplitude, so that dynamic scenes change naturally while static scenes remain coherent.

With these components, MotionZero can generate videos under zero-shot conditions that follow the text prompt closely and exhibit natural motion. It also supports several applications, including zero-shot video editing and human motion control.
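The disentangled control step can be illustrated with a toy NumPy sketch. This is not the authors' implementation: it assumes the per-object directions and masks are already available (hand-written here, standing in for the LLM and the segmentation model), and it replaces feature warping with a plain integer translation of each object's masked features. The names `DIRECTIONS`, `shift_2d`, and `warp_objects` are hypothetical, and vacated cells are simply zeroed rather than filled in.

```python
import numpy as np

# Hypothetical direction vocabulary: (dy, dx) in feature-map cells per frame.
DIRECTIONS = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
    "static": (0, 0),
}


def shift_2d(arr, dy, dx):
    """Translate the last two axes of arr by (dy, dx); vacated cells become 0/False."""
    out = np.zeros_like(arr)
    h, w = arr.shape[-2:]
    if abs(dy) >= h or abs(dx) >= w:
        return out  # shifted completely out of view
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    out[..., dst_y, dst_x] = arr[..., src_y, src_x]
    return out


def warp_objects(features, masks, directions, frame_idx, step=1):
    """Move each object's features independently; static objects are left untouched.

    features:   (C, H, W) float array, features of the first frame
    masks:      dict name -> (H, W) bool array, one region per object
    directions: dict name -> key of DIRECTIONS
    """
    out = features.copy()
    moving = {n: m for n, m in masks.items() if DIRECTIONS[directions[n]] != (0, 0)}
    # Clear the original footprint of every moving object (simplification: zeros).
    for mask in moving.values():
        out[:, mask] = 0.0
    # Paste each moving object at its displaced position for this frame.
    for name, mask in moving.items():
        dy, dx = DIRECTIONS[directions[name]]
        oy, ox = dy * step * frame_idx, dx * step * frame_idx
        new_mask = shift_2d(mask, oy, ox)
        moved = shift_2d(features * mask, oy, ox)
        out[:, new_mask] = moved[:, new_mask]
    return out


# Toy version of "airplane landing on the runway": airplane moves down, runway stays.
features = np.arange(32, dtype=float).reshape(2, 4, 4) + 1.0
airplane = np.zeros((4, 4), dtype=bool)
airplane[0, 1] = True
runway = np.zeros((4, 4), dtype=bool)
runway[3, :] = True
masks = {"airplane": airplane, "runway": runway}
directions = {"airplane": "down", "runway": "static"}

frame_2 = warp_objects(features, masks, directions, frame_idx=2)
```

In this toy case the airplane's feature vector ends up two rows lower in `frame_2`, while the runway row is identical to the first frame, which is the kind of per-object independence the summary describes.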

MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation

Motion Prompting: Controlling Video Generation with Motion Trajectories

Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation

FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax

Motion Control for Enhanced Complex Action Video Generation

Searching Priors Makes Text-to-Video Synthesis Better

DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control

AnimateZero: Video Diffusion Models are Zero-Shot Image Animators

MotionBooth: Motion-Aware Customized Text-to-Video Generation

Zero-Shot Video Editing through Adaptive Sliding Score Distillation

Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models

COMD: Training-free Video Motion Transfer with Camera-Object Motion Disentanglement

MotionCraft: Physics-based Zero-Shot Video Generation

Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion

Prompt-based Zero-shot Video Moment Retrieval

Zero-shot High-fidelity and Pose-controllable Character Animation

ViMo: Generating Motions from Casual Videos

Human Motion Transfer from Poses in the Wild