MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation

Sitong Su, Litao Guo, Lianli Gao, Hengtao Shen, Jingkuan Song
DOI: https://doi.org/10.48550/arXiv.2311.16635
2023-11-28
Abstract: Zero-shot Text-to-Video synthesis generates videos based on prompts without any videos. Without motion information from videos, motion priors implied in prompts are vital guidance. For example, the prompt "airplane landing on the runway" indicates motion priors that the "airplane" moves downwards while the "runway" stays static. Whereas the motion priors are not fully exploited in previous approaches, thus leading to two nontrivial issues: 1) the motion variation pattern remains unaltered and prompt-agnostic for disregarding motion priors; 2) the motion control of different objects is inaccurate and entangled without considering the independent motion priors of different objects. To tackle the two issues, we propose a prompt-adaptive and disentangled motion control strategy coined as MotionZero, which derives motion priors from prompts of different objects by Large-Language-Models and accordingly applies motion control of different objects to corresponding regions in disentanglement. Furthermore, to facilitate videos with varying degrees of motion amplitude, we propose a Motion-Aware Attention scheme which adjusts attention among frames by motion amplitude. Extensive experiments demonstrate that our strategy could correctly control motion of different objects and support versatile applications including zero-shot video edit.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two main problems:

1. **Fixed, prompt-agnostic motion patterns**: Existing zero-shot text-to-video generation methods do not fully use the motion priors implied by the prompt, so objects in the generated videos follow a fixed motion pattern that does not change when the text prompt changes. For example, given the prompt "The plane lands on the runway", a method that ignores the motion prior may wrongly generate a plane taking off.
2. **Inaccurate, entangled motion control of different objects**: Existing methods often apply one global strategy to every object and cannot distinguish dynamic parts from static ones, so the motion of each object is controlled imprecisely and becomes entangled with that of other objects. For example, in a video containing multiple characters, the characters' motions may not be independently controllable, which makes the overall result look unnatural.

To address these two challenges, the authors propose a prompt-adaptive, disentangled motion control strategy named **MotionZero**. It has the following key components:

- **Motion prior extraction**: A large language model (LLM) extracts the motion direction of each object from the text prompt. For verbs with an explicit direction (such as "ski"), the direction is inferred directly through LLM reasoning; for verbs without an explicit direction (such as "walk"), the first frame is generated and its visual information is combined with the prompt to infer a more accurate direction.
- **Disentangled motion control**: In feature space, motion control is applied to each object according to its extracted motion prior. A segmentation model locates each object's initial position, and in subsequent frames the features are warped according to the motion direction and position so that different objects move independently and naturally.
- **Motion-Aware Attention**: To handle videos with different motion amplitudes, this mechanism adjusts inter-frame attention according to the motion amplitude, allowing natural motion changes in dynamic scenes while keeping static scenes coherent.

With these components, MotionZero generates videos under zero-shot conditions that are consistent with the text prompt and exhibit natural motion, and it supports several applications, including zero-shot video editing and human motion control.
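
The region-wise feature warping described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names `warp_object` and `make_frames`, the 2D grid standing in for a feature map, the integer per-frame step, and the `fill` value used for vacated cells are all assumptions made for illustration. It only shows the idea that an object mask plus a motion direction moves one region across frames while everything outside the mask stays static.

```python
from typing import List, Tuple

Grid = List[List[float]]
Mask = List[List[bool]]


def warp_object(features: Grid, mask: Mask, direction: Tuple[int, int],
                step: int, fill: float = 0.0) -> Grid:
    """Shift the masked region by step * direction; leave other cells untouched.

    direction is (dy, dx) in grid cells. Vacated cells receive `fill`
    (an assumed placeholder for background features).
    """
    h, w = len(features), len(features[0])
    dy, dx = direction[0] * step, direction[1] * step
    out = [row[:] for row in features]

    # First pass: vacate the object's original cells.
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                out[y][x] = fill

    # Second pass: paste the object's features at the shifted location,
    # dropping any cell that would leave the grid.
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    out[ny][nx] = features[y][x]
    return out


def make_frames(features: Grid, mask: Mask, direction: Tuple[int, int],
                num_frames: int, fill: float = 0.0) -> List[Grid]:
    """Build one warped feature grid per frame; frame t moves the object t steps."""
    return [warp_object(features, mask, direction, t, fill)
            for t in range(num_frames)]


# Toy scene for "airplane landing on the runway":
# the airplane (9.0) starts in the top row, the runway (1.0) is the bottom row.
features = [
    [0.0, 9.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
# Mask selects only the airplane; in the paper a segmentation model provides it.
mask = [[(y == 0 and x == 1) for x in range(4)] for y in range(4)]

# Direction (1, 0) means "downwards", the prior an LLM would infer from the prompt.
frames = make_frames(features, mask, direction=(1, 0), num_frames=3)
```

In this toy scene the airplane cell moves down one row per frame while the runway row is identical in every frame, which is the disentanglement the summary describes: motion is applied only inside the object's region.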