LivePhoto: Real Image Animation with Text-guided Motion Control

Xi Chen, Zhiheng Liu, Mengting Chen, Yutong Feng, Yu Liu, Yujun Shen, Hengshuang Zhao

2023-12-06

Abstract: Despite the recent progress in text-to-video generation, existing studies usually overlook the issue that only spatial contents but not temporal motions in synthesized videos are under the control of text. Towards such a challenge, this work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions. We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as a further input. We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions. In particular, considering the facts that (1) text can only describe motions roughly (e.g., regardless of the moving speed) and (2) text may include both content and motion descriptions, we introduce a motion intensity estimation module as well as a text re-weighting module to reduce the ambiguity of text-to-motion mapping. Empirical evidence suggests that our approach is capable of well decoding motion-related textual instructions into videos, such as actions, camera movements, or even conjuring new contents from thin air (e.g., pouring water into an empty glass). Interestingly, thanks to the proposed intensity learning mechanism, our system offers users an additional control signal (i.e., the motion intensity) besides text for video customization.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses a gap in existing text-to-video generation: text typically controls only the spatial content of the synthesized video, not its temporal motion (such as actions or camera movements). To close this gap, the paper proposes a practical system called LivePhoto, which lets users animate an image of interest with a text description. LivePhoto is built in three steps:

1. **Establishing a strong baseline**: A pretrained text-to-image generator (Stable Diffusion) is extended to accept an image as an additional input.
2. **Introducing a motion module**: A motion module is added to the improved generator for temporal modeling, and a carefully designed training pipeline is used to better link text with motion.
3. **Reducing the ambiguity of text-to-motion mapping**: Text can describe motion only roughly (for example, it usually says nothing about speed) and may mix content descriptions with motion descriptions. A motion intensity estimation module and a text re-weighting module are introduced to reduce this ambiguity.

Experiments show that LivePhoto can decode motion-related text instructions into videos, including actions, camera movements, and even new content that is absent from the input image (e.g., pouring water into an empty glass). In addition, the motion intensity learning mechanism gives users an extra control signal (the motion intensity) alongside text when customizing videos.
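To make the idea of a motion intensity signal concrete, here is a minimal sketch of how a clip could be reduced to a single discrete intensity level. The summary above does not say how LivePhoto measures intensity, so everything here is an assumption made for illustration: the use of mean absolute pixel difference between adjacent frames, the names `mean_abs_diff` and `motion_intensity_level`, and the parameters `num_levels` and `max_diff` are all hypothetical.

```python
from typing import Sequence

# A grayscale frame as rows of pixel values in [0, 1] (illustrative format).
Frame = Sequence[Sequence[float]]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference between two same-sized frames."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count


def motion_intensity_level(
    frames: Sequence[Frame],
    num_levels: int = 10,   # assumed number of discrete levels
    max_diff: float = 0.5,  # assumed difference that saturates the scale
) -> int:
    """Map the average change between adjacent frames to a level in 1..num_levels."""
    if len(frames) < 2:
        return 1  # a single frame carries no motion
    diffs = [mean_abs_diff(f0, f1) for f0, f1 in zip(frames, frames[1:])]
    avg = sum(diffs) / len(diffs)
    ratio = min(avg / max_diff, 1.0)  # clamp to [0, 1]
    return 1 + min(int(ratio * num_levels), num_levels - 1)


if __name__ == "__main__":
    still = [[0.0, 0.0], [0.0, 0.0]]
    bright = [[1.0, 1.0], [1.0, 1.0]]
    print(motion_intensity_level([still, still]))   # no change between frames
    print(motion_intensity_level([still, bright]))  # maximal change between frames
```

In a setup like this, the level would be computed from each training clip and supplied to the generator as an extra condition, and a user could then choose a level directly at inference time. This is one plausible reading of the "additional control signal" described above, not a description of the authors' implementation.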

Text-driven Visual Prosody Generation for Embodied Conversational Agents

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

MotionBooth: Motion-Aware Customized Text-to-Video Generation

Text-Animator: Controllable Visual Text Video Generation

Animate124: Animating One Image to 4D Dynamic Scene

Motion Prompting: Controlling Video Generation with Motion Trajectories

DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

Animate Your Motion: Turning Still Images into Dynamic Videos

Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion

Motion Control for Enhanced Complex Action Video Generation

LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models

MotionClone: Training-Free Motion Cloning for Controllable Video Generation

Text2Performer: Text-Driven Human Video Generation

Enhanced Fine-Grained Motion Diffusion for Text-Driven Human Motion Synthesis

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

Controllable Longer Image Animation with Diffusion Models