Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, Hongsheng Li

2024-01-31

Abstract: We introduce Motion-I2V, a novel framework for consistent and controllable image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the reference image's pixels. For the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. This module can effectively propagate the reference image's features to synthesized frames with the guidance of predicted trajectories from the first stage. Compared with existing methods, Motion-I2V can generate more consistent videos even in the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V lets users precisely control motion trajectories and motion regions with sparse trajectory and region annotations. This offers more controllability over the I2V process than relying solely on textual instructions. Additionally, Motion-I2V's second stage naturally supports zero-shot video-to-video translation. Both qualitative and quantitative comparisons demonstrate the advantages of Motion-I2V over prior approaches in consistent and controllable image-to-video generation. Please see our project page at

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper introduces Motion-I2V, a framework for generating consistent and controllable videos from still images. Unlike previous methods that directly learn the complex image-to-video mapping, Motion-I2V decomposes the task into two stages with explicit motion modeling. In the first stage, a diffusion-based motion field predictor infers the trajectories of the reference image's pixels. In the second stage, motion-augmented temporal attention strengthens the limited 1-D temporal attention of video latent diffusion models, propagating the reference image's features to the synthesized frames under the guidance of the trajectories predicted in the first stage. Compared with existing methods, Motion-I2V generates more consistent videos even under large motion and viewpoint changes. By training a sparse trajectory ControlNet for the first stage, it also lets users precisely control motion trajectories and motion regions through sparse trajectory and region annotations. Additionally, the second stage naturally supports zero-shot video-to-video translation. Qualitative and quantitative comparisons show that Motion-I2V outperforms prior approaches in consistency and controllability.
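The second-stage idea described above, propagating reference-image features along predicted trajectories and letting temporal attention see them, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes the first stage yields, for every frame, an integer backward displacement field saying where each target location should read from in the reference feature map. It also assumes identity query/key/value projections and nearest-neighbour sampling. All function names (`warp_reference`, `attend`, `motion_guided_temporal_attention`) are hypothetical.

```python
import math


def warp_reference(ref_feat, backward_flow):
    """Sample reference features where a displacement field points.

    ref_feat: H x W grid of feature vectors (lists of floats).
    backward_flow: H x W grid of integer (dy, dx) offsets (an assumed format);
    target location (y, x) reads reference location (y + dy, x + dx),
    clamped to the grid borders.
    """
    h, w = len(ref_feat), len(ref_feat[0])
    warped = []
    for y in range(h):
        row = []
        for x in range(w):
            dy, dx = backward_flow[y][x]
            sy = min(max(y + dy, 0), h - 1)
            sx = min(max(x + dx, 0), w - 1)
            row.append(ref_feat[sy][sx])
        warped.append(row)
    return warped


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [wt / total for wt in weights]
    dim = len(values[0])
    return [sum(wt * v[i] for wt, v in zip(weights, values)) for i in range(dim)]


def motion_guided_temporal_attention(frame_feats, ref_feat, backward_flows):
    """1-D temporal attention with one extra, motion-aligned reference token.

    frame_feats: T x H x W grid of feature vectors for the synthesized frames.
    backward_flows: one displacement field per frame (see warp_reference).
    Each location attends over the same location in all T frames plus the
    reference feature warped to that location for the current frame.
    """
    t = len(frame_feats)
    h, w = len(ref_feat), len(ref_feat[0])
    out = []
    for ti in range(t):
        warped = warp_reference(ref_feat, backward_flows[ti])
        frame_out = []
        for y in range(h):
            row = []
            for x in range(w):
                temporal = [frame_feats[k][y][x] for k in range(t)]
                tokens = [warped[y][x]] + temporal  # keys and values
                row.append(attend(frame_feats[ti][y][x], tokens, tokens))
            frame_out.append(row)
        out.append(frame_out)
    return out


# Tiny usage example: a 2 x 2 reference map, two frames, content shifting by one column.
ref = [[[1.0], [2.0]], [[3.0], [4.0]]]
shift = [[(0, 1), (0, 1)], [(0, 1), (0, 1)]]
still = [[(0, 0), (0, 0)], [(0, 0), (0, 0)]]
frames = [ref, warp_reference(ref, shift)]
result = motion_guided_temporal_attention(frames, ref, [still, shift])
```

Plain 1-D temporal attention only compares features at the same spatial location across frames, so content that has moved falls outside its receptive field. Adding the warped reference token is one simple way to give each location access to the reference feature that the predicted motion says belongs there.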

I2VControl: Disentangled and Unified Video Motion Synthesis Control

ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation

MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation

SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

Motion Control for Enhanced Complex Action Video Generation

Decouple Content and Motion for Conditional Image-to-Video Generation

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models

I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models

TIV-Diffusion: Towards Object-Centric Movement for Text-driven Image to Video Generation

MoVideo: Motion-Aware Video Generation with Diffusion Models

Controllable Longer Image Animation with Diffusion Models

Motion Prompting: Controlling Video Generation with Motion Trajectories

Motion-Conditioned Diffusion Model for Controllable Video Synthesis

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

Animate Your Motion: Turning Still Images into Dynamic Videos

I2VControl-Camera: Precise Video Camera Control with Adjustable Motion Strength

Motion Inversion for Video Customization

MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

CamI2V: Camera-Controlled Image-to-Video Diffusion Model