Abstract: Recent advancements in diffusion models have shown great promise in producing high-quality video content. However, efficiently training diffusion models capable of integrating directional guidance and controllable motion intensity remains a challenging and under-explored area. This paper introduces Mojito, a diffusion model that incorporates both **Mo**tion tra**j**ectory and **i**ntensi**t**y contr**o**l for text-to-video generation. Specifically, Mojito features a Directional Motion Control module that leverages cross-attention to efficiently direct the generated object's motion without additional training, alongside a Motion Intensity Modulator that uses optical flow maps generated from videos to guide varying levels of motion intensity. Extensive experiments demonstrate Mojito's effectiveness in achieving precise trajectory and intensity control with high computational efficiency, generating motion patterns that closely match specified directions and intensities, providing realistic dynamics that align well with natural motion in real-world scenarios.

What problem does this paper attempt to address?

The paper addresses how to integrate directional guidance and controllable motion intensity into video generation. Although existing diffusion models can generate high-quality video content, efficiently training models that combine both kinds of control remains difficult. The challenges include:

1. **Complex relative motion data capture**: In real-world videos, cameras and objects move at the same time, which makes relative motion data hard to capture.
2. **Lack of large-scale labeled datasets**: Existing video datasets rarely contain detailed motion-dynamics labels, and obtaining labels for these subtle aspects is both expensive and time-consuming.
3. **High demand for computing resources**: Training models with detailed labels requires a large amount of computing resources.

To address these problems, the paper introduces Mojito, a model that integrates trajectory direction and motion intensity control during text-to-video generation. It does so through two core modules:

- **Directional Motion Control (DMC) module**: Uses the cross-attention mechanism to adjust the motion direction of the generated object at inference time, without additional training, so that its trajectory aligns with the specified path.
- **Motion Intensity Modulator (MIM) module**: Encodes any motion intensity into features and integrates them into the diffusion framework, enabling precise control of motion intensity.

Mojito also explores using a global motion intensity embedding layer as a conditional input to further enhance control of motion intensity.

Through extensive experiments, the paper demonstrates the effectiveness of Mojito in achieving precise trajectory and intensity control. The generated motion patterns closely match the specified directions and intensities, producing realistic dynamics that are consistent with natural motion.
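As a rough illustration of how optical flow could be turned into a motion intensity signal, the sketch below averages per-pixel flow magnitudes over a clip and maps the result to a discrete level. This is not Mojito's actual implementation: the function names, the list-of-tuples flow representation, and the bucket edges are all assumptions made for the example. In practice the flow fields would come from an optical flow estimator rather than being written by hand.

```python
import math


def mean_flow_magnitude(flow):
    """Average per-pixel displacement (in pixels) of one optical flow field.

    `flow` is a 2D grid (list of rows) of (dx, dy) tuples. This layout is an
    assumption chosen to keep the example dependency-free.
    """
    total = 0.0
    count = 0
    for row in flow:
        for dx, dy in row:
            total += math.hypot(dx, dy)  # Euclidean length of the flow vector
            count += 1
    return total / count if count else 0.0


def motion_intensity(flows):
    """Clip-level score: mean flow magnitude averaged over all frame pairs."""
    if not flows:
        return 0.0
    return sum(mean_flow_magnitude(f) for f in flows) / len(flows)


def intensity_bucket(score, edges=(0.5, 2.0, 5.0)):
    """Map a score to a discrete level in 0..len(edges).

    The edge values are hypothetical and not taken from the paper.
    """
    return sum(score >= e for e in edges)


# Toy clip: two 2x2 flow fields between consecutive frames.
static = [[(0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]]
moving = [[(3.0, 4.0), (3.0, 4.0)], [(3.0, 4.0), (3.0, 4.0)]]

score = motion_intensity([static, moving])
level = intensity_bucket(score)
print(score, level)
```

A scalar or bucketed score of this kind is one plausible form for the conditioning input that an intensity embedding layer would consume; how Mojito actually encodes intensity is described only at the level of detail given above.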

Mojito: Motion Trajectory and Intensity Control for Video Generation

Motion Prompting: Controlling Video Generation with Motion Trajectories

Animate Your Motion: Turning Still Images into Dynamic Videos

Motion-Conditioned Diffusion Model for Controllable Video Synthesis

MoVideo: Motion-Aware Video Generation with Diffusion Models

Controllable Longer Image Animation with Diffusion Models

MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

Video Diffusion Models are Training-free Motion Interpreter and Controller

Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

Motion Control for Enhanced Complex Action Video Generation

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

COMD: Training-free Video Motion Transfer with Camera-Object Motion Disentanglement

Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation

ViMo: Generating Motions from Casual Videos

MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models

MotionDiffuse: Text-Driven Human Motion Generation With Diffusion Model

MV-Diffusion: Motion-aware Video Diffusion Model

MotionCrafter: One-Shot Motion Customization of Diffusion Models

MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation

MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation

Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion