Mojito: Motion Trajectory and Intensity Control for Video Generation

Xuehai He, Shuohang Wang, Jianwei Yang, Xiaoxia Wu, Yiping Wang, Kuan Wang, Zheng Zhan, Olatunji Ruwase, Yelong Shen, Xin Eric Wang
2024-12-12
Abstract: Recent advancements in diffusion models have shown great promise in producing high-quality video content. However, efficiently training diffusion models capable of integrating directional guidance and controllable motion intensity remains a challenging and under-explored area. This paper introduces Mojito, a diffusion model that incorporates both **Mo**tion tra**j**ectory and **i**ntensi**t**y contr**o**l for text-to-video generation. Specifically, Mojito features a Directional Motion Control module that leverages cross-attention to efficiently direct the generated object's motion without additional training, alongside a Motion Intensity Modulator that uses optical flow maps generated from videos to guide varying levels of motion intensity. Extensive experiments demonstrate Mojito's effectiveness in achieving precise trajectory and intensity control with high computational efficiency, generating motion patterns that closely match specified directions and intensities and providing realistic dynamics that align well with natural motion in real-world scenarios.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper asks how to integrate directional guidance and controllable motion intensity into video generation. Existing diffusion models can generate high-quality video content, but efficiently training models that also support directional guidance and controllable motion intensity remains difficult. The paper identifies three obstacles:

1. **Complex relative motion data capture**: In real-world videos, cameras and objects move simultaneously, which makes relative motion data hard to capture.
2. **Lack of large-scale labeled datasets**: Existing video datasets rarely contain detailed motion-dynamics labels, and obtaining such labels is both expensive and time-consuming.
3. **High demand for computing resources**: Training models with detailed labels requires a large amount of computing resources.

To address these problems, the paper introduces the Mojito model, which integrates trajectory direction and motion intensity control in text-to-video generation. Mojito relies on two core modules:

- **Directional Motion Control (DMC) module**: Uses the cross-attention mechanism to adjust the motion direction of the generated object at inference time, without additional training, so that its trajectory aligns with the specified path.
- **Motion Intensity Modulator (MIM) module**: Encodes a given motion intensity into features and integrates them into the diffusion framework, giving precise control of motion intensity.

Mojito also explores using a global motion intensity embedding layer as a conditional input to further strengthen control of motion intensity. Extensive experiments demonstrate Mojito's effectiveness in achieving precise trajectory and intensity control: the generated motion patterns closely match the specified directions and intensities and produce realistic dynamics consistent with natural motion.
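To make the two control signals more concrete, here is a minimal NumPy sketch of how they might be quantified. It is an illustration under stated assumptions, not the paper's implementation: this summary does not give Mojito's exact formulas, so the mean-flow-magnitude definition of intensity, the bin edges, the box-based attention energy, and all function names (`motion_intensity`, `intensity_level`, `box_attention_energy`) are hypothetical.

```python
import numpy as np


def motion_intensity(flow: np.ndarray) -> float:
    """Mean optical-flow magnitude over all frames and pixels.

    flow: array of shape (T, H, W, 2) with per-pixel (dx, dy) displacements
    between consecutive frames. Assumed definition, not taken from the paper.
    """
    magnitude = np.linalg.norm(flow, axis=-1)  # shape (T, H, W)
    return float(magnitude.mean())


def intensity_level(intensity: float, bin_edges: np.ndarray) -> int:
    """Map a scalar intensity to a discrete level (e.g. an embedding index).

    bin_edges must be increasing; the edges themselves are an assumption.
    """
    return int(np.digitize(intensity, bin_edges))


def box_attention_energy(attn: np.ndarray, boxes: np.ndarray) -> float:
    """Average fraction of attention mass that falls outside the target boxes.

    attn:  (T, H, W) non-negative cross-attention maps for the object token.
    boxes: (T, 4) integer boxes (x0, y0, x1, y1) with exclusive upper bounds,
           one per frame, tracing the desired trajectory.
    A lower value means the attention follows the trajectory more closely.
    """
    energies = []
    for frame_attn, (x0, y0, x1, y1) in zip(attn, boxes):
        total = frame_attn.sum() + 1e-8  # avoid division by zero
        inside = frame_attn[y0:y1, x0:x1].sum()
        energies.append(1.0 - inside / total)
    return float(np.mean(energies))
```

In a training-free setup of this kind, an energy like `box_attention_energy` would typically be minimized with respect to the latents during sampling, while a level from `intensity_level` would be fed to the model as a conditioning input. Both uses are assumptions about how such signals are commonly applied rather than details confirmed by this summary.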