Abstract: Existing text-to-video (T2V) models often struggle with generating videos with sufficiently pronounced or complex actions. A key limitation lies in the text prompt's inability to precisely convey intricate motion details. To address this, we propose a novel framework, MVideo, designed to produce long-duration videos with precise, fluid actions. MVideo overcomes the limitations of text prompts by incorporating mask sequences as an additional motion condition input, providing a clearer, more accurate representation of intended actions. Leveraging foundational vision models such as GroundingDINO and SAM2, MVideo automatically generates mask sequences, enhancing both efficiency and robustness. Our results demonstrate that, after training, MVideo effectively aligns text prompts with motion conditions to produce videos that simultaneously meet both criteria. This dual control mechanism allows for more dynamic video generation by enabling alterations to either the text prompt or motion condition independently, or both in tandem. Furthermore, MVideo supports motion condition editing and composition, facilitating the generation of videos with more complex actions. MVideo thus advances T2V motion generation, setting a strong benchmark for improved action depiction in current video diffusion models. Our project page is available at https://mvideo-v1.github.io/.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the challenges faced by existing Text-to-Video (T2V) generation models in generating videos with complex actions. Specifically, existing T2V models often struggle to generate videos with sufficiently significant or complex actions, primarily because text prompts cannot accurately convey intricate motion details. To overcome this limitation, the paper proposes a new framework, MVideo, which introduces mask sequences as additional motion condition inputs to generate long-duration videos with precise and smooth actions.

### Main Contributions

1. **Introduction of the MVideo Framework**: MVideo integrates additional motion conditions (mask sequences) to iteratively generate long-duration action videos, achieving precise motion control.
2. **Generalization Capability**: MVideo can align with unseen motion conditions and generate more complex videos by editing or combining motion conditions.
3. **Validation of Effectiveness**: The effectiveness of MVideo is validated through quantitative and visual comparisons with state-of-the-art video diffusion methods.

### Solution

1. **Mask Sequences**: MVideo utilizes mask sequences as additional condition inputs, which can more accurately represent the intended actions. Mask sequences are automatically generated using foundational vision models (such as GroundingDINO and SAM2), improving efficiency and robustness.
2. **Long Video Generation**: MVideo proposes an efficient recursive video generation method that combines image conditions and low-resolution video conditions, reducing computational costs while maintaining temporal consistency, ensuring content coherence and action sequence consistency in long-duration videos.
3. **Consistency Loss**: A new consistency loss is introduced during training to retain the original model's text alignment capability while learning mask sequence alignment.
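
The recursive long-video generation named under "Long Video Generation" can be pictured as a chunked loop. The sketch below is only an illustration under stated assumptions: the summary does not say how MVideo obtains the low-resolution video or how the conditions enter the network, so `generate_long_video`, `dummy_generate_chunk`, `chunk_len`, and the `Frame` type are all hypothetical names, and the generator is a toy stand-in rather than a diffusion model. It shows one plausible reading: each chunk is conditioned on the last frame of the previous chunk (the image condition) plus the matching slices of a low-resolution video and the mask sequence.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # hypothetical stand-in for an image tensor


def generate_long_video(
    generate_chunk: Callable[[Frame, Sequence[Frame], Sequence[Frame]], List[Frame]],
    first_frame: Frame,
    low_res_video: Sequence[Frame],
    mask_sequence: Sequence[Frame],
    chunk_len: int,
) -> List[Frame]:
    """Generate a long video chunk by chunk (assumed scheme, not the paper's code)."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    if len(low_res_video) != len(mask_sequence):
        raise ValueError("low_res_video and mask_sequence must have equal length")

    video: List[Frame] = []
    image_condition = first_frame
    for start in range(0, len(mask_sequence), chunk_len):
        end = start + chunk_len
        chunk = generate_chunk(
            image_condition,
            low_res_video[start:end],
            mask_sequence[start:end],
        )
        video.extend(chunk)
        # The last generated frame becomes the image condition for the next chunk.
        image_condition = chunk[-1]
    return video


def dummy_generate_chunk(
    image_condition: Frame,
    low_res_chunk: Sequence[Frame],
    mask_chunk: Sequence[Frame],
) -> List[Frame]:
    """Toy generator: accumulates mask values onto the image condition.

    It ignores low_res_chunk; a real model would use it for temporal guidance.
    """
    frames: List[Frame] = []
    value = image_condition[0]
    for mask in mask_chunk:
        value = value + mask[0]
        frames.append([value])
    return frames
```

Because each chunk only sees a short window plus one carried-over frame, the per-step cost stays bounded regardless of total video length, which is the kind of saving the summary attributes to the recursive scheme.
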

### Experimental Results

- **Text Alignment Performance**: MVideo's overall consistency, image quality, and action smoothness metrics on the VBench test set are comparable to existing models, demonstrating its strong text-to-video generation capability.
- **Mask Sequence Alignment Performance**: MVideo shows strong generalization capability on unseen object mask sequences, particularly in complex action scenarios, achieving high-precision mask alignment.
- **Case Studies**: Visual comparisons with existing T2V models demonstrate MVideo's significant advantages in generating complex action videos, including changing background scenes, moving objects, and camera movements.

### Conclusion

By introducing mask sequences as additional motion condition inputs, MVideo addresses the limitations of existing T2V models in generating complex action videos, achieving more precise and coherent video generation. Experimental results show that MVideo not only excels in mask sequence alignment but also generates complex action videos, indicating its broad application prospects.
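
The summary reports mask sequence alignment without naming the metric. One common way to quantify how well masks extracted from a generated video match the conditioning masks is per-frame intersection-over-union (IoU), averaged over the sequence. The sketch below assumes that metric purely for illustration; `mask_iou` and `mean_sequence_iou` are hypothetical helper names, and masks are plain nested lists of 0/1 values rather than any particular library's tensor type.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly by convention.
    return 1.0 if union == 0 else intersection / union


def mean_sequence_iou(pred_seq: Sequence[Mask], target_seq: Sequence[Mask]) -> float:
    """Average per-frame IoU over two mask sequences of equal length."""
    if not pred_seq or len(pred_seq) != len(target_seq):
        raise ValueError("sequences must be non-empty and of equal length")
    total = sum(mask_iou(p, t) for p, t in zip(pred_seq, target_seq))
    return total / len(pred_seq)


# Usage: the first frame overlaps on one of two foreground pixels,
# the second frame matches exactly.
condition_masks = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
generated_masks = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
score = mean_sequence_iou(generated_masks, condition_masks)
```

A score near 1.0 would indicate that the generated motion follows the mask condition closely; whether the paper uses this exact measure is not stated in the summary.
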

Motion Control for Enhanced Complex Action Video Generation

Motion Prompting: Controlling Video Generation with Motion Trajectories

Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning

Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

ViMo: Generating Motions from Casual Videos

Text-Animator: Controllable Visual Text Video Generation

Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion

MotionBooth: Motion-Aware Customized Text-to-Video Generation

MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Enhancing Motion Variation in Text-to-Motion Models via Pose and Video Conditioned Editing

I2VControl: Disentangled and Unified Video Motion Synthesis Control

MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video Infilling

ControlMM: Controllable Masked Motion Generation

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

ControlVideo: Training-free Controllable Text-to-Video Generation

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

MoVideo: Motion-Aware Video Generation with Diffusion Models