Abstract: Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on single-prompt generation and struggle to generate coherent scenes from multiple sequential prompts, which better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges, including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, the first training-free multi-prompt video generation method under MM-DiT architectures. Our key idea is to treat multi-prompt video generation as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models, enabling mask-guided precise semantic control across different prompts with attention sharing for multi-prompt video generation. Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts without additional training. We also present MPVBench, a new benchmark specially designed to evaluate multi-prompt video generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to solve several key problems in multi-prompt long-video generation. Specifically:

1. **Coherence in multi-prompt video generation**: Current video-generation models mainly focus on a single prompt and have difficulty generating coherent scenes from multiple consecutive prompts. When given multiple sequential prompts, these models often produce unnatural transitions and follow the prompts poorly.
2. **Multi-prompt generation without additional training**: Existing multi-prompt video-generation methods usually require a large amount of training data and computational resources, which is impractical in real-world applications. Therefore, the authors propose a multi-prompt video-generation method that needs no additional training.
3. **Complex motion and smooth transitions**: To generate more realistic dynamic scenes, especially those involving complex motion and significant camera movement, a method must maintain stable object motion and seamless semantic transitions, which existing methods often fail to do.

### Specific solutions in the paper

To solve the above problems, the authors propose DiTCtrl, a multi-prompt long-video-generation method based on the Multi-Modal Diffusion Transformer (MM-DiT) architecture. The main contributions of DiTCtrl include:

- **KV-sharing mechanism**: By analyzing the attention mechanism of MM-DiT, the authors found that 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models. Based on this, they introduce a KV-sharing mechanism that shares key-value pairs between the video segments of different prompts to maintain the semantic consistency of key objects.
- **Latent-space fusion strategy**: To ensure smooth transitions between different semantic segments, the authors propose a latent-space fusion strategy. Features in the overlapping region are fused through a position-dependent weight function, thereby achieving temporal coherence.
- **MPVBench benchmark**: To systematically evaluate multi-prompt video generation, the authors also introduce a new benchmark, MPVBench, which covers diverse transition types and provides dedicated evaluation metrics.

### Summary

Through attention-control mechanisms and a latent-space fusion strategy, DiTCtrl achieves multi-prompt long-video generation without additional training. According to the authors' experiments, it performs particularly well at generating complex motion and smooth transitions.
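The KV-sharing and latent-space fusion ideas summarized above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `shared_kv_attention` and `blend_overlap`, the boolean `ref_mask` used as a stand-in for mask guidance, and the linear ramp used as the position-dependent weight are all assumptions made for illustration. The weight function actually used in the paper may differ.

```python
import numpy as np


def shared_kv_attention(q_cur, k_ref, v_ref, ref_mask=None):
    """Attention in which the current segment's queries read another segment's keys/values.

    q_cur:    (n_cur, d) queries from the segment being generated.
    k_ref:    (n_ref, d) keys from the reference (earlier-prompt) segment.
    v_ref:    (n_ref, d_v) values from the reference segment.
    ref_mask: optional (n_ref,) boolean array; True marks reference tokens that
              may be attended to (hypothetical stand-in for mask guidance).
    """
    scores = q_cur @ k_ref.T / np.sqrt(q_cur.shape[-1])
    if ref_mask is not None:
        ref_mask = np.asarray(ref_mask, dtype=bool)
        if not ref_mask.any():
            raise ValueError("ref_mask must keep at least one reference token")
        # Masked-out reference tokens receive zero attention weight.
        scores = np.where(ref_mask[None, :], scores, -np.inf)
    # Numerically stable softmax over the reference tokens.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_ref


def blend_overlap(seg_a, seg_b, overlap):
    """Fuse two latent segments whose last/first `overlap` frames cover the same time span.

    seg_a, seg_b: arrays with frames on axis 0 and identical trailing shape.
    The weight of seg_b rises linearly across the overlap (assumed ramp).
    """
    if not 0 < overlap <= min(len(seg_a), len(seg_b)):
        raise ValueError("overlap must be between 1 and the shorter segment length")
    # Interior points of a 0..1 ramp, so neither segment is dropped entirely.
    w = np.linspace(0.0, 1.0, overlap + 2)[1:-1]
    w = w.reshape(-1, *([1] * (seg_a.ndim - 1)))  # broadcast over non-frame axes
    mixed = (1.0 - w) * seg_a[-overlap:] + w * seg_b[:overlap]
    return np.concatenate([seg_a[:-overlap], mixed, seg_b[overlap:]], axis=0)
```

In this sketch, two segments of 6 latent frames with an overlap of 3 fuse into 9 frames, with the middle 3 frames moving gradually from the first segment's latents to the second's. A smoother ramp (for example, a cosine schedule) could be swapped in by changing only the line that computes `w`.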

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
