DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, Xiangyu Yue
2024-12-25
Abstract: Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on a single prompt and struggle to generate coherent scenes from multiple sequential prompts, which better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges, including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, the first training-free multi-prompt video generation method for MM-DiT architectures. Our key idea is to treat multi-prompt video generation as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism and find that its 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models, enabling mask-guided, precise semantic control across different prompts with attention sharing for multi-prompt video generation. Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts, without additional training. We also present MPVBench, a new benchmark specially designed to evaluate multi-prompt video generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.
Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia
What problem does this paper attempt to address?
This paper aims to solve several key problems in multi-prompt long-video generation:

1. **Coherence in multi-prompt video generation**: Current video-generation models mainly focus on a single prompt and have difficulty generating coherent scenes from multiple consecutive prompts. Given multiple sequential prompts, these models often produce unnatural transitions and follow the prompts weakly.
2. **Multi-prompt generation without additional training**: Existing multi-prompt video-generation methods usually require large amounts of training data and computational resources, which is impractical in real-world applications. The authors therefore propose a multi-prompt video-generation method that needs no additional training.
3. **Complex motion and smooth transitions**: In dynamic scenes, especially those involving complex motion and significant camera movement, existing methods often fail to maintain stable object motion and seamless semantic transitions.

### Specific solutions in the paper

To solve the above problems, the authors propose DiTCtrl, a multi-prompt long-video-generation method based on the Multi-Modal Diffusion Transformer (MM-DiT) architecture. The main contributions of DiTCtrl include:

- **KV-sharing mechanism**: By analyzing the attention mechanism of MM-DiT, the authors found that 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models. Based on this, they introduce a KV-sharing mechanism that shares key-value pairs between the video segments of different prompts, maintaining the semantic consistency of key objects.
- **Latent-space fusion strategy**: To ensure smooth transitions between different semantic segments, the authors propose a latent-space fusion strategy. Features in the overlapping region of adjacent segments are fused through a position-dependent weight function, which yields temporal coherence.
- **MPVBench benchmark**: To systematically evaluate multi-prompt video generation, the authors introduce a new benchmark, MPVBench, which includes diverse transition types and dedicated evaluation metrics.

### Summary

Through its attention-control mechanism and latent-space fusion strategy, DiTCtrl achieves multi-prompt long-video generation without additional training. According to the authors' experiments, it reaches state-of-the-art performance, particularly in generating complex motion and smooth transitions.
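The KV-sharing and latent-space fusion ideas described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `shared_kv_attention` and `blend_overlap` are hypothetical, the mask guidance mentioned in the abstract is omitted, and the linear ramp is only one plausible choice of position-dependent weight, since this summary does not state the exact function the paper uses.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def shared_kv_attention(q_cur, k_src, v_src):
    """Scaled dot-product attention with shared keys/values.

    Assumption: queries come from the current prompt's segment, while
    keys and values are taken from the previous prompt's segment, so
    the current segment reuses the appearance of earlier content.
    Shapes: q_cur (n_q, d), k_src (n_k, d), v_src (n_k, d_v).
    """
    scores = q_cur @ k_src.T / np.sqrt(q_cur.shape[-1])
    return softmax(scores, axis=-1) @ v_src


def blend_overlap(lat_a, lat_b, overlap):
    """Fuse two latent segments along the frame axis (axis 0).

    The last `overlap` frames of lat_a and the first `overlap` frames
    of lat_b cover the same time span. Inside that span the weight of
    lat_b rises linearly with frame position (an assumed weight shape).
    """
    if not 1 <= overlap <= min(len(lat_a), len(lat_b)):
        raise ValueError("overlap must be between 1 and the segment length")
    # Interior points of a 0..1 ramp, so neither segment is fully
    # discarded at the boundary frames of the overlap.
    w = np.linspace(0.0, 1.0, overlap + 2)[1:-1]
    # Reshape so the weights broadcast over the non-frame axes.
    w = w.reshape(-1, *([1] * (lat_a.ndim - 1)))
    mixed = (1.0 - w) * lat_a[-overlap:] + w * lat_b[:overlap]
    return np.concatenate([lat_a[:-overlap], mixed, lat_b[overlap:]], axis=0)
```

For example, blending two 4-frame segments with `overlap=2` gives a 6-frame result whose two middle frames move gradually from the first segment's latents to the second's, which is the kind of smooth transition the fusion strategy aims for.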