UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control

Xuweiyi Chen, Tian Xia, Sihan Xu
2024-03-06
Abstract: Video Diffusion Models have been developed for video generation, usually integrating text and image conditioning to enhance control over the generated content. Despite the progress, ensuring consistency across frames remains a challenge, particularly when using text prompts as control conditions. To address this problem, we introduce UniCtrl, a novel, plug-and-play method that is universally applicable to improve the spatiotemporal consistency and motion diversity of videos generated by text-to-video models without additional training. UniCtrl ensures semantic consistency across different frames through cross-frame self-attention control, and meanwhile, enhances the motion quality and spatiotemporal consistency through motion injection and spatiotemporal synchronization. Our experimental results demonstrate UniCtrl's efficacy in enhancing various text-to-video models, confirming its effectiveness and universality.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper aims to address the issue of frame consistency in Video Diffusion Models (VDMs) when generating videos. Specifically, ensuring consistency between different frames remains a challenge when using text prompts as control conditions. To solve this problem, the authors propose the **UniCtrl** method, which can be applied to various text-to-video models without additional training to improve the spatiotemporal consistency and motion diversity of generated videos.

#### Main Contributions

1. **Cross-Frame Self-Attention Control (SAC)**:
   - Ensures semantic consistency between different frames through cross-frame self-attention control.
   - Applies the keys and values of the first frame to the self-attention layer of each frame to achieve semantic consistency.
2. **Motion Injection (MI)**:
   - Addresses the issue of reduced motion caused by improved consistency.
   - Introduces two branches during sampling: one for cross-frame self-attention control and the other for retaining the original queries, thereby maintaining motion effects.
3. **Spatiotemporal Synchronization (SS)**:
   - Copies the latent representation of the output branch as the initial value of the motion branch before each sampling step, further enhancing spatiotemporal consistency.

Through experimental validation, the **UniCtrl** method significantly improves the spatiotemporal consistency and motion quality of videos generated by various text-to-video models, demonstrating its effectiveness and generality.
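
The three components described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`self_attention`, `cross_frame_self_attention`, `motion_injected_attention`, `synchronize`), the `(frames, tokens, dim)` tensor layout, and the single-head attention are all assumptions made for illustration. In particular, the way the motion branch's queries are combined with the output branch's first-frame keys and values is one reading of the summary, not a confirmed detail of the method.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(q, k, v):
    """Plain per-frame self-attention; q, k, v have shape (frames, tokens, dim)."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax((q @ k.transpose(0, 2, 1)) * scale, axis=-1)
    return weights @ v


def cross_frame_self_attention(q, k, v):
    """SAC sketch: every frame attends to the first frame's keys and values."""
    k_first = np.broadcast_to(k[:1], k.shape)
    v_first = np.broadcast_to(v[:1], v.shape)
    return self_attention(q, k_first, v_first)


def motion_injected_attention(q_motion, k_output, v_output):
    """MI sketch (assumed wiring): queries come from the uncontrolled motion
    branch, keys and values come from the first frame of the output branch."""
    return cross_frame_self_attention(q_motion, k_output, v_output)


def synchronize(output_latent):
    """SS sketch: before each sampling step, the motion branch restarts
    from a copy of the output branch's latent."""
    return output_latent.copy()
```

Because only the keys and values are shared, each frame keeps its own queries, which is why the summary describes the queries as the carrier of motion. Frame 0 is unaffected by the control, since it already attends to its own keys and values.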