UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control

Xuweiyi Chen, Tian Xia, Sihan Xu

2024-03-06

Abstract: Video Diffusion Models have been developed for video generation, usually integrating text and image conditioning to enhance control over the generated content. Despite the progress, ensuring consistency across frames remains a challenge, particularly when using text prompts as control conditions. To address this problem, we introduce UniCtrl, a novel, plug-and-play method that is universally applicable to improve the spatiotemporal consistency and motion diversity of videos generated by text-to-video models without additional training. UniCtrl ensures semantic consistency across different frames through cross-frame self-attention control, and meanwhile, enhances the motion quality and spatiotemporal consistency through motion injection and spatiotemporal synchronization. Our experimental results demonstrate UniCtrl's efficacy in enhancing various text-to-video models, confirming its effectiveness and universality.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper aims to address the issue of frame consistency in Video Diffusion Models (VDMs) when generating videos. Specifically, ensuring consistency between different frames remains a challenge when using text prompts as control conditions. To solve this problem, the authors propose the **UniCtrl** method, which can be applied to various text-to-video models without additional training to improve the spatiotemporal consistency and motion diversity of generated videos.

#### Main Contributions

1. **Cross-Frame Self-Attention Control (SAC)**:
   - Ensures semantic consistency between different frames through cross-frame self-attention control.
   - Applies the keys and values of the first frame to the self-attention layer of each frame to achieve semantic consistency.
2. **Motion Injection (MI)**:
   - Addresses the issue of reduced motion caused by improved consistency.
   - Introduces two branches during sampling: one for cross-frame self-attention control and the other for retaining the original queries, thereby maintaining motion effects.
3. **Spatiotemporal Synchronization (SS)**:
   - Copies the latent representation of the output branch as the initial value of the motion branch before each sampling step, further enhancing spatiotemporal consistency.

Through experimental validation, the **UniCtrl** method significantly improves the spatiotemporal consistency and motion quality of videos generated by various text-to-video models, demonstrating its effectiveness and generality.
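As a rough illustration of the cross-frame self-attention control idea summarized above, the sketch below applies it to toy NumPy arrays. It is not the authors' implementation: the function names (`self_attention`, `cross_frame_attention`), the `(frames, tokens, dim)` layout, and the random inputs are all assumptions made for this example. Each frame keeps its own queries, in the spirit of the motion injection description, while keys and values are taken from the first frame only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(q, k, v):
    # Plain per-frame scaled dot-product attention.
    # q, k, v are assumed to have shape (frames, tokens, dim).
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax((q @ k.transpose(0, 2, 1)) * scale, axis=-1)
    return weights @ v


def cross_frame_attention(q, k, v):
    # Hypothetical sketch of cross-frame control: every frame uses its own
    # queries but attends to the keys and values of the first frame.
    k_first = np.broadcast_to(k[:1], k.shape)
    v_first = np.broadcast_to(v[:1], v.shape)
    return self_attention(q, k_first, v_first)


# Toy usage with random features (4 frames, 6 tokens, 8 channels).
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 6, 8))
k = rng.normal(size=(4, 6, 8))
v = rng.normal(size=(4, 6, 8))
out = cross_frame_attention(q, k, v)
print(out.shape)  # (4, 6, 8)
```

Because every frame reads the same keys and values, the content each frame can draw on is anchored to the first frame, while the per-frame queries still allow the attention pattern to differ from frame to frame.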

UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

ControlVideo: Training-free Controllable Text-to-Video Generation

Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models

Edit Temporal-Consistent Videos with Image Diffusion Model

EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation

Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation

FLATTEN: Optical FLow-guided ATTENtion for Consistent Text-to-video Editing

SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models

Blended Latent Diffusion under Attention Control for Real-World Video Editing

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

Video ControlNet: Towards Temporally Consistent Synthetic-to-Real Video Translation Using Conditional Image Diffusion Models

UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning

Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model

UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation