FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Jiasong Feng, Ao Ma, Jing Wang, Bo Cheng, Xiaodan Liang, Dawei Leng, Yuhui Yin
2024-08-16
Abstract: Synthesizing motion-rich and temporally consistent videos remains a challenge in artificial intelligence, especially when dealing with extended durations. Existing text-to-video (T2V) models commonly employ spatial cross-attention for text control, equivalently guiding different frame generations without frame-specific textual guidance. Thus, the model's capacity to comprehend the temporal logic conveyed in prompts and generate videos with coherent motion is restricted. To tackle this limitation, we introduce FancyVideo, an innovative video generator that improves the existing text-control mechanism with the well-designed Cross-frame Textual Guidance Module (CTGM). Specifically, CTGM incorporates the Temporal Information Injector (TII), Temporal Affinity Refiner (TAR), and Temporal Feature Booster (TFB) at the beginning, middle, and end of cross-attention, respectively, to achieve frame-specific textual guidance. Firstly, TII injects frame-specific information from latent features into text conditions, thereby obtaining cross-frame textual conditions. Then, TAR refines the correlation matrix between cross-frame textual conditions and latent features along the time dimension. Lastly, TFB boosts the temporal consistency of latent features. Extensive experiments comprising both quantitative and qualitative evaluations demonstrate the effectiveness of FancyVideo. Our video demo, code and model are available at https://360cvgroup.github.io/FancyVideo/.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two weaknesses of text-to-video (T2V) generation: producing coherent motion and keeping videos temporally consistent, especially over extended durations. Existing T2V models typically apply spatial cross-attention for text control, so every frame is guided by the same text condition and no frame receives frame-specific textual guidance. This limits how well the model can follow the temporal logic of a prompt. To solve this problem, the paper proposes FancyVideo, whose core contributions are:

1. **Cross-frame Textual Guidance Module (CTGM)**: A redesign of the existing text-control mechanism. CTGM places three sub-modules at the beginning, middle, and end of cross-attention, respectively:
   - **Temporal Information Injector (TII)**: Injects frame-specific information from latent features into the text conditions, producing cross-frame textual conditions.
   - **Temporal Affinity Refiner (TAR)**: Refines the correlation matrix between the cross-frame textual conditions and the latent features along the time dimension, adjusting the temporal logic of the text guidance.
   - **Temporal Feature Booster (TFB)**: Boosts the temporal consistency of the latent features.
2. **FancyVideo**: Presented as the first work to explore the T2V task in depth with cross-frame textual guidance; the generator uses CTGM to improve both the dynamism and the consistency of the generated videos.
3. **Experimental Validation**: Quantitative and qualitative evaluations demonstrate the effectiveness of FancyVideo, including state-of-the-art results on the EvalCrafter benchmark and competitive performance on the UCF-101 and MSR-VTT datasets.

In summary, FancyVideo replaces frame-agnostic text control with a cross-frame textual guidance strategy, which yields more dynamic and more temporally consistent videos.
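To make the difference between shared and frame-specific text guidance concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the text above only states the role of each sub-module, so the internals are assumptions made for illustration. Specifically, the TII stand-in lets text tokens attend to each frame's latents, and the TAR and TFB stand-ins both use a fixed three-tap temporal average where the real modules would presumably be learned layers. The function names `spatial_cross_attention`, `cross_frame_textual_guidance`, and `temporal_smooth` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_smooth(x, kernel=(0.25, 0.5, 0.25)):
    """Mix each frame with its neighbours along axis 0 (edge frames are repeated).
    Assumption: a fixed kernel stands in for a learned temporal layer."""
    padded = np.concatenate([x[:1], x, x[-1:]], axis=0)
    return kernel[0] * padded[:-2] + kernel[1] * padded[1:-1] + kernel[2] * padded[2:]

def spatial_cross_attention(latents, text):
    """Baseline: every frame attends to the same text condition.
    latents: (F, N, D) = frames x spatial tokens x channels; text: (L, D)."""
    d = latents.shape[-1]
    affinity = latents @ text.T / np.sqrt(d)           # (F, N, L)
    return softmax(affinity, axis=-1) @ text           # (F, N, D)

def cross_frame_textual_guidance(latents, text):
    """Toy version of the three CTGM stages around cross-attention."""
    d = latents.shape[-1]
    # Beginning (TII stand-in): inject each frame's latent information into
    # the text condition, giving one text condition per frame.
    inject = softmax(text @ latents.transpose(0, 2, 1) / np.sqrt(d), axis=-1)  # (F, L, N)
    frame_text = text[None] + inject @ latents         # (F, L, D)
    # Correlation matrix between latents and the per-frame text conditions.
    affinity = latents @ frame_text.transpose(0, 2, 1) / np.sqrt(d)            # (F, N, L)
    # Middle (TAR stand-in): refine the correlation matrix along the time axis.
    affinity = temporal_smooth(affinity)
    out = softmax(affinity, axis=-1) @ frame_text      # (F, N, D)
    # End (TFB stand-in): mix output features across neighbouring frames.
    return temporal_smooth(out), frame_text

rng = np.random.default_rng(0)
latents = rng.standard_normal((8, 16, 32))  # 8 frames, 16 spatial tokens, 32 channels
text = rng.standard_normal((5, 32))         # 5 text tokens

baseline = spatial_cross_attention(latents, text)
guided, frame_text = cross_frame_textual_guidance(latents, text)
print(baseline.shape, guided.shape, frame_text.shape)
# (8, 16, 32) (8, 16, 32) (8, 5, 32)
```

In the baseline, the text tensor has no frame axis, so all eight frames see identical keys and values. In the sketch of CTGM, `frame_text` gains a leading frame axis, which is the structural change that frame-specific textual guidance requires.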