FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Jiasong Feng,Ao Ma,Jing Wang,Bo Cheng,Xiaodan Liang,Dawei Leng,Yuhui Yin

2024-08-16

Abstract: Synthesizing motion-rich and temporally consistent videos remains a challenge in artificial intelligence, especially when dealing with extended durations. Existing text-to-video (T2V) models commonly employ spatial cross-attention for text control, equivalently guiding different frame generations without frame-specific textual guidance. Thus, the model's capacity to comprehend the temporal logic conveyed in prompts and generate videos with coherent motion is restricted. To tackle this limitation, we introduce FancyVideo, an innovative video generator that improves the existing text-control mechanism with the well-designed Cross-frame Textual Guidance Module (CTGM). Specifically, CTGM incorporates the Temporal Information Injector (TII), Temporal Affinity Refiner (TAR), and Temporal Feature Booster (TFB) at the beginning, middle, and end of cross-attention, respectively, to achieve frame-specific textual guidance. Firstly, TII injects frame-specific information from latent features into text conditions, thereby obtaining cross-frame textual conditions. Then, TAR refines the correlation matrix between cross-frame textual conditions and latent features along the time dimension. Lastly, TFB boosts the temporal consistency of latent features. Extensive experiments comprising both quantitative and qualitative evaluations demonstrate the effectiveness of FancyVideo. Our video demo, code and model are available at https://360cvgroup.github.io/FancyVideo/.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses text-to-video (T2V) generation, specifically the difficulty of producing videos with coherent motion and long-term consistency. Existing T2V models struggle to keep long videos spatiotemporally consistent and their motion coherent, primarily because they control generation through spatial cross-attention, which guides every frame with the same text condition and provides no frame-specific textual guidance.

To solve this problem, the paper proposes FancyVideo. Its core contributions are:

1. **Cross-frame Textual Guidance Module (CTGM)**: A redesign of the existing text-control mechanism. CTGM places three sub-modules at the beginning, middle, and end of cross-attention, respectively:
   - **Temporal Information Injector (TII)**: Injects frame-specific information from the latent features into the text conditions, producing cross-frame textual conditions.
   - **Temporal Affinity Refiner (TAR)**: Refines the correlation matrix between the cross-frame textual conditions and the latent features along the temporal dimension, adjusting the temporal logic of the text guidance.
   - **Temporal Feature Booster (TFB)**: Further enhances the temporal consistency of the latent features.
2. **FancyVideo**: Presented as the first work to explore cross-frame textual guidance for the T2V task in depth, the model uses CTGM to improve both the dynamism and the consistency of generated videos.
3. **Experimental Validation**: Quantitative and qualitative evaluations support the method, including state-of-the-art results on the EvalCrafter benchmark and competitive performance on the UCF-101 and MSR-VTT datasets.

In summary, FancyVideo targets the weakness of existing T2V models on long videos by replacing frame-agnostic text control with a cross-frame textual guidance strategy.
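To make the contrast between spatial cross-attention and frame-specific guidance concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`spatial_cross_attention`, `cross_frame_guidance`, `temporal_mix`), the tensor shapes, the residual connections, and the use of projection-free attention as a stand-in for each of TII, TAR, and TFB are all hypothetical choices made for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_mix(x):
    """Toy temporal self-attention over the frame axis (axis 0).

    Assumption: no learned projections; each frame is flattened and
    attends to every other frame, with a residual connection.
    """
    num_frames = x.shape[0]
    flat = x.reshape(num_frames, -1)                   # (F, D)
    scores = flat @ flat.T / np.sqrt(flat.shape[1])    # (F, F)
    weights = softmax(scores, axis=-1)
    return x + (weights @ flat).reshape(x.shape)


def spatial_cross_attention(latents, text):
    """Baseline: every frame attends to the same text tokens."""
    dim = latents.shape[-1]
    affinity = latents @ text.T / np.sqrt(dim)         # (F, N, L)
    return softmax(affinity, axis=-1) @ text           # (F, N, d)


def cross_frame_guidance(latents, text):
    """Hypothetical three-stage sketch mirroring TII -> TAR -> TFB."""
    dim = latents.shape[-1]

    # Stage 1 (TII stand-in): text tokens attend to each frame's latents,
    # so every frame gets its own copy of the text condition.
    inject = softmax(text @ latents.transpose(0, 2, 1) / np.sqrt(dim), axis=-1)
    text_per_frame = text[None] + inject @ latents     # (F, L, d)

    # Stage 2 (TAR stand-in): refine the latent-text affinity along time.
    affinity = latents @ text_per_frame.transpose(0, 2, 1) / np.sqrt(dim)
    affinity = temporal_mix(affinity)                  # (F, N, L)

    # Standard cross-attention read-out with the frame-specific text.
    out = softmax(affinity, axis=-1) @ text_per_frame  # (F, N, d)

    # Stage 3 (TFB stand-in): temporal mixing of the output features.
    return temporal_mix(out), text_per_frame


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.normal(size=(4, 6, 8))  # 4 frames, 6 spatial tokens, dim 8
    text = rng.normal(size=(5, 8))        # 5 text tokens, dim 8

    baseline = spatial_cross_attention(latents, text)
    guided, text_per_frame = cross_frame_guidance(latents, text)
    print(baseline.shape, guided.shape, text_per_frame.shape)
```

In the baseline, the `text` array is shared by all frames, which is the frame-agnostic behaviour the paper criticizes. In the sketch of the guided path, `text_per_frame` carries a separate text condition for each frame, and the two `temporal_mix` calls mark where temporal refinement would sit relative to the cross-attention read-out.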
