Abstract: Large-scale Text-to-Video (T2V) diffusion models have recently demonstrated an unprecedented capability to transform natural language descriptions into photorealistic videos. Despite the promising results, a significant challenge remains: these models struggle to fully grasp complex compositional interactions between multiple concepts and actions. This issue arises when some words dominantly influence the final video, overshadowing other concepts. To tackle this problem, we introduce Vico, a generic framework for compositional video generation that explicitly ensures all concepts are represented properly. At its core, Vico analyzes how input tokens influence the generated video, and adjusts the model to prevent any single concept from dominating. Specifically, Vico extracts attention weights from all layers to build a spatial-temporal attention graph, and then estimates the influence as the max-flow from the source text token to the video target token. Although the direct computation of attention flow in diffusion models is typically infeasible, we devise an efficient approximation based on subgraph flows and employ a fast, vectorized implementation, which makes the flow computation manageable and differentiable. By updating the noisy latent to balance these flows, Vico captures complex interactions and consequently produces videos that closely adhere to textual descriptions. We apply our method to multiple diffusion-based video models for compositional T2V and video editing. Empirical results demonstrate that our framework significantly enhances the compositional richness and accuracy of the generated videos.
Visit our website at https://adamdad.github.io/vico/.
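
The idea of scoring a text token's influence as a max-flow through an attention graph can be illustrated on a toy example. The sketch below is not the Vico implementation: it computes an exact max-flow with Edmonds-Karp on a tiny dense graph, whereas the abstract describes a vectorized, differentiable approximation based on subgraph flows. The function names (`max_flow`, `token_influence`), the layered graph layout, and the use of a single super-sink collecting all final-layer video tokens are assumptions made purely for illustration.

```python
from collections import deque

import numpy as np


def max_flow(capacity, source, sink):
    """Exact max-flow (Edmonds-Karp) on a dense capacity matrix."""
    n = capacity.shape[0]
    residual = capacity.astype(float).copy()
    total = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = [-1] * n
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u, v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            return total  # no augmenting path is left
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while v != source:
            bottleneck = min(bottleneck, residual[parent[v], v])
            v = parent[v]
        # Push the bottleneck flow and update the residual graph.
        v = sink
        while v != source:
            u = parent[v]
            residual[u, v] -= bottleneck
            residual[v, u] += bottleneck
            v = u
        total += bottleneck


def token_influence(cross_attn, self_attns, text_idx):
    """Hypothetical influence score of one text token.

    cross_attn: (N, T) array, attention of N video tokens to T text tokens.
    self_attns: list of (N, N) arrays; self_attns[l][i, j] is the attention
        of video token i at layer l + 1 to video token j at layer l.
    """
    n_video = cross_attn.shape[0]
    n_layers = len(self_attns) + 1
    n_nodes = 2 + n_layers * n_video
    source, sink = 0, n_nodes - 1

    def node(layer, i):
        return 1 + layer * n_video + i

    cap = np.zeros((n_nodes, n_nodes))
    # Source (the text token) feeds the first layer of video tokens.
    for i in range(n_video):
        cap[source, node(0, i)] = cross_attn[i, text_idx]
    # Attention weights act as edge capacities between consecutive layers.
    for layer, attn in enumerate(self_attns):
        for i in range(n_video):
            for j in range(n_video):
                cap[node(layer, j), node(layer + 1, i)] = attn[i, j]
    # Assumption: a super-sink collects every final-layer video token with
    # enough capacity that these edges never limit the flow.
    sink_cap = float(cross_attn[:, text_idx].sum())
    for i in range(n_video):
        cap[node(n_layers - 1, i), sink] = sink_cap
    return max_flow(cap, source, sink)


# Toy example: 2 video tokens, 2 text tokens, one self-attention layer.
cross = np.array([[0.9, 0.1], [0.8, 0.2]])
self_layers = [np.array([[0.2, 0.1], [0.3, 0.4]])]
flows = [token_influence(cross, self_layers, t) for t in range(2)]
```

In this toy setup the first text token carries far more flow than the second, which is the kind of imbalance the abstract says Vico corrects by updating the noisy latent. That correction step relies on a differentiable flow estimate and is not shown here.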

Text2Video: an End-to-end Learning Framework for Expressing Text with Videos

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

Transcript to Video: Efficient Clip Sequencing from Texts

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

ControlVideo: Training-free Controllable Text-to-Video Generation

Jointly Modeling Embedding and Translation to Bridge Video and Language

MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation

Learning Video-Text Aligned Representations for Video Captioning

Towards Long Video Understanding via Fine-detailed Video Story Generation

Text-driven Video Prediction

Compositional Video Generation as Flow Equalization

VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation

Video Captioning with Transferred Semantic Attributes

Video Captioning With Attention-Based LSTM and Semantic Consistency

Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising

ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System