Abstract: We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation methods to maintain scenario diversity over long durations presents significant challenges. To address this, we propose a Segmented Cross-Attention (SCA) strategy, which splits hidden states into segments along the temporal dimension, allowing each segment to cross-attend to a corresponding sub-caption. SCA requires no additional parameters, enabling seamless incorporation into current DiT-based architectures. To facilitate high-quality long video generation, we build the LongTake-HD dataset, consisting of 261k content-rich videos with scenario coherence, annotated with an overall video caption and five progressive sub-captions. Experiments show that our Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, outperforming existing state-of-the-art video generation methods. This demonstrates that our proposed Presto significantly enhances content richness, maintains long-range coherence, and captures intricate textual details. More details are displayed on our project page: https://presto-video.github.io/

What problem does this paper attempt to address?

The paper addresses the problem of generating long-form videos that have both rich content and long-range coherence. Most current video generation methods produce short clips of 3 to 8 seconds, which limits the expressiveness and richness of the generated content. Earlier approaches to longer videos usually add interpolation or extrapolation stages that extend short clips, but such extensions struggle to go beyond the scene content of the original clip and are bounded by its finite capacity. Another line of work extends video length autoregressively by adding new modules, which introduces error propagation.

The paper proposes Presto, which generates 15-second videos while maintaining content richness and long-range coherence. Presto introduces the **Segmented Cross-Attention (SCA)** strategy, which splits the hidden states into segments along the temporal dimension so that each segment cross-attends to its corresponding sub-caption. SCA requires no additional parameters and can be integrated seamlessly into existing Diffusion Transformer (DiT)-based architectures.

To support high-quality long video generation, the authors also build the LongTake-HD dataset, which contains 261k content-rich videos with scenario coherence, each annotated with an overall video caption and five progressive sub-captions. The dataset is intended to help generated videos stay rich in content while remaining coherent over long durations. Experiments show that Presto achieves 78.5% on the VBench Semantic Score and 100% on Dynamic Degree, outperforming existing state-of-the-art video generation methods. The authors take this as evidence that Presto enhances content richness, maintains long-range coherence, and captures intricate textual details.
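The segment-wise cross-attention idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head attention, the tensor layout `(frames, spatial tokens, channels)`, the even split of frames via `np.array_split`, and all function and variable names are assumptions made for illustration. The one property it aims to show is that every temporal segment reuses the same projection weights, so splitting adds no parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(hidden, text, w_q, w_k, w_v):
    """Single-head cross-attention: hidden (N, d) queries attend to text (M, d)."""
    q = hidden @ w_q
    k = text @ w_k
    v = text @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def segmented_cross_attention(hidden, sub_captions, w_q, w_k, w_v):
    """Hypothetical SCA sketch.

    hidden:       (T, S, d) array of frames x spatial tokens x channels.
    sub_captions: list of (M_i, d) text embeddings, one per temporal segment.
    Assumes T >= len(sub_captions) so that no segment is empty.
    """
    T, S, d = hidden.shape
    # Split frame indices into contiguous segments, one per sub-caption.
    segments = np.array_split(np.arange(T), len(sub_captions))
    out = np.empty((T, S, w_v.shape[1]))
    for frame_idx, caption in zip(segments, sub_captions):
        tokens = hidden[frame_idx].reshape(-1, d)
        # The same w_q, w_k, w_v are shared by every segment.
        attended = cross_attention(tokens, caption, w_q, w_k, w_v)
        out[frame_idx] = attended.reshape(len(frame_idx), S, -1)
    return out

# Toy usage with random data: 10 frames, 4 spatial tokens, 5 sub-captions.
rng = np.random.default_rng(0)
T, S, d = 10, 4, 8
hidden = rng.normal(size=(T, S, d))
captions = [rng.normal(size=(6, d)) for _ in range(5)]
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out = segmented_cross_attention(hidden, captions, w_q, w_k, w_v)
print(out.shape)  # (10, 4, 8)
```

In this sketch, changing the fifth sub-caption affects only the last two frames, since each segment sees only its own text. How the actual model keeps neighbouring segments coherent is not covered here.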

Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
