Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation

Xin Yan, Yuxuan Cai, Qiuyue Wang, Yuan Zhou, Wenhao Huang, Huan Yang
2024-12-02
Abstract: We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation methods to maintain scenario diversity over long durations presents significant challenges. To address this, we propose a Segmented Cross-Attention (SCA) strategy, which splits hidden states into segments along the temporal dimension, allowing each segment to cross-attend to a corresponding sub-caption. SCA requires no additional parameters, enabling seamless incorporation into current DiT-based architectures. To facilitate high-quality long video generation, we build the LongTake-HD dataset, consisting of 261k content-rich videos with scenario coherence, annotated with an overall video caption and five progressive sub-captions. Experiments show that our Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, outperforming existing state-of-the-art video generation methods. This demonstrates that our proposed Presto significantly enhances content richness, maintains long-range coherence, and captures intricate textual details. More details are displayed on our project page: https://presto-video.github.io/
Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia
What problem does this paper attempt to address?
This paper addresses the generation of long-form videos with rich content and long-range coherence. Most current video generation methods produce short clips of 3 to 8 seconds, which limits the expressiveness and richness of the generated content. Earlier approaches to longer videos typically add interpolation or extrapolation stages that extend a short clip, but these struggle to go beyond the original scene content and are bounded by the limited capacity of the short clip they start from. Another approach extends the video autoregressively by adding new modules, which introduces error propagation.

The paper proposes Presto, a method that generates 15-second videos while maintaining content richness and long-range coherence. Presto introduces the **Segmented Cross-Attention (SCA)** strategy, which splits the hidden states into segments along the temporal dimension and lets each segment cross-attend to its corresponding sub-caption. SCA requires no additional parameters and can be integrated into existing Diffusion Transformer (DiT)-based architectures. To support high-quality long video generation, the authors also build the LongTake-HD dataset, which contains 261k content-rich videos with scenario coherence, each annotated with an overall video caption and five progressive sub-captions.

Experiments show that Presto reaches 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, outperforming existing state-of-the-art video generation methods. These results indicate that Presto enhances content richness, maintains long-range coherence, and captures intricate textual details.
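To make the segmented cross-attention idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single attention head, video hidden states laid out as `(T, N, D)` (frames, spatial tokens per frame, channels), equal contiguous temporal segments, and randomly initialised projection matrices standing in for the weights of an ordinary cross-attention layer. The function names and shapes are hypothetical and chosen only for illustration. The point it shows is that every segment reuses the same projections, so segmenting adds no parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(video_tokens, text_tokens, w_q, w_k, w_v):
    # Single-head cross-attention: queries come from video tokens,
    # keys and values come from text tokens.
    q = video_tokens @ w_q
    k = text_tokens @ w_k
    v = text_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def segmented_cross_attention(hidden, sub_captions, w_q, w_k, w_v):
    # hidden: (T, N, D) video hidden states (assumed layout).
    # sub_captions: list of K arrays, each (L_k, D_text) text embeddings.
    # The same w_q, w_k, w_v are shared by all segments.
    T, N, D = hidden.shape
    d_out = w_v.shape[1]
    # Split frame indices into K contiguous temporal segments.
    segments = np.array_split(np.arange(T), len(sub_captions))
    out = np.empty((T, N, d_out))
    for frame_idx, text in zip(segments, sub_captions):
        tokens = hidden[frame_idx].reshape(-1, D)
        attended = cross_attention(tokens, text, w_q, w_k, w_v)
        out[frame_idx] = attended.reshape(len(frame_idx), N, d_out)
    return out


# Toy example: 10 latent frames, 4 tokens per frame, 5 sub-captions.
rng = np.random.default_rng(0)
T, N, D, K = 10, 4, 8, 5
hidden = rng.normal(size=(T, N, D))
sub_captions = [rng.normal(size=(6, D)) for _ in range(K)]
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))

out = segmented_cross_attention(hidden, sub_captions, w_q, w_k, w_v)
print(out.shape)
```

In this sketch, changing one sub-caption only affects the frames of its own segment, which is the property that lets each part of the video follow a different part of the progressive caption sequence. How segment boundaries are chosen and smoothed in the actual model is not described in the text above, so the hard, equal-length split here is an assumption.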