Abstract: Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage. To reduce the complexity, the prevailing approaches employ a cascaded architecture to avoid direct training with full resolution. Despite reducing computational demands, the separate optimization of each sub-stage hinders knowledge sharing and sacrifices flexibility. This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling. Through our sophisticated design, the flows of different pyramid stages can be interlinked to maintain continuity. Moreover, we craft autoregressive video generation with a temporal pyramid to compress the full-resolution history. The entire framework can be optimized in an end-to-end manner and with a single unified Diffusion Transformer (DiT). Extensive experiments demonstrate that our method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours. All code and models will be open-sourced at https://pyramid-flow.github.io.

What problem does this paper attempt to address?

This paper aims to solve the problems of high computational complexity and data-intensiveness faced by video generation models during training and inference. Specifically, the paper focuses on the following two main challenges:

1. **Spatio-temporal complexity**:
   - Video generation requires modeling a large spatio-temporal space, which demands a large amount of computational resources and data. To reduce this complexity, existing methods usually adopt a cascaded architecture, generating video frames step by step in multiple stages, thus avoiding training directly at full resolution.
   - However, these methods optimize each sub-stage independently, which hinders knowledge sharing and sacrifices a certain degree of flexibility and scalability.
2. **Computational efficiency**:
   - In video generation, the early time steps are usually very noisy and contain little information, so it may not be necessary to operate at full resolution. Moreover, processing high-resolution video data leads to redundant calculations, increasing the training time and computational cost.

To solve these problems, the authors propose a new efficient video generation framework, **Pyramidal Flow Matching**. The main innovations of this algorithm include:

- **Pyramidal flow matching**: Reinterpreting the original denoising trajectory as a series of pyramid stages, where only the final stage runs at full resolution, thus significantly reducing redundant calculations in the early time steps.
- **Spatio-temporal pyramid design**: Introducing spatial pyramids and temporal pyramids to compress representations at different scales, further improving the training efficiency.
- **Unified Diffusion Transformer (DiT)**: Achieving end-to-end joint training through a unified DiT model, avoiding the separate optimization of multiple models, simplifying the implementation and accelerating the training process.

Through these improvements, this method can significantly reduce the required computational resources and training time while maintaining high-quality video generation. Experimental results show that this method can generate high-quality videos up to 10 seconds long at 768p resolution and 24 FPS, and only requires 20.7k A100 GPU hours of training time.
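
To make the efficiency argument concrete, the following is a minimal sketch that compares the number of spatial positions processed when only the final stage runs at full resolution against running every stage at full resolution. It is an illustration under stated assumptions, not the authors' implementation: the function names `stage_resolutions` and `relative_token_cost` are hypothetical, each stage is assumed to halve both spatial sides, the denoising steps are assumed to be split evenly across stages, and the 1280x768 frame size and three stages are example values chosen here. Position count is only a linear proxy; attention cost in a transformer grows faster than linearly with the number of tokens.

```python
def stage_resolutions(height, width, num_stages):
    """Return (height, width) for each stage, coarsest first.

    Assumption: every earlier stage halves both spatial sides of the next one,
    so only the last entry is at full resolution.
    """
    return [
        (height // 2 ** k, width // 2 ** k)
        for k in reversed(range(num_stages))
    ]


def relative_token_cost(height, width, num_stages, steps_per_stage):
    """Ratio of positions processed by the pyramid schedule to a
    schedule that runs every step at full resolution.

    Assumption: each stage uses the same number of denoising steps.
    """
    full = height * width * num_stages * steps_per_stage
    pyramid = sum(
        h * w * steps_per_stage
        for h, w in stage_resolutions(height, width, num_stages)
    )
    return pyramid / full


if __name__ == "__main__":
    # Example values (assumed): a 1280x768 frame and 3 pyramid stages.
    print(stage_resolutions(768, 1280, 3))
    print(f"{relative_token_cost(768, 1280, 3, 10):.4f}")  # 0.4375
```

Under these assumptions the pyramid schedule touches (1/16 + 1/4 + 1) / 3 = 0.4375 of the positions that a full-resolution schedule would, which is the kind of saving in the early, noisy time steps that the summary describes.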

Pyramidal Flow Matching for Efficient Video Generative Modeling
