Pyramidal Flow Matching for Efficient Video Generative Modeling

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, Zhouchen Lin
2024-10-08
Abstract: Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage. To reduce the complexity, the prevailing approaches employ a cascaded architecture to avoid direct training with full resolution. Despite reducing computational demands, the separate optimization of each sub-stage hinders knowledge sharing and sacrifices flexibility. This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling. Through our sophisticated design, the flows of different pyramid stages can be interlinked to maintain continuity. Moreover, we craft autoregressive video generation with a temporal pyramid to compress the full-resolution history. The entire framework can be optimized in an end-to-end manner and with a single unified Diffusion Transformer (DiT). Extensive experiments demonstrate that our method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours. All code and models will be open-sourced at https://pyramid-flow.github.io.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses the high computational cost and heavy data requirements that video generation models face during training and inference. It focuses on two main challenges:

1. **Spatio-temporal complexity**:
   - Video generation requires modeling a large spatio-temporal space, which demands substantial computational resources and data. To reduce this complexity, existing methods usually adopt a cascaded architecture that generates video in multiple stages, avoiding direct training at full resolution.
   - However, these methods optimize each sub-stage separately, which hinders knowledge sharing and sacrifices flexibility and scalability.
2. **Computational efficiency**:
   - The early time steps of generation are very noisy and carry little information, so operating at full resolution there may be unnecessary. Processing high-resolution video data at every step therefore leads to redundant computation, increasing training time and cost.

To solve these problems, the authors propose an efficient video generation framework, **Pyramidal Flow Matching**. Its main innovations are:

- **Pyramidal flow matching**: Reinterpreting the original denoising trajectory as a series of pyramid stages, where only the final stage runs at full resolution, which removes much of the redundant computation in the early time steps.
- **Spatio-temporal pyramid design**: Introducing a spatial pyramid and a temporal pyramid that compress representations at different scales, further improving training efficiency.
- **Unified Diffusion Transformer (DiT)**: Training the whole framework end to end with a single DiT, which avoids optimizing multiple models separately, simplifies the implementation, and speeds up training.
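The idea of running early stages at reduced resolution can be illustrated with a minimal sketch. It is not the paper's formulation: it uses the plain linear-interpolation flow `x_t = (1 - t) * x0 + t * x1` with velocity target `x1 - x0`, average pooling as the downsampler, and a single 2D frame in place of video latents. The names `downsample` and `stage_training_pair` are hypothetical.

```python
import numpy as np

def downsample(x, factor):
    """Average-pool a (H, W) array by an integer factor (assumed downsampler)."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def stage_training_pair(x1, stage, num_stages, t, rng):
    """Build (x_t, target velocity) at the resolution of one pyramid stage.

    Stage 0 is the coarsest; stage num_stages - 1 is full resolution.
    Assumption: a simple linear flow between Gaussian noise and the
    downsampled data, which simplifies the paper's linked per-stage flows.
    """
    factor = 2 ** (num_stages - 1 - stage)   # each stage halves H and W
    target = downsample(x1, factor)          # clean sample at stage resolution
    noise = rng.standard_normal(target.shape)
    x_t = (1.0 - t) * noise + t * target     # point on the straight path
    velocity = target - noise                # regression target for the model
    return x_t, velocity

rng = np.random.default_rng(0)
frame = rng.standard_normal((64, 64))        # stand-in for one latent frame
for stage in range(3):
    x_t, v = stage_training_pair(frame, stage, num_stages=3, t=0.5, rng=rng)
    print(stage, x_t.shape)
```

With three stages and a 64×64 input, the loop produces training pairs of shape (16, 16), (32, 32), and (64, 64), so only the last stage touches the full-resolution sample.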
Through these improvements, the method significantly reduces the computational resources and training time required while maintaining high-quality video generation. Experimental results show that it can generate high-quality 5-second videos (up to 10 seconds) at 768p resolution and 24 FPS, with only 20.7k A100 GPU hours of training.
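To get a rough feel for where the savings come from, the following back-of-the-envelope sketch estimates the average number of spatial tokens processed per step, relative to running every step at full resolution. It rests on two assumptions that are not taken from the summary above: each coarser stage halves both height and width, and the steps are split equally across stages. The function name `relative_token_count` is hypothetical.

```python
from fractions import Fraction

def relative_token_count(num_stages):
    """Average tokens per step relative to full resolution.

    Assumptions: each coarser stage has 1/4 the tokens of the next finer
    one, and every stage receives an equal share of the steps.
    """
    per_stage = [Fraction(1, 4 ** k) for k in range(num_stages)]
    return sum(per_stage) / num_stages

for k in (1, 2, 3):
    print(k, float(relative_token_count(k)))
```

Under these assumptions, three stages process 7/16 (about 44%) of the tokens that a full-resolution trajectory would. The actual compute saving depends on how model cost scales with token count and on how the steps are really distributed.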