Abstract: While AI-generated content has garnered significant attention, achieving photo-realistic video synthesis remains a formidable challenge. Despite promising advances in the generation quality of video diffusion models, their complex architectures and substantial computational demands for both training and inference create a significant gap between these models and real-world applications. This paper presents SNED, a superposition network architecture search method for efficient video diffusion models. Our method employs a supernet training paradigm that targets various model cost and resolution options using a weight-sharing method. Moreover, we propose a sampling warm-up for supernet training to speed up training optimization. To showcase the flexibility of our method, we conduct experiments involving both pixel-space and latent-space video diffusion models. The results demonstrate that our framework consistently produces comparable results across different model options with high efficiency. In the experiment on the pixel-space video diffusion model, we achieve consistent video generation results simultaneously across resolutions from 64 x 64 to 256 x 256 and across model sizes from 640M to 1.6B parameters.

What problem does this paper attempt to address?

The paper primarily addresses the challenges that video diffusion models face in practical applications: complex model structures, high computational demands, and inefficient training and inference. It proposes a method called SNED (Superposition Network Architecture Search for Efficient Video Diffusion Model). Specifically, the paper aims to solve the following key issues:

1. **Model Efficiency**: Although existing video diffusion models perform well in terms of generation quality, their large model size and computational demands limit their application in real-world scenarios. The paper aims to improve the model's efficiency, making it more suitable for practical deployment.
2. **Network Architecture Design**: The current design process for video diffusion models often requires extensive trial-and-error experiments, which are time-consuming and costly. The SNED method simplifies this process through network architecture search.
3. **Multi-Resolution Support**: To meet the needs of different application scenarios, the model needs to support video generation at various resolutions. SNED achieves this through a superposition training mechanism.
4. **Dynamic Cost Sampling**: To further enhance the model's flexibility and adaptability, SNED introduces dynamic cost sampling, allowing users to select appropriate sub-networks based on specific requirements.

In summary, the main goal of the paper is to develop an efficient and flexible video diffusion model through the SNED method, overcoming the limitations of existing approaches and promoting the application of such models in the real world.
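To make the weight-sharing and sampling ideas above more concrete, here is a minimal Python sketch. It is not the paper's implementation: the class `SharedLinear`, the functions `allowed_options` and `sample_config`, the width fractions, and the specific warm-up rule (start with the largest option and gradually unlock smaller ones) are all assumptions chosen for illustration. Only the 64 to 256 resolution range comes from the abstract.

```python
import random

# Hypothetical option sets. The resolutions follow the abstract's 64-256 range;
# the width fractions are invented for illustration.
RESOLUTIONS = [64, 128, 256]
WIDTH_FRACTIONS = [0.4, 0.6, 0.8, 1.0]


class SharedLinear:
    """Toy linear layer: every sub-network reuses the leading rows of one weight matrix."""

    def __init__(self, in_features, out_features, rng):
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_features)]
                       for _ in range(out_features)]

    def forward(self, x, width_fraction):
        # Slice the shared matrix instead of storing separate weights per sub-network.
        n_in = len(x)
        n_out = max(1, int(len(self.weight) * width_fraction))
        return [sum(w * v for w, v in zip(row[:n_in], x))
                for row in self.weight[:n_out]]


def allowed_options(step, warmup_steps, options):
    """Assumed warm-up rule: begin with the largest option, then unlock smaller ones linearly."""
    ordered = sorted(options, reverse=True)
    if warmup_steps <= 0 or step >= warmup_steps:
        return ordered
    unlocked = 1 + (step * (len(ordered) - 1)) // warmup_steps
    return ordered[:unlocked]


def sample_config(step, warmup_steps, rng):
    """Pick one (resolution, width fraction) pair for the current training step."""
    resolution = rng.choice(allowed_options(step, warmup_steps, RESOLUTIONS))
    width = rng.choice(allowed_options(step, warmup_steps, WIDTH_FRACTIONS))
    return resolution, width
```

In a real training loop, each step would call something like `sample_config` and run the forward and backward pass only through the selected slice of the shared weights, so that one set of parameters serves every cost and resolution option. At inference time, a user would pick the slice that fits their compute budget.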

SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model

Efficiency-optimized Video Diffusion Models

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

SimDA: Simple Diffusion Adapter for Efficient Video Generation

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

Dual-Stream Diffusion Net for Text-to-Video Generation

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference

SF-V: Single Forward Video Generation Model

ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

Video Diffusion Models with Local-Global Context Guidance

Flexiffusion: Segment-wise Neural Architecture Search for Flexible Denoising Schedule

A Conditional Diffusion Model With Fast Sampling Strategy for Remote Sensing Image Super-Resolution

Accelerating Video Diffusion Models via Distribution Matching

4Diffusion: Multi-view Video Diffusion Model for 4D Generation