SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model

Zhengang Li, Yan Kang, Yuchen Liu, Difan Liu, Tobias Hinz, Feng Liu, Yanzhi Wang
2024-06-01
Abstract: While AI-generated content has garnered significant attention, achieving photo-realistic video synthesis remains a formidable challenge. Despite the promising advances of diffusion models in video generation quality, their complex model architecture and substantial computational demands for both training and inference create a significant gap between these models and real-world applications. This paper presents SNED, a superposition network architecture search method for efficient video diffusion models. Our method employs a supernet training paradigm that targets various model cost and resolution options using a weight-sharing method. Moreover, we propose a supernet training sampling warm-up for fast training optimization. To showcase the flexibility of our method, we conduct experiments involving both pixel-space and latent-space video diffusion models. The results demonstrate that our framework consistently produces comparable results across different model options with high efficiency. In the pixel-space experiments, we achieve consistent video generation results simultaneously across resolutions from 64 x 64 to 256 x 256 and across model sizes from 640M to 1.6B parameters.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the obstacles that keep video diffusion models from practical use: complex model structures, high computational demands, and inefficient training and inference. It proposes a method called SNED (Superposition Network Architecture Search for Efficient Video Diffusion Model). Specifically, the paper aims to solve the following key issues:

1. **Model Efficiency**: Although existing video diffusion models perform well in terms of generation quality, their large model size and computational demands limit their application in real-world scenarios. The paper aims to improve model efficiency so that these models are better suited to practical deployment.
2. **Network Architecture Design**: Designing video diffusion models currently requires extensive trial-and-error experiments, which are time-consuming and costly. SNED simplifies this process through network architecture search.
3. **Multi-Resolution Support**: To meet the needs of different application scenarios, the model must support video generation at various resolutions. SNED achieves this through a superposition training mechanism in which a single weight-sharing supernet is trained across resolution options.
4. **Dynamic Cost Sampling**: To further enhance flexibility and adaptability, SNED introduces dynamic cost sampling, which lets users select a sub-network that matches their specific cost requirements.

In summary, the main goal of the paper is to develop an efficient and flexible video diffusion model through the SNED method, overcoming the limitations of existing approaches and making such models more practical for real-world use.
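
The weight-sharing and cost-sampling ideas described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The resolution list reflects the 64 to 256 range mentioned in the abstract, but the cost fractions, the linear warm-up schedule, and the function names `allowed_costs`, `sample_subnet`, and `slice_shared_weight` are all assumptions made for illustration.

```python
import random

# Hypothetical option grids. The 64-256 resolution range comes from the
# abstract; the cost fractions are illustrative values only.
RESOLUTIONS = [64, 128, 256]
COST_FRACTIONS = [1.0, 0.8, 0.6, 0.4]


def allowed_costs(step, warmup_steps):
    """Assumed warm-up schedule: begin with the full model only,
    then unlock smaller cost options linearly over the warm-up."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return list(COST_FRACTIONS)
    unlocked = 1 + (step * (len(COST_FRACTIONS) - 1)) // warmup_steps
    return COST_FRACTIONS[:unlocked]


def sample_subnet(step, warmup_steps, rng):
    """Pick one (resolution, cost fraction) pair for this training step."""
    resolution = rng.choice(RESOLUTIONS)
    cost = rng.choice(allowed_costs(step, warmup_steps))
    return resolution, cost


def slice_shared_weight(weight, cost_fraction):
    """Weight sharing: a sub-network reuses the leading rows of the
    supernet weight instead of owning separate parameters."""
    keep = max(1, int(round(len(weight) * cost_fraction)))
    return weight[:keep]


if __name__ == "__main__":
    rng = random.Random(0)
    supernet_weight = [[float(i)] * 3 for i in range(10)]  # 10 output rows
    for step in (0, 50, 100):
        resolution, cost = sample_subnet(step, warmup_steps=100, rng=rng)
        rows = len(slice_shared_weight(supernet_weight, cost))
        print(f"step={step} resolution={resolution} cost={cost} rows={rows}")
```

In this sketch every sub-network is a prefix slice of the same weight, so training any sampled option updates parameters that the other options also use. That is the general idea behind a weight-sharing supernet; the actual slicing and sampling rules in SNED may differ.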