One-Step Diffusion Distillation via Deep Equilibrium Models

Zhengyang Geng, Ashwini Pokle, J. Zico Kolter
2023-12-12
Abstract: Diffusion models excel at producing high-quality samples but naively require hundreds of iterations, prompting multiple attempts to distill the generation process into a faster network. However, many existing approaches suffer from a variety of challenges: the process for distillation training can be complex, often requiring multiple training stages, and the resulting models perform poorly when utilized in single-step generative applications. In this paper, we introduce a simple yet effective means of distilling diffusion models directly from initial noise to the resulting image. Of particular importance to our approach is to leverage a new Deep Equilibrium (DEQ) model as the distilled architecture: the Generative Equilibrium Transformer (GET). Our method enables fully offline training with just noise/image pairs from the diffusion model while achieving superior performance compared to existing one-step methods on comparable training budgets. We demonstrate that the DEQ architecture is crucial to this capability, as GET matches a $5\times$ larger ViT in terms of FID scores while striking a critical balance of computational cost and image quality. Code, checkpoints, and datasets are available.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to collapse the multi-step sampling process of a diffusion model into a single generation step, so that samples can be produced far faster. The authors propose a simple distillation method that maps initial noise directly to the final image and trains fully offline, using only noise/image pairs generated by the diffusion model. The distilled network is a new deep equilibrium (DEQ) model, the Generative Equilibrium Transformer (GET), which can adaptively apply transformer layers during the forward pass to balance inference speed against sample quality. GET is also more parameter-efficient than a standard Vision Transformer (ViT): according to the abstract, it matches the FID of a ViT $5\times$ larger. The reported experiments show that GET outperforms existing one-step methods on comparable training budgets.
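
To make the "adaptive" forward pass of a deep equilibrium model concrete, here is a minimal toy sketch in plain Python. It is not the GET architecture: the `layer` function, the weight `w`, and the names `deq_forward`, `tol`, and `max_iters` are all assumptions made for illustration, and a real model would use transformer blocks rather than an elementwise `tanh`. The sketch only shows the general DEQ idea of reapplying one weight-tied layer until the hidden state stops changing, so that a looser tolerance buys speed and a tighter one buys accuracy.

```python
import math


def layer(z, x, w=0.5):
    """Toy weight-tied layer: z_next = tanh(w * z + x), elementwise.

    With |w| < 1 this map is a contraction, so repeated application
    converges to a unique fixed point. (Hypothetical stand-in for a
    transformer block.)
    """
    return [math.tanh(w * zi + xi) for zi, xi in zip(z, x)]


def deq_forward(x, tol=1e-6, max_iters=100):
    """Apply the same layer repeatedly until the state change is below tol.

    Returns the approximate fixed point and the number of layer
    applications used.
    """
    z = [0.0] * len(x)
    step = 0
    for step in range(1, max_iters + 1):
        z_next = layer(z, x)
        # Largest elementwise change between successive iterates.
        residual = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
        if residual < tol:
            break
    return z, step


noise = [0.3, -1.2, 0.8]  # stand-in for the initial noise input

# Loose tolerance: fewer layer applications, rougher answer.
z_fast, n_fast = deq_forward(noise, tol=1e-2)
# Tight tolerance: more layer applications, closer to the true fixed point.
z_fine, n_fine = deq_forward(noise, tol=1e-8)

print("layer applications (loose, tight):", n_fast, n_fine)
```

The same weights serve both calls; only the stopping rule changes. That is the sense in which an equilibrium model can trade computation for quality at inference time without retraining.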