Multistep Distillation of Diffusion Models via Moment Matching

Tim Salimans, Thomas Mensink, Jonathan Heek, Emiel Hoogeboom
2024-06-06
Abstract: We present a new method for making diffusion models faster to sample. The method distills many-step diffusion models into few-step models by matching conditional expectations of the clean data given noisy data along the sampling trajectory. Our approach extends recently proposed one-step methods to the multi-step case, and provides a new perspective by interpreting these approaches in terms of moment matching. By using up to 8 sampling steps, we obtain distilled models that outperform not only their one-step versions but also their original many-step teacher models, obtaining new state-of-the-art results on the Imagenet dataset. We also show promising results on a large text-to-image model where we achieve fast generation of high resolution images directly in image space, without needing autoencoders or upsamplers.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition, Neural and Evolutionary Computing
What problem does this paper attempt to address?
This paper addresses the sampling cost of diffusion models when generating high-dimensional data such as images, videos, and audio. Diffusion models produce high-quality samples, but sampling usually requires hundreds of neural network evaluations, which makes them expensive in practical applications. The authors propose distilling a many-step diffusion model into a few-step model by matching the conditional expectations of the clean data given noisy data at different noise levels. The method speeds up sampling and, in some cases, the distilled model outperforms the original many-step model.

### Main contributions of the paper

1. **Multistep distillation method**: The authors propose a distillation method that reduces the number of sampling steps by matching conditional expectations, which accelerates generation with diffusion models.
2. **Theoretical explanation**: The paper interprets existing one-step distillation methods in terms of moment matching and extends them to the multi-step case.
3. **Performance improvement**: Using at most 8 sampling steps, the distilled models outperform both their one-step versions and their original many-step teacher models, achieving new state-of-the-art results on the ImageNet dataset.
4. **Text-to-image generation**: The authors apply the method to a large text-to-image model and generate high-resolution images quickly, directly in image space, without autoencoders or upsamplers.

### Specific technical details

- **Background introduction**:
  - Diffusion models generate high-dimensional data through a step-by-step denoising process.
  - The sampling process usually requires hundreds of neural network evaluations, resulting in high computational cost.
  - Existing distillation methods can be divided into two types: deterministic and distributional.
- **Moment-matching distillation**:
  - A many-step diffusion model is distilled into a few-step model by matching the conditional expectations of the clean data at different noise levels.
  - Two variants are used: alternating optimization and parameter-space moment matching.
  - The alternating optimization variant approximates the conditional expectation with an auxiliary denoising model.
  - The parameter-space variant performs moment matching directly in parameter space, which avoids the additional auxiliary model.
- **Experimental results**:
  - On the ImageNet dataset, the distilled model with 8 sampling steps outperforms the original many-step model in terms of the FID metric.
  - On the text-to-image task, the distilled model generates high-quality images quickly.

### Conclusion

This paper proposes a multistep distillation method that substantially reduces the sampling cost of diffusion models while maintaining, and in some cases improving, generation quality. The method gives state-of-the-art results on ImageNet and promising results on a large text-to-image model.
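
To make "the conditional expectation of the clean data given noisy data" concrete, here is a minimal NumPy sketch of sampling in a handful of steps on a one-dimensional toy problem. It is an illustration under stated assumptions, not the paper's implementation: the cosine noise schedule, the Gaussian toy data, and the names `schedule`, `exact_denoiser`, and `few_step_sample` are all hypothetical choices made for this example. Because the data are Gaussian, the conditional expectation is available in closed form and stands in for a denoising network; each step predicts the clean data and then samples a less noisy state from the standard Gaussian-diffusion posterior.

```python
import numpy as np


def schedule(t):
    """Assumed variance-preserving cosine schedule: alpha_t^2 + sigma_t^2 = 1."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)


def exact_denoiser(z, t, mu, s):
    """Closed-form E[x | z_t] for x ~ N(mu, s^2), z_t = alpha_t * x + sigma_t * eps."""
    alpha, sigma = schedule(t)
    gain = alpha * s**2 / (alpha**2 * s**2 + sigma**2)
    return mu + gain * (z - alpha * mu)


def few_step_sample(n, k, mu=2.0, s=0.5, seed=0):
    """Draw n samples using k steps of predict-then-renoise sampling (toy setup)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)  # start from pure noise at t = 1
    for i in range(k, 0, -1):
        t, t_next = i / k, (i - 1) / k
        x_hat = exact_denoiser(z, t, mu, s)  # predicted clean data
        alpha_t, sigma_t = schedule(t)
        alpha_s, sigma_s = schedule(t_next)
        ratio = alpha_t / alpha_s
        var_ts = sigma_t**2 - ratio**2 * sigma_s**2
        # Gaussian posterior q(z_s | z_t, x) evaluated at x = x_hat.
        mean = (ratio * sigma_s**2 / sigma_t**2) * z \
            + (alpha_s * var_ts / sigma_t**2) * x_hat
        std = np.sqrt(max(var_ts * sigma_s**2 / sigma_t**2, 0.0))
        z = mean + std * rng.standard_normal(n)
    return z  # at t_next = 0 the posterior collapses onto x_hat


samples = few_step_sample(n=20000, k=8)
print(samples.mean(), samples.std())
```

In this toy setup, calling `few_step_sample(n, k=1)` returns almost exactly the data mean for every sample, because the exact conditional expectation at the highest noise level carries no information about the individual sample. That is one way to see why a few-step generator has to be trained for the purpose, as the distillation described above does, rather than reusing a many-step denoiser unchanged.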