EM Distillation for One-step Diffusion Models

Sirui Xie, Zhisheng Xiao, Diederik P Kingma, Tingbo Hou, Ying Nian Wu, Kevin Patrick Murphy, Tim Salimans, Ben Poole, Ruiqi Gao
2024-05-27
Abstract: While diffusion models can learn complex distributions, sampling requires a computationally expensive iterative process. Existing distillation methods enable efficient sampling, but have notable limitations, such as performance degradation with very few sampling steps, reliance on training data access, or mode-seeking optimization that may fail to capture the full distribution. We propose EM Distillation (EMD), a maximum likelihood-based approach that distills a diffusion model to a one-step generator model with minimal loss of perceptual quality. Our approach is derived through the lens of Expectation-Maximization (EM), where the generator parameters are updated using samples from the joint distribution of the diffusion teacher prior and inferred generator latents. We develop a reparametrized sampling scheme and a noise cancellation technique that together stabilize the distillation process. We further reveal an interesting connection of our method with existing methods that minimize mode-seeking KL. EMD outperforms existing one-step generative methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to distill a pretrained diffusion model into a generator that produces samples in a single step. Diffusion models generate high-quality images and data in other modalities, but sampling from them requires many iterations and is therefore computationally expensive. Existing distillation methods accelerate sampling, but they have limitations: performance degrades when very few sampling steps are used, some require access to the training data, and some rely on mode-seeking optimization that may fail to capture the full distribution. The paper introduces EM Distillation (EMD), a method based on maximum likelihood estimation that minimizes a mode-covering divergence between the pretrained diffusion teacher and a latent-variable student model. The student's parameters are updated within the Expectation-Maximization (EM) framework, using samples from the joint distribution of the teacher's samples and the inferred generator latents. A reparametrized sampling scheme and a noise cancellation technique together stabilize the distillation process. The paper also shows how EMD connects to existing methods that minimize a mode-seeking KL divergence, such as Variational Score Distillation and Diff-Instruct, and demonstrates that adjusting the amount of Markov Chain Monte Carlo (MCMC) sampling trades off mode seeking against mode covering. Experimentally, EMD outperforms existing one-step generative methods in FID on class-conditional ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models.
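The contrast between mode-seeking and mode-covering objectives can be illustrated with a toy example. The sketch below is not EMD and is not taken from the paper. It fits a single Gaussian "student" to a bimodal "teacher" by brute-force grid search, once under the forward KL, KL(teacher || student), which is mode covering, and once under the reverse KL, KL(student || teacher), which is mode seeking. All names (`log_gaussian`, `kl_divergence`, `best_fit`), the mixture parameters, and the grid ranges are assumptions chosen only for illustration.

```python
import numpy as np

def log_gaussian(x, mean, std):
    """Log-density of a 1-D Gaussian, evaluated elementwise on x."""
    z = (x - mean) / std
    return -0.5 * z ** 2 - np.log(std * np.sqrt(2.0 * np.pi))

def kl_divergence(log_p, log_q, dx):
    """Riemann-sum estimate of KL(p || q) on a uniform grid with spacing dx."""
    return float(np.sum(np.exp(log_p) * (log_p - log_q)) * dx)

# Uniform grid used for numerical integration.
x = np.linspace(-12.0, 12.0, 2401)
dx = x[1] - x[0]

# Toy "teacher": an equal-weight mixture of two well-separated Gaussians.
log_teacher = np.logaddexp(
    np.log(0.5) + log_gaussian(x, -3.0, 0.5),
    np.log(0.5) + log_gaussian(x, 3.0, 0.5),
)

# Candidate "students": single Gaussians on a (mean, std) grid.
means = np.linspace(-4.0, 4.0, 81)
stds = np.linspace(0.3, 4.0, 75)

def best_fit(direction):
    """Return (kl, mean, std) of the grid point minimizing the chosen KL."""
    best = None
    for m in means:
        for s in stds:
            log_student = log_gaussian(x, m, s)
            if direction == "forward":   # KL(teacher || student): mode covering
                d = kl_divergence(log_teacher, log_student, dx)
            else:                        # KL(student || teacher): mode seeking
                d = kl_divergence(log_student, log_teacher, dx)
            if best is None or d < best[0]:
                best = (d, float(m), float(s))
    return best

fwd = best_fit("forward")
rev = best_fit("reverse")
print("forward KL fit: mean=%.2f std=%.2f" % (fwd[1], fwd[2]))
print("reverse KL fit: mean=%.2f std=%.2f" % (rev[1], rev[2]))
```

In this toy setting the forward-KL fit is a wide Gaussian centered between the two modes, so it covers both, while the reverse-KL fit collapses onto a single mode and ignores the other. A single Gaussian is a far weaker student than a one-step generator, so the example only conveys why a mode-covering objective is less prone to dropping parts of the teacher distribution.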