Multistep Consistency Models

Jonathan Heek, Emiel Hoogeboom, Tim Salimans
2024-06-03
Abstract: Diffusion models are relatively easy to train but require many steps to generate samples. Consistency models are far more difficult to train, but generate samples in a single step. In this paper we propose Multistep Consistency Models: a unification between Consistency Models (Song et al., 2023) and TRACT (Berthelot et al., 2023) that can interpolate between a consistency model and a diffusion model, giving a trade-off between sampling speed and sampling quality. Specifically, a 1-step consistency model is a conventional consistency model, whereas an $\infty$-step consistency model is a diffusion model. Multistep Consistency Models work really well in practice. By increasing the sample budget from a single step to 2-8 steps, we can more easily train models that generate higher-quality samples, while retaining much of the sampling speed benefit. Notable results are 1.4 FID on ImageNet64 in 8 steps and 2.1 FID on ImageNet128 in 8 steps with consistency distillation, using simple losses without adversarial training. We also show that our method scales to a text-to-image diffusion model, generating samples that are close to the quality of the original model.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the fact that diffusion models require many steps to generate samples, which makes sampling slow and computationally expensive. Consistency models cut sampling to a single step, but at the expense of image quality. The paper therefore proposes Multistep Consistency Models, which interpolate between consistency models and diffusion models to balance sampling speed against sample quality. Specifically, the goals of the paper are:

1. **Improving generation quality**: Increasing the sampling budget from 1 step to 2-8 steps yields higher-quality samples while keeping sampling fast.
2. **Easing training**: Compared with conventional single-step consistency models, Multistep Consistency Models are easier to train and can come close to the performance of standard diffusion models with fewer steps.
3. **Expanding application scope**: The method applies not only to image generation but also to text-to-image generation, where sample quality is close to that of the original model.

The paper pursues these goals by combining consistency training and distillation with an improved deterministic sampler (Adjusted DDIM). With consistency distillation, the method reaches FID scores of 1.4 on ImageNet64 and 2.1 on ImageNet128 at 8 sampling steps. In text-to-image generation, it also produces samples comparable in quality to those of the teacher model.
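The trade-off between step count and quality described above can be made concrete with a minimal sampling sketch. It is an illustration under stated assumptions, not the authors' implementation: the cosine noise schedule, the equally spaced time grid, the plain DDIM update (rather than the paper's Adjusted DDIM), and the names `alpha_sigma`, `ddim_step`, `multistep_sample`, and `toy_denoiser` are all hypothetical choices made here. The idea shown is that each step predicts a clean sample and then moves deterministically to the next, lower noise level, so `num_steps=1` reduces to a single-step consistency-style sample while larger values approach ordinary diffusion sampling.

```python
import numpy as np

def alpha_sigma(t):
    """Assumed variance-preserving cosine schedule: alpha^2 + sigma^2 = 1."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def ddim_step(x_hat, z_t, t, s):
    """Plain deterministic DDIM update from time t to an earlier time s."""
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_s, sigma_s = alpha_sigma(s)
    # Noise estimate implied by the current clean-sample prediction.
    eps_hat = (z_t - alpha_t * x_hat) / sigma_t
    return alpha_s * x_hat + sigma_s * eps_hat

def multistep_sample(denoiser, shape, num_steps, rng):
    """Sample with `num_steps` denoiser calls on an equally spaced time grid."""
    z = rng.standard_normal(shape)                # start from pure noise at t = 1
    times = np.linspace(1.0, 0.0, num_steps + 1)  # 1 = t_0 > t_1 > ... > 0
    for t, s in zip(times[:-1], times[1:]):
        x_hat = denoiser(z, t)                    # predicted clean sample
        z = ddim_step(x_hat, z, t, s)             # move to the next noise level
    return z

# Hypothetical stand-in for a trained network: the ideal denoiser when all
# data sits at a single point `MU`, which always predicts that point.
MU = np.array([1.0, -2.0, 0.5])

def toy_denoiser(z, t):
    return np.broadcast_to(MU, z.shape)

samples = multistep_sample(toy_denoiser, (4, 3), num_steps=8,
                           rng=np.random.default_rng(0))
```

In a real setting `denoiser` would be the trained multistep consistency network, and the only knob changed at sampling time would be `num_steps`.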