Guiding a Diffusion Model with a Bad Version of Itself

Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, Samuli Laine
2024-06-05
Abstract: The primary axes of interest in image-generating diffusion models are image quality, the amount of variation in the results, and how well the results align with a given condition, e.g., a class label or a text prompt. The popular classifier-free guidance approach uses an unconditional model to guide a conditional model, leading to simultaneously better prompt alignment and higher-quality images at the cost of reduced variation. These effects seem inherently entangled, and thus hard to control. We make the surprising observation that it is possible to obtain disentangled control over image quality without compromising the amount of variation by guiding generation using a smaller, less-trained version of the model itself rather than an unconditional model. This leads to significant improvements in ImageNet generation, setting record FIDs of 1.01 for 64x64 and 1.25 for 512x512, using publicly available networks. Furthermore, the method is also applicable to unconditional diffusion models, drastically improving their quality.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Neural and Evolutionary Computing
What problem does this paper attempt to address?
This paper addresses how to improve image quality in image-generating diffusion models without giving up control over the amount of variation. Existing classifier-free guidance (CFG) improves condition alignment and image quality, but it reduces variation and has further limitations: it applies only to conditional generation, and it may cause sampling trajectories to deviate from the desired distribution.
The authors observe that guiding generation with a smaller, less-trained version of the main model makes it possible to control image quality without sacrificing variation. They call this method "autoguidance": the guiding model is a weaker version of the main model (e.g., with reduced capacity or training time) rather than an unconditional model. On ImageNet this yields FIDs of 1.01 at 64×64 resolution and 1.25 at 512×512 resolution, both new records. The method also applies to unconditional diffusion models, where it drastically improves quality.
The paper also analyzes why CFG improves image quality, and observes that a model with limited capacity tends to place too much weight on low-probability regions. Because the weaker guiding model makes similar but larger errors than the main model, the difference between their predictions indicates where the main model errs, and guidance steers samples away from those errors. Experiments show that autoguidance works as long as both models suffer from compatible degradations.
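The guiding rule described above can be sketched as a linear extrapolation between two denoiser outputs. This is a minimal illustration, not the authors' code: the function name `guided_denoise`, the toy arrays, and the weight value are assumptions made for the example. It uses the usual guidance form in which a weight of 1 returns the main model's prediction and larger weights push further away from the guiding model's prediction.

```python
import numpy as np

def guided_denoise(d_main, d_guide, w):
    """Extrapolate from the guiding model's output past the main model's output.

    d_main  : prediction of the main (stronger) model
    d_guide : prediction of the guiding (weaker) model, given the same input
              and, in autoguidance, the same conditioning
    w       : guidance weight (hypothetical name); w = 1 disables guidance
    """
    d_main = np.asarray(d_main, dtype=float)
    d_guide = np.asarray(d_guide, dtype=float)
    # Equivalent to w * d_main + (1 - w) * d_guide.
    return d_guide + w * (d_main - d_guide)

# Toy stand-ins for the two denoiser outputs (illustrative values only).
main_out = np.array([1.0, 2.0, 3.0])
weak_out = np.array([0.5, 2.5, 3.0])

guided = guided_denoise(main_out, weak_out, w=2.0)
print(guided)
```

Where the two predictions agree (the last component here), guidance changes nothing; where they differ, the result moves further in the direction from the weak prediction toward the main one. In CFG the guiding prediction would instead come from an unconditional model.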