Boosting Latent Diffusion with Perceptual Objectives

Tariq Berrada, Pietro Astolfi, Jakob Verbeek, Melissa Hall, Marton Havasi, Michal Drozdzal, Yohann Benchetrit, Adriana Romero-Soriano, Karteek Alahari
2024-11-07
Abstract: Latent diffusion models (LDMs) power state-of-the-art high-resolution generative image models. LDMs learn the data distribution in the latent space of an autoencoder (AE) and produce images by mapping the generated latents into RGB image space using the AE decoder. While this approach allows for efficient model training and sampling, it induces a disconnect between the training of the diffusion model and the decoder, resulting in a loss of detail in the generated images. To remediate this disconnect, we propose to leverage the internal features of the decoder to define a latent perceptual loss (LPL). This loss encourages the models to create sharper and more realistic images. Our loss can be seamlessly integrated with common autoencoders used in latent diffusion models, and can be applied to different generative modeling paradigms such as DDPM with epsilon and velocity prediction, as well as flow matching. Extensive experiments with models trained on three datasets at 256 and 512 resolution show improved quantitative -- with boosts between 6% and 20% in FID -- and qualitative results when using our perceptual loss.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is: **how to improve the quality of images generated by latent diffusion models (LDMs), in particular how to preserve detail and realism in high-resolution image generation**.

### Specific problem background:
1. **Disconnect between the latent space and the decoder**: LDMs learn the data distribution in the latent space of an autoencoder (AE) and map the generated latents to RGB image space through the AE decoder. Because the diffusion model is trained without reference to the decoder, the two are disconnected, and the generated images lack high-frequency detail and are of lower quality.
2. **Irregularity of the latent space**: The latent space of the pre-trained LDM autoencoder is usually highly irregular, so small changes in a latent can produce large changes in the decoded image, which further exacerbates the disconnect between the autoencoder and the diffusion model.

### Proposed solution:
To address these problems, the authors propose a new loss function, the **Latent Perceptual Loss (LPL)**. Specifically:
- **The role of LPL**: LPL defines the loss on the intermediate features of the autoencoder's decoder, thereby bridging the gap between the diffusion model and the decoder. This encourages the model to generate sharper, more realistic images with better structural consistency.
- **Implementation of LPL**: LPL is obtained by standardizing the features at different levels of the decoder, detecting outliers, and normalizing, then computing the squared distance between corresponding features and taking a weighted sum over the levels.

### Experimental verification:
The authors ran experiments on three datasets (ImageNet-1k, CC12M, and S320M) to verify the effectiveness of LPL. The results show that LPL improves the quality of generated images, with FID improving by 6% to 20% (lower FID is better), and that the generated images are more visually realistic and contain more high-frequency detail.

### Summary:
The main contributions of this paper are:
- Proposing the Latent Perceptual Loss (LPL), which uses the intermediate features of the autoencoder's decoder to improve the generation quality of LDMs.
- Verifying the effectiveness of LPL on multiple datasets and generative modeling frameworks, showing gains in image quality and detail.
- Demonstrating that LPL applies to different generative models (such as DDPM and conditional flow-matching models).

By introducing LPL, the authors mitigate the disconnect between the latent space and the decoder in LDMs and improve the quality and detail of generated images.
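
The LPL computation described under "Implementation of LPL" can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the per-channel standardization using the target feature statistics, the z-score outlier rule with threshold `z_max`, and the reading of "normalizing" as averaging over the retained entries are all hypothetical choices made for this example. Each decoder level is represented as a list of channels, and each channel as a flat list of spatial values.

```python
from statistics import fmean, pstdev


def level_distance(pred, target, z_max=3.0, eps=1e-6):
    """Squared distance between standardized features at one decoder level.

    pred, target: list of channels, each a flat list of spatial values.
    Assumption: statistics come from the target features, and positions
    whose standardized target value exceeds z_max are treated as outliers.
    """
    total, count = 0.0, 0
    for pred_ch, tgt_ch in zip(pred, target):
        mu, sigma = fmean(tgt_ch), pstdev(tgt_ch)
        for p, t in zip(pred_ch, tgt_ch):
            p_std = (p - mu) / (sigma + eps)  # standardize prediction
            t_std = (t - mu) / (sigma + eps)  # standardize target
            if abs(t_std) > z_max:            # skip detected outliers
                continue
            total += (p_std - t_std) ** 2
            count += 1
    # Normalize by the number of retained entries (assumed interpretation).
    return total / count if count else 0.0


def latent_perceptual_loss(pred_feats, target_feats, weights, z_max=3.0):
    """Weighted sum of per-level distances over the decoder levels."""
    return sum(
        w * level_distance(p, t, z_max)
        for w, p, t in zip(weights, pred_feats, target_feats)
    )


# Toy usage: two decoder levels with one and two channels respectively.
target_feats = [[[0.0, 1.0, 2.0, 3.0]],
                [[1.0, -1.0, 0.5, 2.0], [0.0, 4.0, 2.0, 1.0]]]
pred_feats = [[[0.5, 1.0, 2.0, 3.0]], target_feats[1]]
loss = latent_perceptual_loss(pred_feats, target_feats, weights=[1.0, 0.5])
print(loss)
```

In an actual training setup the feature lists would come from running the decoder on the predicted and target latents and reading its intermediate activations; that step is omitted here because it depends on the specific autoencoder.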