Latent Denoising Diffusion GAN: Faster sampling, Higher image quality

Luan Thanh Trinh, Tomoki Hamagami
DOI: https://doi.org/10.1109/ACCESS.2024.3406535
2024-06-18
Abstract: Diffusion models are emerging as powerful solutions for generating high-fidelity and diverse images, often surpassing GANs under many circumstances. However, their slow inference speed hinders their potential for real-time applications. To address this, DiffusionGAN leveraged a conditional GAN to drastically reduce the denoising steps and speed up inference. Its advancement, Wavelet Diffusion, further accelerated the process by converting data into wavelet space, thus enhancing efficiency. Nonetheless, these models still fall short of GANs in terms of speed and image quality. To bridge these gaps, this paper introduces the Latent Denoising Diffusion GAN, which employs pre-trained autoencoders to compress images into a compact latent space, significantly improving inference speed and image quality. Furthermore, we propose a Weighted Learning strategy to enhance diversity and image quality. Experimental results on the CIFAR-10, CelebA-HQ, and LSUN-Church datasets prove that our model achieves state-of-the-art running speed among diffusion models. Compared to its predecessors, DiffusionGAN and Wavelet Diffusion, our model shows remarkable improvements in all evaluation metrics. Code and pre-trained checkpoints: https://github.com/thanhluantrinh/LDDGAN.git
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main aim of this paper is to address the following issues:

### Paper Objectives

- **Improve Sampling Speed**: Address the slow inference speed of diffusion models in image generation tasks to meet the needs of real-time applications.
- **Enhance Image Quality**: Improve the quality of images generated by diffusion models to match or surpass the current state-of-the-art models (such as StyleGAN).

### Method Overview

To achieve the above objectives, the authors propose the "Latent Denoising Diffusion GAN" (LDDGAN). The core of this method lies in:

1. **Using a Pre-trained Autoencoder to Compress Input Images**: Compressing input images into a low-dimensional latent space to reduce computational costs and accelerate the training and inference process.
2. **Employing Conditional Generative Adversarial Networks (GAN) for Denoising**: Using conditional GANs to model complex and multimodal distributions, allowing for larger denoising steps and thus reducing the number of required denoising steps.
3. **Proposing a Weighted Learning Strategy**: Combining the advantages of adversarial loss and reconstruction loss to enhance image diversity while ensuring image quality.

### Main Contributions

- Proposed a novel latent denoising diffusion GAN framework that leverages the compatibility of dimensionality reduction and low-dimensional latent space with the denoising process of diffusion models, thereby improving inference speed, image quality, and diversity.
- Discovered that if the denoising process of diffusion models does not rely on Gaussian distribution, it is necessary to eliminate the autoencoder's dependency on learning Gaussian distribution to enhance the diversity and quality of generated images.
- Proposed a new strategy called weighted learning, which enhances diversity through adversarial loss while improving image quality using reconstruction loss.
- Achieved lower training costs and state-of-the-art inference speed, paving the way for real-time, high-fidelity diffusion models.

### Experimental Results

- Experimental results on standard benchmark datasets (such as CIFAR-10, CelebA-HQ, and LSUN Church) show that the model achieves the fastest running speed among diffusion models while maintaining high-quality image generation.
- Comparable to GAN models in terms of image generation quality, with higher diversity.
- Significant improvements over previous models DiffusionGAN and Wavelet Diffusion in most evaluation metrics.
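
The few-step latent sampling described in the method overview can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard Gaussian diffusion posterior q(x_{t-1} | x_t, x_0) used by DDPM-style models, and the names `posterior_coefficients`, `sample_latent`, `toy_generator`, `toy_decoder`, and the four-step `betas` schedule are all hypothetical. The toy generator stands in for the conditional GAN generator, and the toy decoder stands in for the pre-trained autoencoder's decoder.

```python
import math
import random


def posterior_coefficients(betas):
    """Coefficients of the Gaussian posterior q(x_{t-1} | x_t, x_0) per step.

    Assumption: standard DDPM-style forward process with noise levels `betas`.
    Returns a list of (x0_coef, xt_coef, variance), index 0 being step t = 1.
    """
    coeffs = []
    alpha_bar_prev = 1.0  # cumulative alpha product before step t (1 at t = 1)
    for beta in betas:
        alpha = 1.0 - beta
        alpha_bar = alpha_bar_prev * alpha
        x0_coef = math.sqrt(alpha_bar_prev) * beta / (1.0 - alpha_bar)
        xt_coef = math.sqrt(alpha) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar)
        variance = beta * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar)
        coeffs.append((x0_coef, xt_coef, variance))
        alpha_bar_prev = alpha_bar
    return coeffs


def sample_latent(generator, betas, latent_dim, rng):
    """Run len(betas) denoising steps in latent space and return the clean latent."""
    coeffs = posterior_coefficients(betas)
    # Start from pure Gaussian noise in the (small) latent space.
    x_t = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    for t in range(len(betas), 0, -1):
        # The generator is conditioned on x_t, a fresh noise vector z, and t,
        # and predicts the clean latent x_0 directly (one large denoising step).
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        x0_pred = generator(x_t, z, t)
        x0_coef, xt_coef, variance = coeffs[t - 1]
        std = math.sqrt(variance)
        # Sample x_{t-1} from the posterior; the variance is 0 at t = 1.
        x_t = [
            x0_coef * x0 + xt_coef * xt + std * rng.gauss(0.0, 1.0)
            for x0, xt in zip(x0_pred, x_t)
        ]
    return x_t


def toy_generator(x_t, z, t):
    # Hypothetical stand-in for the trained conditional GAN generator:
    # it ignores its inputs and always predicts the same clean latent.
    return [0.5, -0.25, 1.0]


def toy_decoder(latent):
    # Hypothetical stand-in for the pre-trained autoencoder's decoder.
    return [2.0 * v for v in latent]


betas = [0.1, 0.3, 0.5, 0.7]  # hypothetical 4-step noise schedule
latent = sample_latent(toy_generator, betas, latent_dim=3, rng=random.Random(0))
image = toy_decoder(latent)
```

The loop calls the generator only `len(betas)` times, and every step operates on the compressed latent rather than on pixels, which is where the speed-up described above would come from. The decoder runs once at the end.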