Variational Diffusion Auto-encoder: Latent Space Extraction from Pre-trained Diffusion Models

Georgios Batzolis, Jan Stanczuk, Carola-Bibiane Schönlieb
2023-05-19
Abstract: As a widely recognized approach to deep generative modeling, Variational Auto-Encoders (VAEs) still face challenges with the quality of generated images, often presenting noticeable blurriness. This issue stems from the unrealistic assumption that approximates the conditional data distribution, $p(\textbf{x} | \textbf{z})$, as an isotropic Gaussian. In this paper, we propose a novel solution to address these issues. We illustrate how one can extract a latent space from a pre-existing diffusion model by optimizing an encoder to maximize the marginal data log-likelihood. Furthermore, we demonstrate that a decoder can be analytically derived post encoder-training, employing the Bayes rule for scores. This leads to a VAE-esque deep latent variable model, which discards the need for Gaussian assumptions on $p(\textbf{x} | \textbf{z})$ or the training of a separate decoder network. Our method, which capitalizes on the strengths of pre-trained diffusion models and equips them with latent spaces, results in a significant enhancement to the performance of VAEs.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the blurriness of images generated by Variational Auto-Encoders (VAEs). Traditional VAEs model the conditional data distribution \( p(x|z) \) as an isotropic Gaussian, an unrealistic assumption that leads to low-quality, often blurry images. In addition, VAEs typically output the mean of this Gaussian rather than sampling from it, which exacerbates the problem. To overcome these limitations, the authors extract a latent space from a pre-trained diffusion model: an encoder is trained to maximize the marginal data log-likelihood, and the decoder is then derived analytically using Bayes' rule for scores. The resulting model of \( p(x|z) \) is more flexible, avoids the Gaussian assumption, and requires no separately trained decoder network. By building on existing pre-trained diffusion models, the approach significantly improves on the performance of VAEs and produces sharper images. Furthermore, the method separates the training of the prior model from that of the encoder network, which improves the stability of training.
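The "Bayes rule for scores" mentioned in the abstract can be illustrated on a toy problem. Taking the gradient of \( \log p(x|z) = \log p(z|x) + \log p(x) - \log p(z) \) with respect to \( x \) removes the last term, so the conditional score is the sum of the encoder (likelihood) score and the prior score. The sketch below checks this identity numerically for a one-dimensional case in which everything is known in closed form. It is not the authors' implementation: the standard normal prior stands in for the pre-trained diffusion model, the linear-Gaussian encoder \( z|x \sim \mathcal{N}(a x, s^2) \) stands in for the learned encoder, and all function names and parameters are hypothetical.

```python
# Toy check of Bayes' rule for scores:
#   d/dx log p(x|z) = d/dx log p(z|x) + d/dx log p(x)
# Assumptions (not from the paper): prior x ~ N(0, 1),
# encoder z|x ~ N(a*x, s^2), all quantities scalar.

def prior_score(x):
    # Score of N(0, 1): d/dx log p(x) = -x
    return -x

def encoder_score(x, z, a, s):
    # d/dx log N(z; a*x, s^2) = a * (z - a*x) / s^2
    return a * (z - a * x) / s**2

def conditional_score_via_bayes(x, z, a, s):
    # Bayes' rule for scores: the p(z) term has no x-dependence and drops out.
    return prior_score(x) + encoder_score(x, z, a, s)

def conditional_score_closed_form(x, z, a, s):
    # For this linear-Gaussian model the posterior p(x|z) is Gaussian with
    # variance 1 / (1 + a^2/s^2) and mean variance * a * z / s^2.
    var = 1.0 / (1.0 + a**2 / s**2)
    mean = var * a * z / s**2
    return -(x - mean) / var

if __name__ == "__main__":
    x, z, a, s = 0.3, 1.2, 2.0, 0.5
    print(conditional_score_via_bayes(x, z, a, s))
    print(conditional_score_closed_form(x, z, a, s))
```

The two functions agree up to floating-point error, which is the point of the identity: a conditional score can be assembled from an unconditional score model and an encoder, without training a separate decoder. In the diffusion setting the same decomposition would presumably be applied to noise-perturbed samples at each noise level, with the score network supplying the prior term; that detail is an assumption here rather than something stated in the text above.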