$ε$-VAE: Denoising as Visual Decoding

Long Zhao,Sanghyun Woo,Ziyu Wan,Yandong Li,Han Zhang,Boqing Gong,Hartwig Adam,Xuhui Jia,Ting Liu
2024-10-05
Abstract:In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space. For high-dimensional visual data, it reduces redundancy and emphasizes key features for high-quality generation. Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input. In this work, we offer a new perspective by proposing denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder. We evaluate our approach by assessing both reconstruction (rFID) and generation quality (FID), comparing it to state-of-the-art autoencoding approach. We hope this work offers new insights into integrating iterative generation and autoencoding for improved compression and generation.
Computer Vision and Pattern Recognition,Artificial Intelligence,Image and Video Processing
What problem does this paper attempt to address?
The problem this paper attempts to address is the inadequacy of current visual generative models' autoencoders in terms of compression and reconstruction quality of high-dimensional visual data. Specifically, existing visual tokenization methods rely on traditional autoencoder frameworks, where the encoder compresses the data into latent representations, and the decoder reconstructs the original input in one go. While this approach is effective, there is still room for improvement in terms of compression rate and reconstruction quality. To improve this, the authors propose a new perspective, treating denoising as part of the decoding process rather than a traditional one-step reconstruction. Specifically, they replace the traditional decoder with a diffusion process that iteratively refines the noise to recover the original image, guided by the latent representations provided by the encoder. This method aims to achieve higher compression rates and better generative quality by combining iterative generation and autoencoding. The main contributions include: 1. Introducing a new approach that redefines the autoencoder's decoding process as a conditional denoising problem. 2. Proposing a series of key design choices to optimize performance. 3. Demonstrating through extensive controlled experiments that this method outperforms existing visual autoencoder paradigms in terms of reconstruction and generative quality. Overall, this paper aims to improve the compression rate and generative quality of visual generative models by introducing an iterative denoising decoding process.