$ε$-VAE: Denoising as Visual Decoding

Long Zhao, Sanghyun Woo, Ziyu Wan, Yandong Li, Han Zhang, Boqing Gong, Hartwig Adam, Xuhui Jia, Ting Liu

2024-10-05

Abstract: In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space. For high-dimensional visual data, it reduces redundancy and emphasizes key features for high-quality generation. Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input. In this work, we offer a new perspective by proposing denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder. We evaluate our approach by assessing both reconstruction (rFID) and generation quality (FID), comparing it to state-of-the-art autoencoding approaches. We hope this work offers new insights into integrating iterative generation and autoencoding for improved compression and generation.

Computer Vision and Pattern Recognition, Artificial Intelligence, Image and Video Processing

What problem does this paper attempt to address?

This paper addresses a limitation of the autoencoders used in current visual generative models: the compression and reconstruction quality they achieve on high-dimensional visual data. Existing visual tokenization methods rely on a traditional autoencoder framework, in which the encoder compresses the data into latent representations and the decoder reconstructs the original input in a single step. This approach is effective, but it leaves room for improvement in both compression rate and reconstruction quality. The authors therefore propose treating decoding as denoising rather than as one-step reconstruction: they replace the traditional decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latent representations provided by the encoder. By combining iterative generation with autoencoding, the method aims for higher compression rates and better generation quality. The main contributions are:

1. Introducing an approach that reformulates the autoencoder's decoding process as a conditional denoising problem.
2. Proposing a series of key design choices to optimize performance.
3. Demonstrating through extensive controlled experiments that the method outperforms existing visual autoencoder paradigms in reconstruction and generation quality.
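The idea of decoding by iterative denoising can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the encoder is a fixed linear map with orthonormal rows, the "denoiser" is a hand-written closed-form function standing in for a trained network, and the sampler is a plain Euler loop over a straight-line path from noise to data. The names `make_encoder`, `toy_denoiser`, and `decode_by_denoising` are hypothetical.

```python
import numpy as np


def make_encoder(dim, latent_dim, rng):
    """Hypothetical stand-in encoder: a linear map with orthonormal rows."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, latent_dim)))
    return q.T  # shape (latent_dim, dim)


def toy_denoiser(x_t, t, z, enc):
    """Stand-in for a trained conditional denoiser (assumption, not the paper's model).

    It forms a clean-image estimate from the latent z and returns the
    velocity that moves the current sample x_t toward that estimate.
    """
    x_hat = enc.T @ z
    return (x_hat - x_t) / (1.0 - t)


def decode_by_denoising(z, enc, num_steps, rng):
    """Start from pure noise and refine it step by step, conditioned on z."""
    x_t = rng.standard_normal(enc.shape[1])
    dt = 1.0 / num_steps
    trajectory = [x_t.copy()]
    for n in range(num_steps):
        t = n * dt  # t runs from 0 (noise) toward 1 (data)
        x_t = x_t + dt * toy_denoiser(x_t, t, z, enc)
        trajectory.append(x_t.copy())
    return x_t, trajectory


rng = np.random.default_rng(0)
enc = make_encoder(dim=16, latent_dim=4, rng=rng)

# Toy "image" that lies in the subspace the encoder can represent.
x = enc.T @ rng.standard_normal(4)
z = enc @ x  # compact latent produced by the encoder

recon, trajectory = decode_by_denoising(z, enc, num_steps=8, rng=rng)
errors = [float(np.linalg.norm(step - x)) for step in trajectory]
print(errors)
```

In this toy setting the distance to the original input shrinks at every step, which mirrors the shift from single-step reconstruction to iterative refinement. A real system would replace `toy_denoiser` with a learned network, and its behavior would depend on training and design choices that this sketch does not model.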
