Unified Auto-Encoding with Masked Diffusion

Philippe Hansen-Estruch, Sriram Vishwanath, Amy Zhang, Manan Tomar

2024-06-26

Abstract: At the core of both successful generative and self-supervised representation learning models there is a reconstruction objective that incorporates some form of image corruption. Diffusion models implement this approach through a scheduled Gaussian corruption process, while masked auto-encoder models do so by masking patches of the image. Despite their different approaches, the underlying similarity in their methodologies suggests a promising avenue for an auto-encoder capable of both de-noising tasks. We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD), that combines patch-based and noise-based corruption techniques within a single auto-encoding framework. Specifically, UMD modifies the diffusion transformer (DiT) training process by introducing an additional noise-free, high masking representation step in the diffusion noising schedule, and utilizes a mixed masked and noised image for subsequent timesteps. By integrating features useful for diffusion modeling and for predicting masked patch tokens, UMD achieves strong performance in downstream generative and representation learning tasks, including linear probing and class-conditional generation. This is achieved without the need for heavy data augmentations, multiple views, or additional encoders. Furthermore, UMD improves over the computational efficiency of prior diffusion-based methods in total training time. We release our code at https://github.com/philippe-eecs/small-vision.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the lack of a unified framework for generative modeling and self-supervised representation learning. Diffusion models and masked autoencoders each perform well in their own domain, and both are trained with a reconstruction objective over corrupted images, yet a single auto-encoder that handles both kinds of de-noising is not well developed. The paper proposes Unified Masked Diffusion (UMD), which combines patch-based (masking) and noise-based (Gaussian) corruption within a single auto-encoding framework. UMD modifies the training process of the Diffusion Transformer (DiT) by adding a noise-free, high-masking step to the diffusion noising schedule and by using mixed masked and noised images for the subsequent timesteps. The authors report that the resulting model learns representations that perform strongly under linear probing and that it also performs strongly in class-conditional generation. UMD does not require heavy data augmentations, multiple views, or additional encoders, and the authors report that it improves on the total training time of prior diffusion-based methods.
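The corruption scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `alpha_bar` and `corrupt`, the two masking ratios (0.75 and 0.5), and the simplified cosine noise schedule are all assumptions made for illustration, since the abstract does not state these details. The sketch only shows the idea that timestep 0 is noise-free with heavy masking, while later timesteps combine masking with Gaussian noise on the visible patches.

```python
import math
import random


def alpha_bar(t, num_steps):
    """Cumulative signal level in [0, 1].

    Simplified cosine schedule, chosen here as an assumption;
    it equals 1 at t = 0, so no noise is added at that step.
    """
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def corrupt(patches, t, num_steps, rng,
            mask_ratio_clean=0.75, mask_ratio_noisy=0.5):
    """Return (corrupted_patches, mask) for timestep t.

    patches: list of patches, each a list of floats.
    mask[i] is True when patch i is hidden from the model.
    Both masking ratios are hypothetical values for illustration.
    """
    n = len(patches)
    # t == 0 plays the role of the extra noise-free, high-masking step.
    ratio = mask_ratio_clean if t == 0 else mask_ratio_noisy
    num_masked = int(round(ratio * n))
    masked_idx = set(rng.sample(range(n), num_masked))
    mask = [i in masked_idx for i in range(n)]

    # Standard forward-diffusion mix: sqrt(a) * x + sqrt(1 - a) * eps.
    a = alpha_bar(t, num_steps)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)

    corrupted = []
    for i, patch in enumerate(patches):
        if mask[i]:
            corrupted.append(None)  # masked patch is dropped from the input
        else:
            corrupted.append(
                [signal * x + noise * rng.gauss(0.0, 1.0) for x in patch]
            )
    return corrupted, mask


if __name__ == "__main__":
    rng = random.Random(0)
    patches = [[float(i)] * 4 for i in range(16)]  # 16 toy patches
    _, clean_mask = corrupt(patches, t=0, num_steps=1000, rng=rng)
    _, noisy_mask = corrupt(patches, t=500, num_steps=1000, rng=rng)
    print(sum(clean_mask), "patches masked at t=0")
    print(sum(noisy_mask), "patches masked at t=500")
```

In a training loop, a model would presumably receive the visible patches together with the timestep and be asked to reconstruct the clean image; how UMD weights or parameterizes that loss is not specified in the text above, so it is left out of the sketch.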

Masked Autoencoders for Point Cloud Self-supervised Learning

Diffusion Models as Masked Autoencoders

Unified Generation, Reconstruction, and Representation: Generalized Diffusion with Adaptive Latent Encoding-Decoding

Simplified and Generalized Masked Diffusion for Discrete Data

LMD: Faster Image Reconstruction with Latent Masking Diffusion

Denoising Autoregressive Representation Learning

Simple and Effective Masked Diffusion Language Models

UGMAE: A Unified Framework for Graph Masked Autoencoders

UDPM: Upsampling Diffusion Probabilistic Models

Truncated Diffusion Probabilistic Models and Diffusion-based Adversarial Auto-Encoders

Diffusion Autoencoders: Toward a Meaningful and Decodable Representation

Diffusion-Based Representation Learning

Fast Training of Diffusion Models with Masked Transformers

Think While You Generate: Discrete Diffusion with Planned Denoising

GUD: Generation with Unified Diffusion

Unsupervised Representation Learning from Pre-trained Diffusion Probabilistic Models

Masked Diffusion Models Are Fast Distribution Learners

Unified Directly Denoising for Both Variance Preserving and Variance Exploding Diffusion Models

[MASK] is All You Need

Improving Masked Autoencoders by Learning Where to Mask