Abstract: While diffusion models excel at generating high-quality images, prior work reports a significant performance gap between diffusion and autoregressive (AR) methods in language modeling. In this work, we show that simple masked discrete diffusion is more performant than previously thought. We apply an effective training recipe that improves the performance of masked diffusion models and derive a simplified, Rao-Blackwellized objective that results in additional improvements. Our objective has a simple form -- it is a mixture of classical masked language modeling losses -- and can be used to train encoder-only language models that admit efficient samplers, including ones that can generate arbitrary lengths of text semi-autoregressively like a traditional language model. On language modeling benchmarks, a range of masked diffusion models trained with modern engineering practices achieves a new state-of-the-art among diffusion models, and approaches AR perplexity. We release our code at: https://github.com/kuleshov-group/mdlm

What problem does this paper attempt to address?

This paper aims to narrow the performance gap between diffusion and autoregressive (AR) methods in language modeling using a masked diffusion language model (MDLM). The authors show that simple masked discrete diffusion, trained with an effective recipe, is more performant than previously thought. They present a well-engineered implementation of MDLM that improves the log-likelihood of discrete diffusion, and they derive a tighter continuous-time evidence lower bound (ELBO) through a substitution-based parameterization (SUBS), which improves performance further. The resulting objective is a weighted average of classical masked language modeling losses, so it can train BERT-style encoder-only models that also have generative capabilities. The paper also introduces a fast sampler that supports semi-autoregressive generation and outperforms previous semi-autoregressive models. On language modeling benchmarks, the masked diffusion models achieve a new state of the art among diffusion models and approach AR perplexity. The approach also extends to non-language domains such as biological sequence modeling, where pre-trained DNA sequence models perform comparably to or better than classical BERT-style training while adding generative capabilities.

In summary, the main contributions of the paper are:

1. A simplified masked diffusion language model framework that surpasses existing diffusion models on various language modeling benchmarks.
2. A substitution-based parameterization (SUBS) that tightens the continuous-time ELBO, reduces variance, and further improves performance.
3. A fast sampler that supports semi-autoregressive generation and outperforms previous semi-autoregressive models.

Experimental results show that these improvements significantly enhance model performance, including for simple baselines that were previously considered to underperform.
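
The description of the objective as a weighted average of masked language modeling losses can be illustrated with a short sketch. This is not the authors' implementation (see the released code for that). It assumes a log-linear noise schedule in which each token is masked independently with probability t, so the per-sample weight becomes 1/t. The names `MASK_ID`, `subs_log_probs`, `masked_diffusion_loss`, and the `model` callable are hypothetical, introduced only for illustration.

```python
import numpy as np

MASK_ID = 0  # hypothetical: vocabulary index reserved for the [MASK] token


def subs_log_probs(logits, z_t):
    """SUBS-style output layer (a sketch): the model never predicts [MASK],
    and tokens that are already unmasked are copied through unchanged."""
    logits = logits.astype(np.float64).copy()
    logits[..., MASK_ID] = -np.inf  # zero probability on the mask token
    m = logits.max(axis=-1, keepdims=True)
    log_p = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    unmasked = z_t != MASK_ID
    log_p[unmasked] = -np.inf  # unmasked rows become one-hot ...
    log_p[unmasked, z_t[unmasked]] = 0.0  # ... on the observed token
    return log_p


def masked_diffusion_loss(x, model, rng, eps=1e-3):
    """One-sample Monte Carlo estimate of a weighted masked-LM loss.

    x     : int array of shape (L,), clean tokens, none equal to MASK_ID
    model : callable mapping a noised sequence (L,) to logits (L, V)
    Assumption: masking probability at time t is t, so the weight is 1/t.
    """
    t = rng.uniform(eps, 1.0)  # sample a noise level
    mask = rng.random(x.shape) < t  # mask each token with probability t
    z_t = np.where(mask, MASK_ID, x)
    log_p = subs_log_probs(model(z_t), z_t)
    token_ll = log_p[np.arange(len(x)), x]  # log-likelihood of the true tokens
    # Cross-entropy on masked positions only, reweighted by 1/t.
    return -(token_ll * mask).sum() / t


# Usage with a toy "model" that outputs uniform logits over 8 tokens.
rng = np.random.default_rng(0)
x = np.array([3, 5, 1, 7, 2])
uniform_model = lambda z: np.zeros((len(z), 8))
loss = masked_diffusion_loss(x, uniform_model, rng)
```

Averaging this quantity over many sampled noise levels and sequences gives a mixture of masked language modeling losses at different masking rates, which is the sense in which a BERT-style encoder can be trained with it.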

Simple and Effective Masked Diffusion Language Models
