Simple and Effective Masked Diffusion Language Models

Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, Volodymyr Kuleshov
2024-06-12
Abstract: While diffusion models excel at generating high-quality images, prior work reports a significant performance gap between diffusion and autoregressive (AR) methods in language modeling. In this work, we show that simple masked discrete diffusion is more performant than previously thought. We apply an effective training recipe that improves the performance of masked diffusion models and derive a simplified, Rao-Blackwellized objective that results in additional improvements. Our objective has a simple form -- it is a mixture of classical masked language modeling losses -- and can be used to train encoder-only language models that admit efficient samplers, including ones that can generate arbitrary lengths of text semi-autoregressively like a traditional language model. On language modeling benchmarks, a range of masked diffusion models trained with modern engineering practices achieves a new state-of-the-art among diffusion models, and approaches AR perplexity. We release our code at: https://github.com/kuleshov-group/mdlm
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper examines how to narrow the gap between diffusion-based and autoregressive (AR) language models using a Masked Diffusion Language Model (MDLM). The authors find that simple masked discrete diffusion, when trained with an effective recipe, is more performant than previously thought. They propose a well-engineered implementation of MDLM that improves the log-likelihood of discrete diffusion models, and they derive a simplified, Rao-Blackwellized continuous-time objective through a substitution-based parameterization (SUBS), which tightens the evidence lower bound (ELBO) and further improves performance. The objective is a weighted average of classical masked language modeling losses, so it can be used to train BERT-style encoder-only models and give them generative capabilities. The paper also introduces efficient samplers, including one that generates text of arbitrary length semi-autoregressively and outperforms previous semi-autoregressive models. On language modeling benchmarks, the masked diffusion models achieve a new state of the art among diffusion models and approach the perplexity of AR models. The approach also extends to non-linguistic domains such as biological sequence modeling, where pre-trained DNA sequence models perform comparably to or better than classical BERT-style training while adding generative capabilities. In summary, the main contributions of the paper are:
1. A simplified masked diffusion language model framework that surpasses existing diffusion models on various language modeling benchmarks.
2. A substitution-based parameterization (SUBS) that tightens the continuous-time ELBO, reduces variance, and further improves performance.
3. A fast sampler that supports semi-autoregressive generation and outperforms previous semi-autoregressive models.
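To make the idea of a "weighted average of masked language modeling losses" concrete, here is a minimal NumPy sketch. It is not taken from the released code. It assumes a linear noise schedule in which each token is masked independently with probability t, under which the per-sample weight on the masked-token cross-entropy works out to 1/t. The names `MASK_ID`, `mask_tokens`, `weighted_mlm_loss`, and `estimate_nelbo` are hypothetical, as is the `denoiser` callable, which stands in for any encoder that maps a masked sequence to per-position logits.

```python
import numpy as np

MASK_ID = 0  # hypothetical: vocabulary id reserved for the [MASK] token


def mask_tokens(x, t, rng):
    """Replace each token with MASK_ID independently with probability t."""
    masked = rng.random(x.shape) < t
    return np.where(masked, MASK_ID, x), masked


def weighted_mlm_loss(logits, x, masked, t):
    """Masked-token cross-entropy scaled by 1/t (assumed linear schedule).

    logits: (L, V) scores from a denoiser; x: (L,) clean ids, none equal
    to MASK_ID; masked: (L,) boolean mask of corrupted positions.
    """
    logits = logits.astype(float).copy()
    logits[:, MASK_ID] = -np.inf  # the model never predicts [MASK] as clean
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    token_nll = -log_probs[np.arange(len(x)), x]
    # Only masked positions contribute; unmasked ones are simply carried over.
    return float(np.where(masked, token_nll, 0.0).sum() / t)


def estimate_nelbo(x, denoiser, n_samples, rng):
    """Monte Carlo average of the weighted loss over random noise levels."""
    losses = []
    for _ in range(n_samples):
        t = rng.uniform(1e-3, 1.0)  # avoid t = 0, where the weight diverges
        z, masked = mask_tokens(x, t, rng)
        losses.append(weighted_mlm_loss(denoiser(z), x, masked, t))
    return float(np.mean(losses))
```

Each draw of t yields an ordinary masked language modeling loss at a different masking rate. Averaging over t with the 1/t weight is what turns a BERT-style objective into a likelihood bound under the stated assumptions.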
Experimental results show that these improvements substantially enhance model performance, including for simple baseline models that were previously considered to underperform.
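The sampler described in the summary can be illustrated with a small sketch of iterative unmasking. This is an illustration under stated assumptions, not the paper's implementation. It assumes the same linear schedule as a simple choice, so a masked token is revealed with probability (t - s) / t when stepping from noise level t to s. Tokens that are already revealed are never changed. The names `MASK_ID`, `unmask_step`, and `sample` are hypothetical, and `denoiser` is any callable that returns non-negative per-position scores over the vocabulary.

```python
import numpy as np

MASK_ID = 0  # hypothetical: vocabulary id reserved for the [MASK] token


def unmask_step(z, probs, t, s, rng):
    """One reverse step from noise level t down to s < t.

    z: (L,) current token ids; probs: (L, V) non-negative scores over
    clean tokens from a denoiser.
    """
    probs = probs.astype(float).copy()
    probs[:, MASK_ID] = 0.0  # a revealed token is never [MASK]
    probs /= probs.sum(axis=1, keepdims=True)
    # Assumed linear schedule: reveal a masked token w.p. (t - s) / t.
    reveal = (z == MASK_ID) & (rng.random(z.shape) < (t - s) / t)
    out = z.copy()
    for i in np.flatnonzero(reveal):
        out[i] = rng.choice(probs.shape[1], p=probs[i])
    return out  # positions that were already revealed stay untouched


def sample(denoiser, z, n_steps, rng):
    """Run n_steps reverse steps from t = 1 to t = 0, starting from z."""
    times = np.linspace(1.0, 0.0, n_steps + 1)
    for t, s in zip(times[:-1], times[1:]):
        z = unmask_step(z, denoiser(z), t, s, rng)
    return z
```

Starting from an all-mask sequence gives unconditional generation. Because revealed tokens are carried over unchanged, one simple way to mimic semi-autoregressive generation is to append a fresh block of `MASK_ID` tokens to an existing prefix and call `sample` again. That continuation scheme is an assumption made for this sketch.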