Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data

Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, Chongxuan Li
2024-07-06
Abstract: Discrete diffusion models with absorbing processes have shown promise in language modeling. The key quantities to be estimated are the ratios between the marginal probabilities of two transitive states at all timesteps, called the concrete score. In this paper, we reveal that the concrete score in absorbing diffusion can be expressed as conditional probabilities of clean data, multiplied by a time-dependent scalar in an analytic form. Motivated by this finding, we propose reparameterized absorbing discrete diffusion (RADD), a dedicated diffusion model without time-condition that characterizes the time-independent conditional probabilities. Besides its simplicity, RADD can reduce the number of function evaluations (NFEs) by caching the output of the time-independent network when the noisy sample remains unchanged in a sampling interval. Empirically, RADD is up to 3.5 times faster while achieving similar performance with the strongest baseline. Built upon the new perspective of conditional distributions, we further unify absorbing discrete diffusion and any-order autoregressive models (AO-ARMs), showing that the upper bound on the negative log-likelihood for the diffusion model can be interpreted as an expected negative log-likelihood for AO-ARMs. Further, our RADD models achieve SOTA performance among diffusion models on 5 zero-shot language modeling benchmarks (measured by perplexity) at the GPT-2 scale. Our code is available at https://github.com/ML-GSAI/RADD.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The main problem this paper addresses is improving the efficiency and performance of discrete diffusion models for language modeling. Specifically, by reparameterizing absorbing discrete diffusion, the paper proposes a new model, reparameterized absorbing discrete diffusion (RADD). RADD simplifies the network by removing the time condition and reduces the number of function evaluations (NFEs) by caching network outputs, which improves sampling efficiency. The paper also reveals a theoretical connection between absorbing discrete diffusion and any-order autoregressive models (AO-ARMs), unifying the training objectives of the two model families.

### Main Contributions

1. **In-depth Understanding of Discrete Diffusion Models**:
   - The paper shows that the concrete score in absorbing diffusion can be expressed as a conditional probability of clean data multiplied by a time-dependent scalar that has an analytic form. This finding explains why the "scaling trick" in existing work is effective and points to a better way of optimizing the model.
   - Based on this finding, the paper proposes RADD, which drops the time condition used by existing models.
2. **Simpler Parameterization**:
   - Because the time dependence is carried entirely by the analytic scalar, the RADD network only needs to model the time-independent conditional probabilities, which simplifies its structure.
3. **Efficient Sampling**:
   - Since the network output does not depend on time, RADD can cache it and reuse it whenever the noisy sample is unchanged over a sampling interval. This reduces the NFEs and speeds up sampling.
4. **Enhanced Zero-Shot Language Modeling Performance**:
   - RADD achieves state-of-the-art performance among diffusion models (measured by perplexity) on five zero-shot language modeling benchmarks at the GPT-2 scale.
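The first contribution can be checked numerically on a toy example. The sketch below is not taken from the paper's code; it assumes the standard absorbing forward process in which each token is independently replaced by a mask with probability `1 - exp(-sigma_bar)`, where `sigma_bar` denotes the accumulated noise level at the chosen time. The joint distribution `p0`, the `MASK` symbol, and the function name `marginal` are all hypothetical and chosen only for illustration. Under that assumption, the ratio of noisy marginals between a sequence with a masked position and the same sequence with that position filled in should equal `exp(-sigma_bar) / (1 - exp(-sigma_bar))` times the clean-data conditional of the filled-in token given the unmasked tokens.

```python
import math

MASK = "M"  # hypothetical symbol for the absorbing (masked) state

# Hypothetical clean-data distribution over sequences of length 2, vocabulary {0, 1}.
p0 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.2}


def marginal(x, sigma_bar):
    """Marginal probability of noisy sequence x under the absorbing process.

    Assumption: each token is kept with probability exp(-sigma_bar) and
    masked with probability 1 - exp(-sigma_bar), independently per position.
    """
    keep = math.exp(-sigma_bar)
    total = 0.0
    for x0, prob in p0.items():
        likelihood = 1.0
        for xi, x0i in zip(x, x0):
            if xi == MASK:
                likelihood *= 1.0 - keep   # token was absorbed
            elif xi == x0i:
                likelihood *= keep         # token survived unchanged
            else:
                likelihood = 0.0           # an unmasked token cannot change value
        total += prob * likelihood
    return total


sigma_bar = 0.7
x = (MASK, 1)    # noisy state: position 0 masked, position 1 shows token 1
x_hat = (0, 1)   # same state with position 0 replaced by token 0

# Concrete score: ratio of marginals of the two states at the same noise level.
concrete_score = marginal(x_hat, sigma_bar) / marginal(x, sigma_bar)

# Clean-data conditional p0(first token = 0 | second token = 1).
conditional = p0[(0, 1)] / (p0[(0, 1)] + p0[(1, 1)])

# Time-dependent scalar in analytic form.
scalar = math.exp(-sigma_bar) / (1.0 - math.exp(-sigma_bar))

print(concrete_score, scalar * conditional)
```

The two printed values agree up to floating-point error for any positive `sigma_bar`, which is the sense in which a network that outputs only the conditional probabilities needs no time input.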
### Experimental Results

- **Efficient Sampling**:
  - As shown in Figure 2, RADD with the caching strategy is more efficient than SEDD, and the gain grows as the number of sampling steps increases.
- **Improved Zero-Shot Perplexity**:
  - Table 1 reports the zero-shot perplexity of RADD on the LAMBADA, WikiText2, PTB, WikiText103, and 1 Billion Words datasets. RADD outperforms SEDD on multiple tasks, especially when trained with the t-DCE and AO loss functions.

### Theoretical Contributions

- **Unifying Absorbing Discrete Diffusion Models and Any-Order Autoregressive Models**:
  - The paper proves that the training objective of absorbing discrete diffusion is equivalent to that of any-order autoregressive models when the final total noise level tends to infinity. This result provides a theoretical basis for connecting the two model families.

### Related Work

- **Continuous-State Diffusion Models**:
  - Some works apply continuous diffusion to text generation, but these models usually lag behind autoregressive models on standard text generation tasks.
- **Discrete-State Diffusion Models**:
  - Several discrete-state diffusion models have been proposed, such as D3PM and DiffusionBERT. They perform well on some tasks, but there is still room for improvement on standard text generation.

In general, through theoretical analysis and experimental verification, the paper proposes RADD, a more efficient and simpler discrete diffusion model, as a new approach to language modeling.
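The caching idea behind the NFE reduction can be illustrated with a toy sampler. This is a minimal sketch under stated assumptions, not the paper's implementation: `toy_network` is a hypothetical stand-in that returns uniform conditionals, the masking probability is assumed to be linear in time (so a masked token is unmasked with probability `(t - s) / t` when stepping from `t` to `s`), and all names are illustrative. The point is only the control flow: because the network takes no time input, its output is recomputed only after the sample has changed.

```python
import random

MASK = -1  # hypothetical id for the absorbing [MASK] state


def toy_network(x, vocab_size):
    """Stand-in for a time-independent network (assumption: uniform output).

    A real model would return, per position, the conditional distribution
    of the clean token given the unmasked tokens in x.
    """
    return [[1.0 / vocab_size] * vocab_size for _ in x]


def sample_with_cache(seq_len=8, vocab_size=5, num_steps=64, seed=0):
    rng = random.Random(seed)
    x = [MASK] * seq_len   # start from the fully masked sequence
    cached_probs = None    # network output for the current x, if still valid
    nfe = 0                # number of network evaluations actually performed

    for k in range(num_steps, 0, -1):
        t, s = k / num_steps, (k - 1) / num_steps
        if cached_probs is None:
            cached_probs = toy_network(x, vocab_size)
            nfe += 1

        # Assumed linear schedule: a masked token is revealed with prob (t - s) / t.
        unmask_prob = (t - s) / t
        changed = False
        for i in range(seq_len):
            if x[i] == MASK and rng.random() < unmask_prob:
                x[i] = rng.choices(range(vocab_size), weights=cached_probs[i])[0]
                changed = True

        if changed:
            cached_probs = None  # x changed, so the cached output is stale
        if MASK not in x:
            break                # nothing left to denoise

    return x, nfe


tokens, nfe = sample_with_cache()
print(tokens, nfe)
```

In this sketch the network is evaluated once per step in which at least one token is revealed, so the NFE count is bounded by the sequence length no matter how many sampling steps are used. A time-conditioned network could not reuse its output this way, since its input would change at every step even when the sample does not.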