Think While You Generate: Discrete Diffusion with Planned Denoising

Sulin Liu, Juno Nam, Andrew Campbell, Hannes Stärk, Yilun Xu, Tommi Jaakkola, Rafael Gómez-Bombarelli
2024-10-09
Abstract: Discrete diffusion has achieved state-of-the-art performance, outperforming or approaching autoregressive models on standard benchmarks. In this work, we introduce Discrete Diffusion with Planned Denoising (DDPD), a novel framework that separates the generation process into two models: a planner and a denoiser. At inference time, the planner selects which positions to denoise next by identifying the most corrupted positions in need of denoising, including both initially corrupted and those requiring additional refinement. This plan-and-denoise approach enables more efficient reconstruction during generation by iteratively identifying and denoising corruptions in the optimal order. DDPD outperforms traditional denoiser-only mask diffusion methods, achieving superior results on language modeling benchmarks such as text8, OpenWebText, and token-based generation on ImageNet $256 \times 256$. Notably, in language modeling, DDPD significantly reduces the performance gap between diffusion-based and autoregressive methods in terms of generative perplexity. Code is available at <a class="link-external link-https" href="https://github.com/liusulin/DDPD" rel="external noopener nofollow">this https URL</a>.
Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper targets the performance bottleneck of discrete diffusion models in generation tasks, and in particular the gap between discrete diffusion models and autoregressive models. The authors propose a new framework, Discrete Diffusion with Planned Denoising (DDPD), to improve the quality and efficiency of discrete data generation.

### Main problems

1. **Performance gap**: Discrete diffusion models still lag behind autoregressive models on some tasks, especially in terms of generative perplexity.
2. **Sampling efficiency**: Traditional discrete diffusion methods rely on a single denoising model and cannot flexibly choose which positions to denoise, which makes sampling inefficient.
3. **Error-correction ability**: In existing methods, once a token has been filled in it is difficult to revise, so early mistakes persist and degrade generation quality.

### Solutions

To address these problems, DDPD introduces two key components:

- **Planner**: Identifies the positions in the current sequence that are most likely to be corrupted and decides where to denoise next.
- **Denoiser**: Predicts and repairs the values at the positions the planner selects.

By decomposing generation into a planning stage and a denoising stage, DDPD can reconstruct the sequence more efficiently, identifying and correcting errors step by step, which improves both generation quality and efficiency.

### Specific contributions

1. **New framework**: Introduces the DDPD framework, which divides the generation process into two parts: planning and denoising.
2. **Adaptive sampling scheme**: Uses the planner's output to drive an adaptive sampling algorithm that can continually self-correct errors.
3. **Simplified training objective**: Derives a simple and effective training objective that trains the planner and the denoiser separately, based on maximizing the evidence lower bound (ELBO) of the discrete diffusion process.
4. **Experimental verification**: On GPT-2-scale language modeling and 256×256 image token generation, DDPD significantly outperforms existing mask diffusion methods and can significantly improve generation quality even when paired with a smaller or weaker denoiser.

### Conclusion

By introducing a planning mechanism, DDPD improves the generation quality of discrete diffusion models and addresses the low sampling efficiency and limited error-correction ability of traditional methods. The framework offers a new approach to discrete data generation and is expected to further narrow the performance gap between discrete diffusion models and autoregressive models.
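
The plan-and-denoise loop summarized above can be illustrated with a minimal Python sketch. It is not the authors' sampler: it greedily picks the single position the planner scores as most corrupted, whereas the paper describes an adaptive sampling scheme. The names `plan_and_denoise`, `toy_planner`, and `toy_denoiser`, the `threshold` stopping rule, and the toy target sequence are all hypothetical stand-ins for trained networks.

```python
import random
from typing import Callable, List, Sequence

Tokens = List[int]


def plan_and_denoise(
    tokens: Sequence[int],
    planner: Callable[[Tokens], List[float]],
    denoiser: Callable[[Tokens, int], int],
    max_steps: int = 100,
    threshold: float = 0.5,  # assumed stopping rule, not from the paper
) -> Tokens:
    """Repeatedly re-predict the position the planner rates as most corrupted."""
    x = list(tokens)
    if not x:
        return x
    for _ in range(max_steps):
        noise_probs = planner(x)  # per-position estimate of P(corrupted)
        pos = max(range(len(x)), key=noise_probs.__getitem__)
        if noise_probs[pos] < threshold:
            break  # planner sees nothing left to fix
        x[pos] = denoiser(x, pos)  # only the selected position changes
    return x


# --- Toy stand-ins for trained models (illustrative assumptions only) ---
VOCAB = 4
target = [i % VOCAB for i in range(8)]
rng = random.Random(0)


def toy_planner(x: Tokens) -> List[float]:
    # Oracle planner: flags exactly the positions that differ from the target.
    return [1.0 if tok != tgt else 0.0 for tok, tgt in zip(x, target)]


def toy_denoiser(x: Tokens, pos: int) -> int:
    # Imperfect denoiser: correct 80% of the time, otherwise a random token.
    return target[pos] if rng.random() < 0.8 else rng.randrange(VOCAB)


noisy = [rng.randrange(VOCAB) for _ in target]
result = plan_and_denoise(noisy, toy_planner, toy_denoiser)
```

Because the planner is consulted again after every update, a wrong token written by the imperfect denoiser is flagged and re-predicted on a later step. This is the self-correction behavior the summary attributes to DDPD, which a mask-only sampler lacks once a position has been filled.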