Abstract: Recently, diffusion models have garnered significant interest in the field of text processing due to their many potential advantages compared to conventional autoregressive models. In this work, we propose Diffusion-of-Thought (DoT), a novel approach that integrates diffusion models with Chain-of-Thought, a well-established technique for improving the reasoning ability of autoregressive language models. In contrast to autoregressive language models that make decisions in a left-to-right, token-by-token manner, DoT allows reasoning steps to diffuse over time through a diffusion language model and offers greater flexibility in trading off computation for reasoning performance. Our experimental results demonstrate the effectiveness of DoT in multi-digit multiplication, boolean logic, and grade school math problems, with a small diffusion model outperforming a much larger autoregressive model in both efficiency and accuracy. In addition, DoT showcases promising self-correction abilities and benefits from existing reasoning-enhancing techniques like self-consistency decoding. Our findings contribute to the understanding and development of reasoning with diffusion language models.

What problem does this paper attempt to address?

The paper addresses how to integrate Chain-of-Thought (CoT) reasoning into diffusion language models to improve their reasoning ability. It introduces Diffusion-of-Thought (DoT), in which reasoning steps diffuse over time through a diffusion language model. Unlike traditional autoregressive language models, which generate output token by token from left to right, DoT offers greater flexibility in trading off computational cost against reasoning performance. The main contributions of DoT include:

1. On simple reasoning tasks (such as multi-digit multiplication and Boolean logic), DoT has an advantage over autoregressive CoT and implicit CoT, achieving up to a 27-fold speedup without a decrease in performance.

2. DoT is adapted to both continuous and discrete diffusion base models, and the paper introduces two training-time sampling algorithms to strengthen its self-correction ability. On grade school math problems, DoT outperforms GPT2 with CoT, enabling small diffusion models to surpass autoregressive models that are 4.6 times larger and demonstrating the potential of text diffusion models for complex reasoning.

3. DoT is flexible in trading reasoning time for performance and demonstrates self-correction capabilities. Self-consistency decoding is also found to further improve DoT and its multi-pass variants.

The paper also discusses the similarities between DoT and the previously proposed implicit CoT methods, which improve the time efficiency of autoregressive CoT generation by learning thoughts in hidden states across transformer layers. DoT, by contrast, generates its reasoning paths across diffusion time steps.

In the experimental section, the paper evaluates DoT on multi-digit multiplication, Boolean logic reasoning, and grade school math problems. The results show that DoT maintains high accuracy on simple tasks while significantly improving efficiency. On complex tasks, performance continues to improve as the number of reasoning steps increases, demonstrating DoT's flexibility across tasks of varying difficulty. The paper also explores self-consistency in DoT, indicating that performance can be significantly improved by generating multiple samples and aggregating them.
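
The idea of generating and aggregating multiple samples can be illustrated with a minimal sketch. It assumes aggregation by a simple majority vote over final answers, which is the common form of self-consistency decoding; the summary does not state the exact aggregation rule used with DoT. The names `majority_vote_answer` and `sample_answer` are hypothetical, and `sample_answer` stands in for one stochastic run of any reasoning model.

```python
from collections import Counter
from typing import Callable, Hashable


def majority_vote_answer(
    sample_answer: Callable[[], Hashable],
    num_samples: int = 5,
) -> Hashable:
    """Return the most frequent final answer over several sampled runs.

    sample_answer is a hypothetical callable that runs the model once
    (one sampled reasoning path) and returns only its final answer.
    """
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")

    # Draw independent samples; each call may follow a different reasoning path.
    answers = [sample_answer() for _ in range(num_samples)]

    # Majority vote (an assumption of this sketch). Ties are resolved in
    # favour of the answer that appeared first among the samples.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


if __name__ == "__main__":
    # Stub in place of a real model: a fixed sequence of sampled answers.
    fake_samples = iter(["42", "41", "42", "42", "40"])
    print(majority_vote_answer(lambda: next(fake_samples), num_samples=5))
```

In this sketch the cost grows linearly with `num_samples`, so the number of samples is another knob for trading computation against accuracy.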

Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models

Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning

Implicit Chain of Thought Reasoning via Knowledge Distillation

Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models

Diffusion Guided Language Modeling

Text Diffusion with Reinforced Conditioning

Understanding Reasoning in Chain-of-Thought from the Hopfieldian View

Logic Diffusion for Knowledge Graph Reasoning

Chain-of-Thought in Large Language Models: Decoding, Projection, and Activation

Are Diffusion Models Vision-And-Language Reasoners?

Chain-of-Thought Augmentation with Logit Contrast for Enhanced Reasoning in Language Models

Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

DiffusionDialog: A Diffusion Model for Diverse Dialog Generation with Latent Space

Measuring Faithfulness in Chain-of-Thought Reasoning

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning Distillation

Can Diffusion Model Achieve Better Performance in Text Generation? Bridging the Gap Between Training and Inference!

TESS: Text-to-Text Self-Conditioned Simplex Diffusion

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models