Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models

Jiacheng Ye, Shansan Gong, Liheng Chen, Lin Zheng, Jiahui Gao, Han Shi, Chuan Wu, Xin Jiang, Zhenguo Li, Wei Bi, Lingpeng Kong
2024-07-15
Abstract: Recently, diffusion models have garnered significant interest in the field of text processing due to their many potential advantages compared to conventional autoregressive models. In this work, we propose Diffusion-of-Thought (DoT), a novel approach that integrates diffusion models with Chain-of-Thought, a well-established technique for improving the reasoning ability of autoregressive language models. In contrast to autoregressive language models that make decisions in a left-to-right, token-by-token manner, DoT allows reasoning steps to diffuse over time through a diffusion language model and offers greater flexibility in trading off computation for reasoning performance. Our experimental results demonstrate the effectiveness of DoT in multi-digit multiplication, boolean logic, and grade school math problems, with a small diffusion model outperforming a much larger autoregressive model in both efficiency and accuracy. In addition to that, DoT showcases promising self-correction abilities and benefits from existing reasoning-enhancing techniques like self-consistency decoding. Our findings contribute to the understanding and development of reasoning with diffusion language models.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to integrate Chain-of-Thought (CoT) techniques into diffusion language models to improve their reasoning capabilities. It introduces a new method called "Diffusion of Thought" (DoT), which allows reasoning steps to diffuse over time through a diffusion language model. Unlike conventional autoregressive language models, which make decisions token by token from left to right, DoT offers greater flexibility in trading off computational cost against reasoning performance.
The main contributions of DoT include:
1. On simple reasoning tasks (such as multi-digit multiplication and Boolean logic), DoT has an advantage over autoregressive CoT and implicit CoT, achieving up to a 27-fold speed increase without a decrease in performance.
2. DoT is adapted to both continuous and discrete diffusion base models, and the paper introduces two training-time sampling algorithms to enhance its self-correction ability. On grade school math problems, DoT outperforms GPT-2 with CoT, enabling small-scale diffusion models to surpass autoregressive models that are 4.6 times larger and demonstrating the potential of text diffusion models in complex reasoning.
3. DoT shows flexibility in the trade-off between reasoning time and performance and demonstrates self-correction capabilities. Moreover, self-consistency decoding is found to further improve DoT and its multi-pass variants.
The paper also discusses the similarities between DoT and the previously proposed implicit CoT methods, which improve the time efficiency of autoregressive CoT generation by learning thoughts in hidden states across transformer layers. DoT pursues a similar efficiency gain by generating reasoning paths across diffusion timesteps.
In the experimental section, the paper evaluates DoT on multi-digit multiplication, Boolean logic reasoning, and more complex grade school math problems. The results show that DoT maintains high accuracy on simple tasks while significantly improving efficiency; on complex tasks, performance can be continuously improved by increasing the number of reasoning steps, demonstrating DoT's flexibility across tasks of varying difficulty. Additionally, the paper explores the application of self-consistency in DoT, indicating that performance can be significantly enhanced by generating and aggregating multiple samples.
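The idea of generating and aggregating multiple samples can be illustrated with a minimal Python sketch of self-consistency as it is commonly described: draw several independent reasoning samples, keep only each sample's final answer, and return the most frequent one. This is an illustration under stated assumptions, not the paper's implementation. The names `majority_vote`, `self_consistency`, and `sample_answer` are hypothetical, and the "model" in the usage example is a stand-in that replays canned answers rather than a diffusion language model.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def self_consistency(
    sample_answer: Callable[[str], str],
    question: str,
    num_samples: int = 5,
) -> str:
    """Draw several samples for one question and aggregate their final answers.

    `sample_answer` is a hypothetical callable that runs one stochastic
    reasoning pass and returns only the final answer as a string.
    """
    answers = [sample_answer(question) for _ in range(num_samples)]
    return majority_vote(answers)


if __name__ == "__main__":
    # Assumption: a stand-in "model" that replays fixed answers, so the
    # example is deterministic. A real sampler would be stochastic.
    canned = iter(["72", "68", "72", "72", "70"])
    print(self_consistency(lambda q: next(canned), "toy question", num_samples=5))
```

In this toy run three of the five samples agree on "72", so the vote returns "72" even though two samples disagree. Exact string matching is the simplest aggregation rule; a practical setup would likely normalize answers (for example, stripping whitespace or parsing numbers) before voting.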