Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, Zhaopeng Tu
Abstract: Modern large language models (LLMs) like ChatGPT have shown remarkable performance on general language tasks but still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. Along this direction, one representative strategy is self-reflection, which asks an LLM to refine the solution with the feedback generated by itself iteratively. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution. Clearly, our MAD framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation. Experiment results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that the adaptive break of debate and the modest level of "tit for tat" state are required for MAD to obtain good performance. Moreover, we find that LLMs might not be a fair judge if different LLMs are used for agents. Code is available at https://github.com/Skytliang/Multi-Agents-Debate.
What problem does this paper attempt to address?
The paper targets the weak performance of large language models (LLMs) on complex reasoning tasks, and in particular the "Degeneration-of-Thought" (DoT) problem in self-reflection methods. Specifically, once an LLM has become confident in an answer, it cannot generate new ideas through self-reflection, even if its initial stance is incorrect. To tackle this issue, the authors propose a Multi-Agent Debate (MAD) framework, which encourages divergent thinking through debates among multiple agents and thereby improves LLM performance on complex reasoning tasks.
### Main Contributions of the Paper:
1. **Definition of the "Degeneration-of-Thought" (DoT) Problem**: The paper identifies and explicitly defines the DoT problem in self-reflection methods.
2. **Proposal of the Multi-Agent Debate (MAD) Framework**: By correcting erroneous initial stances through debates among multiple agents, the framework encourages divergent thinking, thereby improving LLM performance on complex reasoning tasks.
3. **Experimental Validation**: Experiments were conducted on two challenging datasets: Commonsense Machine Translation (Common MT) and Counter-Intuitive Arithmetic Reasoning (Counter-Intuitive AR). Results show that the MAD framework outperforms baseline methods on these tasks, particularly on the Common MT dataset, where GPT-3.5-Turbo combined with MAD can surpass GPT-4's performance.
### Experimental Setup and Results:
- **Datasets**:
  - **Common MT**: Contains Chinese-to-English translation examples, used to evaluate how well a translation model handles lexical, context-independent, and context-dependent syntactic ambiguities.
  - **Counter-Intuitive AR**: Contains 200 counter-intuitive arithmetic reasoning problems, used to evaluate LLM performance on multi-step reasoning tasks.
- **Experimental Methods**:
  - **Baseline Methods**: Include Self-Reflect, Rerank, MAPS, CoT, and Self-Consistency.
  - **MAD Framework**: Uses two debaters and a judge; the final answer is produced through multiple rounds of debate.
- **Experimental Results**:
  - On the Common MT dataset, the MAD framework significantly improved translation quality, especially in human evaluations.
  - On the Counter-Intuitive AR dataset, the MAD framework did not outperform GPT-4, but it significantly outperformed the other baseline methods.
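The debate procedure described above (two debaters, a judge, several rounds, and an early stop once the judge is satisfied) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `run_debate`, `DebateResult`, and `must_decide`, the callable signatures, and the toy stub agents are all assumptions made for the example, and a real system would replace the stubs with LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical interfaces (assumptions, not from the paper's code):
# a debater maps (question, transcript so far) to its next argument;
# a judge maps (question, transcript, must_decide) to a final answer,
# or None when it wants the debate to continue.
Debater = Callable[[str, List[str]], str]
Judge = Callable[[str, List[str], bool], Optional[str]]


@dataclass
class DebateResult:
    answer: str
    rounds_used: int
    transcript: List[str]


def run_debate(question: str, affirmative: Debater, negative: Debater,
               judge: Judge, max_rounds: int = 3) -> DebateResult:
    """Alternate two debaters; stop as soon as the judge returns an answer."""
    transcript: List[str] = []
    for round_no in range(1, max_rounds + 1):
        transcript.append("affirmative: " + affirmative(question, transcript))
        transcript.append("negative: " + negative(question, transcript))
        # On the last allowed round the judge is told it must decide.
        must_decide = round_no == max_rounds
        verdict = judge(question, transcript, must_decide)
        if verdict is not None:  # adaptive break: end the debate early
            return DebateResult(verdict, round_no, transcript)
    raise ValueError("judge returned no answer on the final round")


# Toy stubs standing in for LLM calls.
def affirmative(question: str, transcript: List[str]) -> str:
    return "The answer is 10."


def negative(question: str, transcript: List[str]) -> str:
    return "I disagree; the answer is 5."


def judge(question: str, transcript: List[str], must_decide: bool) -> Optional[str]:
    # Toy rule: decide once each side has spoken twice, or when forced to.
    if must_decide or len(transcript) >= 4:
        return "5"
    return None


result = run_debate("toy question", affirmative, negative, judge, max_rounds=3)
print(result.answer, result.rounds_used)
```

With these stubs the judge stops the debate after the second of three allowed rounds, which is the kind of early exit the abstract refers to as an adaptive break.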
### Case Analysis:
- **Common MT**: The MAD framework can correctly translate sentences that require commonsense understanding, whereas baselines (such as plain GPT-3.5-Turbo) are prone to literal translation errors.
- **Counter-Intuitive AR**: Through divergent thinking, the MAD framework can find the correct answers, while baseline methods (such as Self-Reflect) are prone to incorrect intuitive answers.
### Analysis and Discussion:
- **Mitigation of the DoT Problem**: The MAD framework addresses the DoT problem in self-reflection methods by introducing other agents' perspectives, improving model diversity and accuracy.
- **Impact of the Judge**: The choice of judge affects the performance of the MAD framework. When different LLMs act as debaters, the judge may not be fair and can show a preference for one of them.
- **Debate Intensity and Iteration Count**: A modest level of "tit for tat" and an adaptive break of the debate are needed for MAD to perform well, and complex tasks may require more iterations to achieve optimal results.
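One way to picture the "tit for tat" level is as a knob that selects how strongly a debater is instructed to disagree. The sketch below is purely illustrative: the level numbering, the instruction wording in `DEBATE_LEVELS`, and the helper `build_negative_prompt` are hypothetical and are not taken from the paper or its repository.

```python
# Hypothetical instruction texts, ordered from full agreement (0)
# to forced disagreement (3). The wording is an assumption for illustration.
DEBATE_LEVELS = {
    0: "Both sides must reach full agreement.",
    1: "Aim for consensus, but point out any clear error you see.",
    2: "You are not required to agree; push back on points you find weak.",
    3: "You must disagree with every point the other side makes.",
}


def build_negative_prompt(question: str, opponent_argument: str, level: int = 2) -> str:
    """Assemble the prompt for the opposing debater at a given intensity level."""
    if level not in DEBATE_LEVELS:
        raise ValueError(f"unknown debate level: {level}")
    return "\n".join([
        f"Question: {question}",
        f"Opponent's argument: {opponent_argument}",
        f"Instruction: {DEBATE_LEVELS[level]}",
    ])


prompt = build_negative_prompt("toy question", "The answer is 10.", level=2)
print(prompt)
```

Under this framing, a "modest" level would correspond to a middle setting rather than either extreme, which matches the finding summarized above.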
In summary, the paper proposes the MAD framework to mitigate the DoT problem that LLMs exhibit on complex reasoning tasks, and it shows that encouraging divergent thinking through multi-agent debate improves results on such tasks.