Abstract: Requiring a Large Language Model to generate intermediary reasoning steps has been shown to be an effective way of boosting performance. In fact, it has been found that instruction tuning on these intermediary reasoning steps improves model performance. In this work, we present a novel method of further improving performance by requiring models to compare multiple reasoning chains before generating a solution in a single inference step. We call this method Divergent CoT (DCoT). We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, LLMs. Through a rigorous set of experiments spanning a wide range of tasks that require various reasoning types, we show that fine-tuning on DCoT consistently improves performance over the CoT baseline across model families and scales (1.3B to 70B). Through a combination of empirical and manual evaluation, we additionally show that these performance gains stem from models generating multiple divergent reasoning chains in a single inference step, indicative of the enabling of self-correction in language models. Our code and data are publicly available at https://github.com/UKPLab/arxiv2024-divergent-cot.

What problem does this paper attempt to address?

This paper aims to improve the performance of large language models (LLMs) on a range of reasoning tasks by introducing a new method, Divergent Chain of Thought (DCoT). Specifically, it studies how to make a model generate multiple different reasoning chains in a single inference step and then select an answer from them, which yields self-correction without external feedback or prompt optimization. The method targets the limitations of conventional single-chain CoT, in particular its weak performance on smaller, more accessible language models.

### Main contributions of the paper:

1. **Introduction of Divergent CoT (DCoT)**: An extension of CoT in which the model generates multiple reasoning chains in a single inference step and then selects the final answer.
2. **Demonstration of the effectiveness of DCoT**: A series of experiments shows that DCoT outperforms the CoT baseline on a variety of reasoning tasks, across LLMs of different scales and families.
3. **Discovery of the self-correction ability of DCoT**: Empirical analysis shows that DCoT lets the model correct itself while generating the second reasoning chain, without external feedback or prompt optimization.

### Specific methods and experimental design:

- **Dataset generation**: GPT-3.5 Turbo generates CoTs in a zero-shot setting. Four randomly selected CoT triggers are used to produce multiple CoTs for each question.
- **Fine-tuning method**: A DCoT instruction template requires the model to generate a specified number of CoTs and select the final answer in a single inference step. Conventional CoT fine-tuning is carried out as the baseline.
- **Evaluation metrics**: Macro-averaged F1 is used for classification tasks and the SQuAD metric for span-extraction tasks.

### Experimental results:

- **In-domain tasks**: DCoT shows consistent and significant performance improvements on all models and datasets.
- **Out-of-domain tasks**: DCoT also outperforms CoT on out-of-domain tasks, especially mathematics, commonsense, and symbolic reasoning tasks.
- **Robustness**: On the BIG-Bench Hard benchmark, DCoT performs comparably to CoT, indicating that it does not degrade performance on challenging tasks.
- **Self-correction ability**: An experiment in which the model generates two CoTs shows that DCoT can self-correct, a result further supported by manual analysis.

### Conclusion:

By introducing DCoT, the paper improves the performance of LLMs on a variety of reasoning tasks and shows, for the first time according to the authors, that a model can self-correct while generating multiple reasoning chains without external feedback or prompt optimization, a finding relevant to future language model research and applications.
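
The fine-tuning setup summarized above (one prompt that requests a specified number of reasoning chains, followed by a single final answer) can be sketched as follows. This is only an illustration under stated assumptions: the bracketed tags, the `DCoTExample` class, and the helper functions are hypothetical names chosen for this sketch and are not taken from the paper or its repository, whose actual template may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical marker for the final answer; the paper's real template may differ.
FINAL_TAG = "[Final answer]"


@dataclass
class DCoTExample:
    question: str
    chains: List[str]  # k reasoning chains for the same question
    answer: str        # single answer chosen after comparing the chains


def build_prompt(question: str, k: int) -> str:
    """Build an instruction asking for k reasoning chains in one generation."""
    return f"[Question] {question}\n[Number of answers] {k}\n"


def build_target(example: DCoTExample) -> str:
    """Serialize all chains and the final answer into one training target."""
    lines = [
        f"[Answer {i}] {chain}"
        for i, chain in enumerate(example.chains, start=1)
    ]
    lines.append(f"{FINAL_TAG} {example.answer}")
    return "\n".join(lines)


def parse_final_answer(generation: str) -> Optional[str]:
    """Return the text after the last final-answer marker, or None if absent."""
    for line in reversed(generation.splitlines()):
        stripped = line.strip()
        if stripped.startswith(FINAL_TAG):
            return stripped[len(FINAL_TAG):].strip()
    return None


if __name__ == "__main__":
    example = DCoTExample(
        question="What is 3 + 4 * 2?",
        chains=[
            "Add first: 3 + 4 = 7, then 7 * 2 = 14.",
            "Multiplication takes precedence: 4 * 2 = 8, then 3 + 8 = 11.",
        ],
        answer="11",
    )
    print(build_prompt(example.question, len(example.chains)))
    print(build_target(example))
```

In this sketch the second chain disagrees with the first, and the final answer follows the second one, which mirrors the kind of within-generation self-correction the summary describes. A CoT baseline example would correspond to the same structure with a single chain.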

Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models
