Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models

Haritz Puerto, Tilek Chubakov, Xiaodan Zhu, Harish Tayyar Madabushi, Iryna Gurevych
2024-07-03
Abstract: Requiring a Large Language Model to generate intermediary reasoning steps has been shown to be an effective way of boosting performance. In fact, it has been found that instruction tuning on these intermediary reasoning steps improves model performance. In this work, we present a novel method of further improving performance by requiring models to compare multiple reasoning chains before generating a solution in a single inference step. We call this method Divergent CoT (DCoT). We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, LLMs. Through a rigorous set of experiments spanning a wide range of tasks that require various reasoning types, we show that fine-tuning on DCoT consistently improves performance over the CoT baseline across model families and scales (1.3B to 70B). Through a combination of empirical and manual evaluation, we additionally show that these performance gains stem from models generating multiple divergent reasoning chains in a single inference step, indicative of the enabling of self-correction in language models. Our code and data are publicly available at https://github.com/UKPLab/arxiv2024-divergent-cot.
Computation and Language
What problem does this paper attempt to address?
This paper aims to improve the performance of large language models (LLMs) on a range of reasoning tasks by introducing a new method, Divergent Chain of Thought (DCoT). Specifically, it studies how to make a model generate multiple different reasoning chains in a single inference step and then select one answer from them, which gives the model a form of self-correction without external feedback or prompt optimization. The method targets the limitations of the conventional single-chain approach (CoT), in particular its weak performance on smaller, more accessible language models.

### Main contributions of the paper:

1. **Introduction of Divergent CoT (DCoT)**: An extension of CoT in which the model generates multiple reasoning chains in a single inference step and then selects the final answer.
2. **Demonstration of the effectiveness of DCoT**: A series of experiments shows that, across LLMs of different scales and families, DCoT outperforms the conventional CoT method on a variety of reasoning tasks.
3. **Discovery of the self-correction ability of DCoT**: Empirical analysis shows that DCoT lets the model correct itself when it generates a second reasoning chain, without external feedback or prompt optimization.

### Specific methods and experimental design:

- **Dataset generation**: GPT-3.5 Turbo generates CoTs in a zero-shot setting. Four CoT trigger phrases are selected at random to produce multiple CoTs for each question.
- **Fine-tuning method**: A DCoT instruction template requires the model to generate a specified number of CoTs and select the final answer in a single inference step. Conventional CoT fine-tuning is carried out as the baseline for comparison.
- **Evaluation metrics**: Macro-averaged F1 is used for classification tasks and the SQuAD metric for span extraction tasks.

### Experimental results:

- **In-domain tasks**: DCoT yields consistent and significant performance improvements on all models and datasets.
- **Out-of-domain tasks**: DCoT also outperforms CoT on out-of-domain tasks, especially in mathematics, commonsense, and symbolic reasoning.
- **Robustness**: On the BIG-Bench Hard benchmark, DCoT performs comparably to CoT, indicating that it does not degrade performance on challenging tasks.
- **Self-correction ability**: An experiment in which the model generates two CoTs shows that DCoT is capable of self-correction, a finding further supported by manual analysis.

### Conclusion:

By introducing DCoT, the paper improves the performance of LLMs on a variety of reasoning tasks and shows for the first time that a model can correct itself while generating multiple reasoning chains, without external feedback or prompt optimization. This finding is significant for future language model research and applications.
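
To make the DCoT fine-tuning format described under "Specific methods and experimental design" more concrete, the following is a minimal sketch of how several reasoning chains and one final answer might be packed into a single training target, and how the final answer could be read back from a generation. The tag names (`[Question]`, `[Number of answers]`, `[Answer i]`, `[Final answer]`) and both function names are assumptions made for illustration; the template actually used by the authors may differ and is available in their repository.

```python
import re
from typing import Dict, List, Optional


def build_dcot_example(question: str, cots: List[str], final_answer: str) -> Dict[str, str]:
    """Pack k reasoning chains and one final answer into a single training target.

    Hypothetical helper: the tag names below are illustrative, not the paper's.
    """
    if not cots:
        raise ValueError("at least one reasoning chain is required")
    k = len(cots)
    # The prompt states how many chains the model is expected to produce.
    prompt = f"[Question] {question}\n[Number of answers] {k}\n"
    # All chains sit in one response, so they are generated in a single inference step.
    chains = "\n".join(f"[Answer {i}] {cot}" for i, cot in enumerate(cots, start=1))
    response = f"{chains}\n[Final answer] {final_answer}"
    return {"prompt": prompt, "response": response}


def extract_final_answer(generation: str) -> Optional[str]:
    """Return the text after the last '[Final answer]' tag, or None if the tag is absent."""
    matches = re.findall(r"\[Final answer\]\s*(.+)", generation)
    return matches[-1].strip() if matches else None


if __name__ == "__main__":
    example = build_dcot_example(
        question="What is 2 + 3?",
        cots=["2 + 3 = 6.", "Counting up from 2 by 3 gives 5."],
        final_answer="5",
    )
    print(example["prompt"] + example["response"])
    print(extract_final_answer(example["response"]))
```

In this toy example the first chain is wrong and the second is right, which mirrors the self-correction behaviour the summary reports: the final answer is chosen after the later chain has been generated. A CoT baseline in the same sketch would correspond to calling `build_dcot_example` with a single chain.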