When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models

Yanhong Li, Chenghao Yang, Allyson Ettinger
2024-04-14
Abstract: Recent studies suggest that self-reflective prompting can significantly enhance the reasoning capabilities of Large Language Models (LLMs). However, the use of external feedback as a stop criterion raises doubts about the true extent of LLMs' ability to emulate human-like self-reflection. In this paper, we set out to clarify these capabilities under a more stringent evaluation setting in which we disallow any kind of external feedback. Our findings under this setting show a split: while self-reflection enhances performance in TruthfulQA, it adversely affects results in HotpotQA. We conduct follow-up analyses to clarify the contributing factors in these patterns, and find that the influence of self-reflection is impacted both by reliability of accuracy in models' initial responses, and by overall question difficulty: specifically, self-reflection shows the most benefit when models are less likely to be correct initially, and when overall question difficulty is higher. We also find that self-reflection reduces tendency toward majority voting. Based on our findings, we propose guidelines for decisions on when to implement self-reflection. We release the codebase for reproducing our experiments at
Computation and Language
What problem does this paper attempt to address?
This paper evaluates the self-reflection ability of large language models (LLMs) in the absence of external feedback. Specifically, the authors ask whether self-reflection genuinely improves LLM performance on complex reasoning tasks across different question-answering datasets. They test this in a more stringent setting: no external feedback of any kind is allowed, and multi-round iterative prompting is not allowed, so the model cannot pick up hints from its previous mistakes. This setting is intended to reflect the model's own self-reflection ability more faithfully. The main research questions are:

1. **The effect of self-reflection on different datasets**: The authors use two representative datasets, TruthfulQA and HotpotQA, which evaluate the truthfulness of model-generated answers and performance on multi-hop reasoning, respectively. Under the strict setting, self-reflection improves performance on TruthfulQA but decreases performance on HotpotQA.
2. **Factors affecting the effect of self-reflection**: Further analysis shows that the effect of self-reflection depends on the accuracy of the model's initial response and on the difficulty of the question. Self-reflection helps most when the model's initial response is likely to be wrong and the question is difficult; when the initial response is likely to be correct or the question is easy, self-reflection may instead be harmful.
3. **The influence of self-reflection on the majority-voting tendency**: The study also finds that self-reflection reduces the model's tendency toward majority voting, which suggests that self-reflection promotes a more complex decision-making process, although it sometimes leads to a decrease in accuracy.

Based on these findings, the authors propose guidelines for using self-reflection in practical applications: the decision on whether to apply self-reflection should be made according to the estimated response accuracy (RA) and the difficulty of the question.
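
The guideline above can be read as a simple gating rule. The following is a minimal sketch of that reading, not the authors' implementation: the class name `ReflectionPolicy`, the `[0, 1]` scaling of both inputs, and the two cutoff values are all hypothetical, since the summary gives no numeric thresholds and does not say how response accuracy or difficulty should be estimated.

```python
from dataclasses import dataclass


@dataclass
class ReflectionPolicy:
    """Hypothetical gate for deciding whether to apply self-reflection.

    Both cutoffs are illustrative assumptions, not values from the paper.
    """

    max_accuracy: float = 0.5    # reflect only if estimated accuracy is below this
    min_difficulty: float = 0.5  # reflect only if estimated difficulty is above this

    def should_reflect(self, est_accuracy: float, est_difficulty: float) -> bool:
        # Both estimates are assumed to be scaled to [0, 1] by the caller.
        for name, value in (("est_accuracy", est_accuracy),
                            ("est_difficulty", est_difficulty)):
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        # Reflection is reported to help most when the initial answer is
        # unlikely to be correct AND the question is hard; otherwise skip it.
        return (est_accuracy < self.max_accuracy
                and est_difficulty > self.min_difficulty)


if __name__ == "__main__":
    policy = ReflectionPolicy()
    print(policy.should_reflect(est_accuracy=0.2, est_difficulty=0.8))
    print(policy.should_reflect(est_accuracy=0.9, est_difficulty=0.8))
```

Requiring both conditions follows the summary's statement that self-reflection may be harmful when the initial response is accurate or the question is simple. In practice the cutoffs would need to be calibrated on held-out data for the model and task at hand.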