Abstract: Recent studies suggest that self-reflective prompting can significantly enhance the reasoning capabilities of Large Language Models (LLMs). However, the use of external feedback as a stop criterion raises doubts about the true extent of LLMs' ability to emulate human-like self-reflection. In this paper, we set out to clarify these capabilities under a more stringent evaluation setting in which we disallow any kind of external feedback. Our findings under this setting show a split: while self-reflection enhances performance in TruthfulQA, it adversely affects results in HotpotQA. We conduct follow-up analyses to clarify the contributing factors in these patterns, and find that the influence of self-reflection is impacted both by reliability of accuracy in models' initial responses, and by overall question difficulty: specifically, self-reflection shows the most benefit when models are less likely to be correct initially, and when overall question difficulty is higher. We also find that self-reflection reduces tendency toward majority voting. Based on our findings, we propose guidelines for decisions on when to implement self-reflection. We release the codebase for reproducing our experiments at

What problem does this paper attempt to address?

This paper evaluates the self-reflection ability of large language models (LLMs) in the absence of external feedback. Specifically, the authors ask whether self-reflection truly improves LLM performance on complex reasoning tasks across different question-answering datasets. To test this, they use a more stringent evaluation setting: no external feedback of any kind is allowed, and multi-round iterative prompting is not allowed, so the model cannot obtain hints from previous mistakes. This setting is intended to reflect the true self-reflection ability of LLMs more realistically. The main research questions are:

1. **The effect of self-reflection on different datasets**: The authors use two representative datasets, TruthfulQA and HotpotQA, which evaluate the truthfulness of model-generated answers and performance on multi-hop reasoning tasks respectively. Under the strict setting, self-reflection improved performance on TruthfulQA but decreased performance on HotpotQA.
2. **Factors affecting the effect of self-reflection**: Further analysis shows that the effect of self-reflection depends on the accuracy of the model's initial response and on the difficulty of the question. Self-reflection is most effective when the model's initial response is likely to be inaccurate and the question is difficult; when the initial response is likely to be accurate or the question is simple, self-reflection may instead be harmful.
3. **The influence of self-reflection on the majority-voting tendency**: The study also found that self-reflection reduces the model's tendency toward majority voting, which indicates that self-reflection promotes a more complex decision-making process, although it sometimes leads to a decrease in accuracy.

Based on these findings, the authors propose guidelines for using self-reflection in practical applications: the decision on whether to use self-reflection should be made according to the estimated response accuracy (RA) and the difficulty of the question.
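
The guideline can be pictured as a simple gating rule. The following is only a minimal sketch, not the authors' procedure: the function name `should_self_reflect`, the 0-to-1 scales for both inputs, and the 0.5 thresholds are all hypothetical choices made for illustration, and the summary does not say how RA or difficulty are estimated in practice.

```python
def should_self_reflect(
    estimated_accuracy: float,
    difficulty: float,
    accuracy_threshold: float = 0.5,    # hypothetical cut-off, not from the paper
    difficulty_threshold: float = 0.5,  # hypothetical cut-off, not from the paper
) -> bool:
    """Decide whether to add a self-reflection step for a question.

    Assumes both inputs are scores in [0, 1]:
    - estimated_accuracy: how likely the initial response is to be correct (RA)
    - difficulty: how hard the question is judged to be
    """
    for name, value in (("estimated_accuracy", estimated_accuracy),
                        ("difficulty", difficulty)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    # Reflect only when the first answer is unlikely to be right AND the
    # question is hard; otherwise reflection may do more harm than good.
    low_accuracy = estimated_accuracy < accuracy_threshold
    hard_question = difficulty >= difficulty_threshold
    return low_accuracy and hard_question


if __name__ == "__main__":
    print(should_self_reflect(estimated_accuracy=0.3, difficulty=0.8))
    print(should_self_reflect(estimated_accuracy=0.9, difficulty=0.8))
```

Requiring both conditions mirrors the summary's statement that self-reflection may be harmful when the initial response is accurate or the question is simple; any real deployment would need thresholds calibrated on held-out data.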

When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models
