Self-Reflection in LLM Agents: Effects on Problem-Solving Performance

Matthew Renze, Erhan Guven
2024-10-17
Abstract: In this study, we investigated the effects of self-reflection in large language models (LLMs) on problem-solving performance. We instructed nine popular LLMs to answer a series of multiple-choice questions to provide a performance baseline. For each incorrectly answered question, we instructed eight types of self-reflecting LLM agents to reflect on their mistakes and provide themselves with guidance to improve problem-solving. Then, using this guidance, each self-reflecting agent attempted to re-answer the same questions. Our results indicate that LLM agents are able to significantly improve their problem-solving performance through self-reflection ($p < 0.001$). In addition, we compared the various types of self-reflection to determine their individual contribution to performance. All code and data are available on GitHub at https://github.com/matthewrenze/self-reflection
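The procedure described in the abstract (a baseline pass, reflection on each incorrect answer, then a second attempt) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `Question`, `run_experiment`, and the two toy callables are hypothetical names, the LLM calls are replaced by deterministic stand-ins, and the sketch assumes that questions answered correctly at baseline also count toward the reflecting agent's score.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Question:
    text: str
    answer: str  # label of the correct choice, e.g. "B"


# Hypothetical stand-ins for LLM calls (assumptions, not a real API):
#   AnswerFn(question, guidance) -> chosen label; guidance is None on the first pass
#   ReflectFn(question, wrong_label) -> guidance text for the second attempt
AnswerFn = Callable[[Question, Optional[str]], str]
ReflectFn = Callable[[Question, str], str]


def run_experiment(
    questions: Sequence[Question],
    answer_fn: AnswerFn,
    reflect_fn: ReflectFn,
) -> Tuple[float, float]:
    """Return (baseline accuracy, accuracy after one round of self-reflection)."""
    if not questions:
        raise ValueError("at least one question is required")
    baseline_correct = 0
    reflected_correct = 0
    for q in questions:
        first = answer_fn(q, None)
        if first == q.answer:
            # Assumption: correct baseline answers carry over unchanged.
            baseline_correct += 1
            reflected_correct += 1
            continue
        # Incorrect answer: reflect on the mistake, then re-answer with guidance.
        guidance = reflect_fn(q, first)
        if answer_fn(q, guidance) == q.answer:
            reflected_correct += 1
    n = len(questions)
    return baseline_correct / n, reflected_correct / n


def toy_answer(q: Question, guidance: Optional[str]) -> str:
    # Toy model: always picks "A" unless guidance tells it to avoid "A".
    return "B" if guidance and "avoid A" in guidance else "A"


def toy_reflect(q: Question, wrong_label: str) -> str:
    # Toy reflection: only records which label to avoid next time.
    return f"avoid {wrong_label}"


demo = [Question("q1", "A"), Question("q2", "B"), Question("q3", "C")]
baseline, reflected = run_experiment(demo, toy_answer, toy_reflect)
print(f"baseline={baseline:.2f} reflected={reflected:.2f}")
```

In a real setup, the two callables would wrap model API calls, and the paired per-question outcomes (before and after reflection) would feed the significance test reported in the abstract.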
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses whether self-reflection can improve the problem-solving performance of large language models (LLMs). Specifically, the researchers measured how different types of self-reflection affect LLMs' accuracy on multiple-choice questions and validated the effects experimentally.

### Background and Motivation

1. **Background**:
   - Self-reflection is a metacognitive strategy that helps individuals review their thought processes, identify errors, and improve future decisions.
   - LLMs can improve their problem-solving through Chain of Thought (CoT) reasoning but still make logical and mathematical errors.
   - Humans can avoid repeating the same mistakes through self-reflection, so the researchers apply this strategy to LLMs to improve their problem-solving abilities.
2. **Motivation**:
   - Although existing LLMs perform well on multi-step problem-solving tasks, they still suffer from limited knowledge, reasoning errors, and hallucinated outputs.
   - With self-reflection, LLMs can better identify and correct their errors, thereby improving overall performance.

### Research Methods

1. **Dataset**:
   - Data from multiple popular LLM benchmarks were used, including ARC, AGIEval, HellaSwag, MedMCQA, etc.
   - 100 questions were randomly selected from each dataset, forming a multi-domain exam with 1000 questions.
2. **Models**:
   - Nine popular LLMs were evaluated, including GPT-4, Llama 2 70B, Gemini 1.5 Pro, etc.
3. **Agent Types**:
   - Eight types of self-reflection agents were designed. After an incorrect answer, each agent generates a different kind of self-reflection text and then re-answers the question.
   - The agent types are: Retry, Keywords, Advice, Explanation, Instructions, Solution, Composite, and Unredacted.
4. **Experimental Procedure**:
   - The baseline agent first answered all questions. Correct answers counted towards the baseline score, and incorrect answers entered the reflection queue.
   - Each self-reflection agent reflected on the incorrect answers, generated the corresponding self-reflection text, and re-answered the questions.
   - The accuracy of each agent was calculated and compared with that of the baseline agent to measure the impact of each type of self-reflection on performance.

### Main Findings

1. **Overall Effect**:
   - All types of self-reflection significantly improved LLM performance (p < 0.001).
   - Self-reflection types containing more information (such as Instructions, Explanation, and Solution) performed better.
2. **Model Differences**:
   - Different LLMs showed similar performance improvements across the types of self-reflection, and the improvements were statistically significant.
3. **Problem Domain Differences**:
   - In some problem domains (such as LSAT-AR), the effect of self-reflection was more pronounced, while in others (such as SAT English) it was smaller.

### Conclusion

1. **Practical Significance**:
   - This study provides practical guidance for building LLM systems with self-reflection capabilities, helping them avoid repeated errors and improve problem-solving.
   - For AI engineers, these findings can inform the design of more effective LLM agents.
2. **Theoretical Significance**:
   - This study provides a basis for studying metacognitive processes in LLMs, indicating that LLMs can improve their performance through self-reflection.

### Future Research Directions

1. **Complex Problems**:
   - Use more complex multi-step problems to further verify the effect of self-reflection.
2. **External Tools**:
   - Explore the effect of combining self-reflection with external tools (such as compilers and search engines).
3. **External