Abstract: In this study, we investigated the effects of self-reflection in large language models (LLMs) on problem-solving performance. We instructed nine popular LLMs to answer a series of multiple-choice questions to provide a performance baseline. For each incorrectly answered question, we instructed eight types of self-reflecting LLM agents to reflect on their mistakes and provide themselves with guidance to improve problem-solving. Then, using this guidance, each self-reflecting agent attempted to re-answer the same questions. Our results indicate that LLM agents are able to significantly improve their problem-solving performance through self-reflection ($p < 0.001$). In addition, we compared the various types of self-reflection to determine their individual contribution to performance. All code and data are available on GitHub at https://github.com/matthewrenze/self-reflection
What problem does this paper attempt to address?
This paper asks whether self-reflection can improve the problem-solving performance of large language models (LLMs). Specifically, the researchers examined how different types of self-reflection affect LLM accuracy on multiple-choice questions and tested those effects experimentally.
### Background and Motivation
1. **Background**:
- Self-reflection is a metacognitive strategy that helps individuals review their thought processes, identify errors, and improve future decisions.
- Chain-of-thought (CoT) prompting improves the problem-solving abilities of large language models (LLMs), but they still make logical and mathematical errors.
- Humans can avoid repeating the same mistakes through self-reflection, so researchers hope to apply this strategy to LLMs to improve their problem-solving abilities.
2. **Motivation**:
- Although existing LLMs perform well in multi-step problem-solving tasks, they still face issues such as limited knowledge, reasoning errors, and hallucinated outputs.
- By introducing self-reflection, LLMs can better identify and correct their errors, thereby improving overall performance.
### Research Methods
1. **Dataset**:
- Questions were drawn from multiple popular LLM benchmarks, such as ARC, AGIEval, HellaSwag, and MedMCQA.
- 100 questions were randomly selected from each dataset, forming a multi-domain exam with 1000 questions.
2. **Models**:
- Nine popular LLMs were evaluated, among them GPT-4, Llama 2 70B, and Gemini 1.5 Pro.
3. **Agent Types**:
- Eight types of self-reflection agents were designed. After an incorrect answer, each agent generated its own kind of self-reflection text and then re-answered the question.
- These agent types include: Retry, Keywords, Advice, Explanation, Instructions, Solution, Composite, and Unredacted.
4. **Experimental Procedure**:
- The baseline agent first answered all questions. Correct answers counted towards the baseline score, and incorrectly answered questions were placed in the reflection queue.
- Each self-reflection agent reflected on the incorrect answers, generated corresponding self-reflection texts, and re-answered the questions.
- By calculating the accuracy of each agent and comparing it with the baseline agent, the impact of different types of self-reflection on performance was analyzed.
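
The procedure above can be sketched in a few lines of Python. This is only an illustration of the baseline, reflect, and re-answer loop as the summary describes it, not the authors' code (which is in the linked repository). The names `Question`, `run_experiment`, `toy_answer`, and `toy_reflect` are hypothetical, the two callables stand in for real LLM calls, and the sketch assumes that questions answered correctly at baseline also count as correct for the self-reflecting agent.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Question:
    text: str
    choices: tuple[str, ...]
    correct: str  # letter of the correct choice, e.g. "B"


# Hypothetical signatures standing in for LLM calls:
#   answer(question, reflection_or_None) -> chosen letter
#   reflect(question, wrong_answer)      -> self-reflection text
AnswerFn = Callable[[Question, Optional[str]], str]
ReflectFn = Callable[[Question, str], str]


def run_experiment(questions, answer: AnswerFn, reflect: ReflectFn):
    """Return (baseline_accuracy, reflection_accuracy)."""
    baseline_correct = 0
    reflection_queue = []  # (question, wrong answer) pairs

    # Pass 1: the baseline agent answers every question.
    for q in questions:
        first = answer(q, None)
        if first == q.correct:
            baseline_correct += 1
        else:
            reflection_queue.append((q, first))

    # Pass 2: reflect on each mistake, then re-answer with that guidance.
    fixed = 0
    for q, wrong in reflection_queue:
        guidance = reflect(q, wrong)
        if answer(q, guidance) == q.correct:
            fixed += 1

    n = len(questions)
    # Assumption: baseline-correct questions carry over to the agent's score.
    return baseline_correct / n, (baseline_correct + fixed) / n


# Toy stand-ins so the sketch runs without a model.
def toy_answer(q: Question, reflection: Optional[str]) -> str:
    return "A" if reflection is None else "B"


def toy_reflect(q: Question, wrong: str) -> str:
    return f"I chose {wrong}; re-read the question and eliminate options."


demo = [
    Question("q1", ("A", "B", "C"), "A"),
    Question("q2", ("A", "B", "C"), "B"),
    Question("q3", ("A", "B", "C"), "C"),
]
baseline_acc, reflection_acc = run_experiment(demo, toy_answer, toy_reflect)
print(f"{baseline_acc:.2f} {reflection_acc:.2f}")  # prints "0.33 0.67"
```

Swapping `toy_reflect` for a function that produces a different kind of text (keywords, advice, an explanation, and so on) would give one variant per agent type, with the rest of the loop unchanged.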
### Main Findings
1. **Overall Effect**:
- All types of self-reflection significantly improved LLM performance (p < 0.001).
- Among them, self-reflection types containing more information (such as Instructions, Explanation, Solution) performed better.
2. **Model Differences**:
- Different LLMs showed similar performance improvements under the various types of self-reflection, and the improvements were statistically significant.
3. **Problem Domain Differences**:
- In certain problem domains (such as LSAT-AR), the effect of self-reflection was more pronounced, while in others (such as SAT English) it was smaller.
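
The summary reports significance at p < 0.001 but does not name the statistical test. Because each agent answers the same questions as the baseline, a paired test on per-question outcomes is one natural choice. The sketch below implements an exact two-sided McNemar test as an assumed example of such a test; the function name `mcnemar_exact` is hypothetical and the counts in the usage lines are made up for illustration, not taken from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value for paired correct/incorrect outcomes.

    b: questions correct at baseline but wrong after reflection
    c: questions wrong at baseline but correct after reflection
    Questions with the same outcome in both conditions do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative counts only (not results from the paper).
p_value = mcnemar_exact(b=0, c=20)
print(p_value < 0.001)  # prints "True"
```

When only incorrectly answered questions are retried, `b` is zero by construction, so the test reduces to asking whether the number of corrected answers is larger than chance would explain.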
### Conclusion
1. **Practical Significance**:
- This study provides practical guidance for building LLM systems with self-reflection capabilities, helping to avoid repeated errors and improve problem-solving abilities.
- For AI engineers, these findings can inform the design of more effective LLM agents.
2. **Theoretical Significance**:
- This study provides a theoretical basis for studying metacognitive processes in LLMs, indicating that LLMs can improve their performance through self-reflection.
### Future Research Directions
1. **Complex Problems**:
- Use more complex multi-step problems to further verify the effect of self-reflection.
2. **External Tools**:
- Explore the effect of combining self-reflection with external tools (such as compilers, search engines).
3. **External