Reflection-Bench: probing AI intelligence with reflection

Lingyu Li, Yixu Wang, Haiquan Zhao, Shuqi Kong, Yan Teng, Chunbo Li, Yingchun Wang
2024-10-22
Abstract: The ability to adapt beliefs or behaviors in response to unexpected outcomes, reflection, is fundamental to intelligent systems' interaction with the world. From a cognitive science perspective, this serves as a core principle of intelligence applicable to both human and AI systems. To address the debate on the intelligence of large language models (LLMs), we propose Reflection-Bench, a comprehensive benchmark comprising 7 tasks spanning core cognitive functions crucial for reflection, including perception, memory, belief updating, decision-making, prediction, counterfactual thinking, and meta-reflection. We evaluate the performance of 13 prominent LLMs, including OpenAI o1, GPT-4, and Claude 3.5 Sonnet. The results indicate that current LLMs still lack satisfactory reflection ability. We discuss the underlying causes of these results and suggest potential avenues for future research. In conclusion, Reflection-Bench offers both evaluation tools and inspiration for developing AI capable of reliably interacting with the environment. Our data and code are available at https://github.com/YabYum/ReflectionBench.
Artificial Intelligence
What problem does this paper attempt to address?
The paper asks whether large language models (LLMs) possess genuine human-level intelligence or are merely complex statistical engines that imitate human language. Specifically, it evaluates the intelligence of LLMs through the concept of "reflection" and proposes a comprehensive benchmark, Reflection-Bench, to assess how well LLMs adjust their beliefs or behaviors when faced with unexpected outcomes.

### Core Problems of the Paper

1. **Evaluating the Intelligence Level of LLMs**:
   - There is an ongoing debate over whether LLMs have human-level intelligence. Supporters believe that LLMs are highly intelligent and may pose risks that call for more stringent regulation, while skeptics think that over-regulation may hinder innovation.
   - The paper aims to provide a more detailed, biologically inspired standard for evaluating LLM intelligence by introducing "reflection", a core concept in cognitive science.
2. **Designing Evaluation Tools**:
   - The paper proposes Reflection-Bench, a comprehensive benchmark of 7 tasks covering core cognitive functions: perception, memory, belief updating, decision-making, prediction, counterfactual thinking, and meta-reflection.
   - The tasks are based on established cognitive science paradigms and are intended to evaluate LLM performance on each of these cognitive components.
3. **Analyzing the Current Performance of LLMs**:
   - The paper evaluates 13 LLMs. The results show that current LLMs still have significant deficiencies in reflection ability, and that meta-reflection in particular is generally lacking.
   - Based on these results, the paper discusses potential causes of the deficiencies and offers suggestions for future research.

### Main Contributions

1. **Introducing "Reflection" as an Evaluation Criterion**:
   - Using "reflection" as a biologically inspired criterion for evaluating AI intelligence provides a framework that is more closely aligned with human intelligence.
2. **Proposing the Reflection-Bench Benchmark**:
   - The benchmark consists of 7 tasks covering multiple core cognitive functions, designed to evaluate the reflection ability of LLMs comprehensively.
3. **Evaluating the Performance of 13 LLMs**:
   - The evaluation of 13 LLMs reveals significant deficiencies in human-level reflection ability, especially the lack of meta-reflection.

### Conclusion

By introducing the concept of "reflection" and proposing Reflection-Bench, the paper provides a new perspective and tool for evaluating the intelligence of LLMs. The results show that although LLMs perform well on some tasks, they still have significant deficiencies in core cognitive functions, most notably a lack of meta-reflection ability. These findings emphasize that future AI systems need to balance different cognitive requirements in order to achieve true intelligence.