Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests

Max J. van Duijn, Bram M.A. van Dijk, Tom Kouwenhoven, Werner de Valk, Marco R. Spruit, Peter van der Putten
2023-10-31
Abstract: To what degree should we ascribe cognitive capacities to Large Language Models (LLMs), such as the ability to reason about intentions and beliefs known as Theory of Mind (ToM)? Here we add to this emerging debate by (i) testing 11 base- and instruction-tuned LLMs on capabilities relevant to ToM beyond the dominant false-belief paradigm, including non-literal language usage and recursive intentionality; (ii) using newly rewritten versions of standardized tests to gauge LLMs' robustness; (iii) prompting and scoring for open besides closed questions; and (iv) benchmarking LLM performance against that of children aged 7-10 on the same tasks. We find that instruction-tuned LLMs from the GPT family outperform other models, and often also children. Base-LLMs are mostly unable to solve ToM tasks, even with specialized prompting. We suggest that the interlinked evolution and development of language and ToM may help explain what instruction-tuning adds: rewarding cooperative communication that takes into account interlocutor and context. We conclude by arguing for a nuanced perspective on ToM in LLMs.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper explores the extent to which large language models (LLMs) possess Theory of Mind (ToM), the ability to understand others' beliefs, intentions, desires, and other mental states in order to predict and explain their behavior. Specifically, the researchers tested 11 state-of-the-art foundation (base) and instruction-tuned models on a series of ToM-related tasks and compared their performance with that of children aged 7-10.

### Main Research Content

1. **Test Subjects**:
   - 11 foundation models and instruction-tuned models.
   - Children aged 7-10 (37 children aged 7-8 and 36 children aged 9-10).
2. **Test Tasks**:
   - **Sally-Anne Test** (first-order and second-order false belief tests): Evaluates the ability to understand others' beliefs.
   - **Strange Stories Test**: Assesses understanding of non-literal language and complex social situations.
   - **Imposing Memory Test**: Evaluates the handling of recursive intentionality and memory tasks.
3. **Test Methods**:
   - Using newly rewritten versions of standardized tests to assess the robustness of the models.
   - Including both open-ended and closed-ended questions.
   - Scoring and comparing the test results of the models and children.

### Main Findings

1. **Sally-Anne Test**:
   - Most foundation models performed better than children on first-order ToM tasks but performed worse than or comparably to children on second-order ToM tasks.
   - Instruction-tuned models generally outperformed children, with GPT-4 and GPT-3.5 maintaining high performance on second-order ToM tasks.
2. **Strange Stories Test**:
   - As task difficulty increased, children's performance gradually declined, while most models' performance remained relatively stable, even surpassing children on the most complex tasks.
   - GPT-4 performed nearly perfectly on all tasks, and other large instruction-tuned models also performed very well.
3. **Imposing Memory Test**:
   - Foundation models performed worse than children on all tasks, but larger foundation models showed improvement with increasing recursive levels.
   - Instruction-tuned models generally performed worse than children, but GPT-4 performed well at all levels and, in particular, stayed at a high level beyond second-order ToM tasks.

### Conclusion

The researchers suggest that instruction-tuning improves performance on ToM-related tasks by enhancing cooperative communication. This may be because instruction-tuning rewards communication that takes the interlocutor and context into account. The study also emphasizes the close relationship between language and ToM in human development and evolution, and how this relationship may help explain the ToM capabilities of LLMs.

### Research Significance

This study provides detailed data on the performance of LLMs on ToM tasks and offers new perspectives for further exploring the cognitive abilities and potential applications of LLMs. Comparing model results with children's performance on the same tasks gives a fuller picture of the capabilities and limitations of these models.
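
The scoring-and-comparison step listed under Test Methods can be illustrated with a minimal sketch. It is not the authors' scoring procedure: it assumes closed questions with a single correct answer, exact-match scoring after normalization, and a benchmark defined as the mean child score. The function names `score_closed` and `compare_to_benchmark` and all input values are hypothetical placeholders, not data from the paper.

```python
from statistics import mean


def score_closed(responses, answer_key):
    """Return the fraction of closed questions answered correctly.

    Assumption: a response counts as correct only if it matches the key
    exactly after trimming whitespace and lowercasing.
    """
    if not responses or len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must be non-empty and equal in length")
    hits = [
        1.0 if r.strip().lower() == k.strip().lower() else 0.0
        for r, k in zip(responses, answer_key)
    ]
    return mean(hits)


def compare_to_benchmark(model_scores, child_scores):
    """Label each model relative to the mean child score (hypothetical benchmark)."""
    benchmark = mean(child_scores)
    labels = {}
    for name, score in model_scores.items():
        if score > benchmark:
            labels[name] = "above"
        elif score < benchmark:
            labels[name] = "below"
        else:
            labels[name] = "equal"
    return labels


if __name__ == "__main__":
    # Placeholder inputs for illustration only; not results from the study.
    accuracy = score_closed(["Yes", "no "], ["yes", "yes"])
    print(accuracy)
    print(compare_to_benchmark({"model_a": 0.9, "model_b": 0.4}, [0.5, 0.7]))
```

Open-ended answers would need a rubric or human rating instead of exact matching, which this sketch does not attempt.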