LLMs achieve adult human performance on higher-order theory of mind tasks

Winnie Street, John Oliver Siy, Geoff Keeling, Adrien Baranes, Benjamin Barnett, Michael McKibben, Tatenda Kanyere, Alison Lentz, Blaise Aguera y Arcas, Robin I. M. Dunbar
2024-05-31
Abstract: This paper examines the extent to which large language models (LLMs) have developed higher-order theory of mind (ToM): the human ability to reason about multiple mental and emotional states in a recursive manner (e.g. I think that you believe that she knows). This paper builds on prior work by introducing a handwritten test suite -- Multi-Order Theory of Mind Q&A -- and using it to compare the performance of five LLMs to a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks overall, and that GPT-4 exceeds adult performance on 6th order inferences. Our results suggest that there is an interplay between model size and finetuning for the realisation of ToM abilities, and that the best-performing LLMs have developed a generalised capacity for ToM. Given the role that higher-order ToM plays in a wide range of cooperative and competitive human behaviours, these findings have significant implications for user-facing LLM applications.
Artificial Intelligence, Computation and Language, Human-Computer Interaction
What problem does this paper attempt to address?
This paper investigates how far large language models (LLMs) have developed higher-order theory of mind (ToM), the human ability to reason recursively about one's own and others' mental states. It introduces a new handwritten test suite, Multi-Order Theory of Mind Question & Answer (MoToMQA), and uses it to compare five LLMs, including GPT-4 and Flan-PaLM, against a newly collected adult human benchmark. The results show that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks overall, and that GPT-4 exceeds adult performance on sixth-order inferences. The paper also analyzes how model size and fine-tuning interact in the realisation of ToM abilities, and compares LLM performance on ToM tasks with performance on factual tasks. It finds that LLMs generally perform better on factual tasks than on ToM tasks, and reports an "anchoring effect" in which answer order may influence a model's response. Because higher-order ToM plays a role in a wide range of cooperative and competitive human behaviours, these findings have significant implications for user-facing LLM applications.