Testing theory of mind in large language models and humans

James W. A. Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, Krati Saxena, Alessandro Rufo, Stefano Panzeri, Guido Manzi, Michael S. A. Graziano, Cristina Becchio

DOI: https://doi.org/10.1038/s41562-024-01882-z

IF: 24.252

2024-05-21

Nature Human Behaviour

Abstract: At the core of what defines us as humans is the concept of theory of mind: the ability to track other people's mental states. The recent development of large language models (LLMs) such as ChatGPT has led to intense debate about the possibility that these models exhibit behaviour that is indistinguishable from human behaviour in theory of mind tasks. Here we compare human and LLM performance on a comprehensive battery of measurements that aim to measure different theory of mind abilities, from understanding false beliefs to interpreting indirect requests and recognizing irony and faux pas. We tested two families of LLMs (GPT and LLaMA2) repeatedly against these measures and compared their performance with those from a sample of 1,907 human participants. Across the battery of theory of mind tests, we found that GPT-4 models performed at, or even sometimes above, human levels at identifying indirect requests, false beliefs and misdirection, but struggled with detecting faux pas. Faux pas, however, was the only test where LLaMA2 outperformed humans. Follow-up manipulations of the belief likelihood revealed that the superiority of LLaMA2 was illusory, possibly reflecting a bias towards attributing ignorance. By contrast, the poor performance of GPT originated from a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference. These findings not only demonstrate that LLMs exhibit behaviour that is consistent with the outputs of mentalistic inference in humans but also highlight the importance of systematic testing to ensure a non-superficial comparison between human and artificial intelligences.

psychology, experimental; neurosciences; multidisciplinary sciences

What problem does this paper attempt to address?

This paper evaluates whether large language models (LLMs) match or surpass human performance on "Theory of Mind" (ToM) tasks. The researchers assembled a battery of tests that measure different Theory of Mind abilities, including understanding false beliefs, interpreting indirect requests, and recognizing irony and faux pas (social gaffes). With these tests, they aim to establish:

1. **Whether LLMs can exhibit Theory of Mind abilities similar to those of humans**: The researchers compared the performance of two major LLM families (GPT and LLaMA2) on multiple Theory of Mind tasks and contrasted it with the scores of human participants.
2. **Whether there are specific patterns or limitations in the performance of LLMs**: LLMs may perform better or worse on certain tasks, which helps to reveal their capabilities and limitations when handling social-cognitive tasks.
3. **Whether the performance of LLMs is affected by the test format**: To rule out the possibility that LLMs had simply memorized the training data, the researchers used new test items with the same logic as the original tests but different semantic content.
4. **The specific mechanisms of LLMs when handling complex social situations**: By analyzing the performance of LLMs across tasks, the researchers hope to uncover how these models carry out social reasoning, and whether they rely on shallow heuristics or deeper cognitive processes.

Overall, the paper aims to systematically evaluate LLM performance on Theory of Mind tasks in order to better understand the social-cognitive abilities of these models and to explore where they resemble and where they differ from human cognition.
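The comparison in point 1, repeated model runs set against a human sample on a single test, can be sketched as a two-sample permutation test. This is only an illustration: the text above does not say which statistical procedure the authors used, so the permutation test is a stand-in, and the function name `permutation_test` and all scores below are hypothetical values invented for the example.

```python
import random
from statistics import mean


def permutation_test(model_scores, human_scores, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in mean scores.

    Returns (observed_difference, estimated_p_value).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(model_scores) - mean(human_scores)
    pooled = list(model_scores) + list(human_scores)
    n_model = len(model_scores)

    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled scores to the two groups.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_model]) - mean(pooled[n_model:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical proportion-correct scores on one test (NOT data from the paper).
model_sessions = [0.90, 1.00, 0.95, 0.85, 1.00, 0.90]
human_participants = [0.80, 0.90, 0.85, 0.95, 0.75, 0.90, 0.85, 0.80]

difference, p = permutation_test(model_sessions, human_participants)
print(f"mean difference = {difference:.3f}, p = {p:.3f}")
```

A permutation test is used here because it makes no distributional assumption about bounded test scores; with a battery of several tests, the resulting p-values would also need a multiple-comparison correction.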


Evaluating Large Language Models in Theory of Mind Tasks

Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests

LLMs achieve adult human performance on higher-order theory of mind tasks

Theory of Mind abilities of Large Language Models in Human-Robot Interaction: An Illusion?

Probing the Robustness of Theory of Mind in Large Language Models

Challenging large language models' "intelligence" with human tools: A neuropsychological investigation in Italian language on prefrontal functioning

Thinking Fast and Slow in Large Language Models

How FaR Are Large Language Models From Agents with Theory-of-Mind?

Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs

MindGames: Targeting Theory of Mind in Large Language Models with Dynamic Epistemic Modal Logic

ToMBench: Benchmarking Theory of Mind in Large Language Models

Large Language Models and the Reverse Turing Test

Theory of Mind May Have Spontaneously Emerged in Large Language Models

GPT-4o reads the mind in the eyes

Cognitive Effects in Large Language Models

Language models and psychological sciences

Do Large Language Models Know What Humans Know?

Minding Language Models' (Lack of) Theory of Mind: A Plug-and-Play Multi-Character Belief Tracker

Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies

LLM Cognitive Judgements Differ From Human