Boosting Theory-of-Mind Performance in Large Language Models via Prompting

Shima Rahimi Moghaddam, Christopher J. Honey
2023-04-26
Abstract: Large language models (LLMs) excel in many tasks in 2023, but they still face challenges in complex reasoning. Theory-of-mind (ToM) tasks, which require understanding agents' beliefs, goals, and mental states, are essential for common-sense reasoning involving humans, making it crucial to enhance LLM performance in this area. This study measures the ToM performance of GPT-4 and three GPT-3.5 variants (Davinci-2, Davinci-3, GPT-3.5-Turbo), and investigates the effectiveness of in-context learning in improving their ToM comprehension. We evaluated prompts featuring two-shot chain of thought reasoning and step-by-step thinking instructions. We found that LLMs trained with Reinforcement Learning from Human Feedback (RLHF) (all models excluding Davinci-2) improved their ToM accuracy via in-context learning. GPT-4 performed best in zero-shot settings, reaching nearly 80% ToM accuracy, but still fell short of the 87% human accuracy on the test set. However, when supplied with prompts for in-context learning, all RLHF-trained LLMs exceeded 80% ToM accuracy, with GPT-4 reaching 100%. These results demonstrate that appropriate prompting enhances LLM ToM reasoning, and they underscore the context-dependent nature of LLM cognitive capacities.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses the weak performance of large language models (LLMs) on theory-of-mind (ToM) tasks, which require a model to understand the beliefs, goals, and mental states of agents. Because this kind of understanding is central to common-sense reasoning about humans, the study sets out to measure LLM performance on ToM tasks and to improve it through appropriate prompting. It approaches the problem in four ways:

1. **Evaluating the ToM performance of different LLMs**: The study tests GPT-4 and three GPT-3.5 variants (Davinci-2, Davinci-3, GPT-3.5-Turbo) on ToM tasks in a zero-shot setting.
2. **Exploring the effectiveness of in-context learning**: The researchers evaluate prompts that use two-shot chain-of-thought (CoT) reasoning and step-by-step (SS) thinking instructions, to see whether these improve the models' ToM comprehension.
3. **Analyzing the effects of prompting methods**: Using the experimental data, the researchers analyze how each prompting method affects ToM performance and explore why the methods are effective.
4. **Comparing the performance of humans and models**: The study also measures human accuracy on the same tasks (87% on the test set) as a reference point for the models' results.

Taken together, the paper aims to show how appropriate prompting strategies can improve LLM performance on complex reasoning tasks, especially those that involve understanding another agent's mental state. The results are relevant to applying LLMs to such tasks and to further study of their cognitive capacities.
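The prompting conditions named above (zero-shot, a step-by-step instruction, and two-shot chain of thought) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' code: the `Scenario:/Q:/A:` template, the exact wording of the step-by-step instruction, and the names `Example`, `build_prompt`, and `accuracy` are all hypothetical, and no model API is called.

```python
from dataclasses import dataclass
from typing import Sequence

# Assumed wording of the step-by-step instruction; the summary above does not quote it.
STEP_BY_STEP = "Let's think step by step."


@dataclass
class Example:
    """A worked in-context example (hypothetical structure)."""
    scenario: str
    question: str
    reasoning: str  # the chain of thought shown to the model
    answer: str


def build_prompt(
    scenario: str,
    question: str,
    examples: Sequence[Example] = (),
    step_by_step: bool = False,
) -> str:
    """Assemble a prompt for one test item.

    examples=()  and step_by_step=False -> zero-shot
    examples=()  and step_by_step=True  -> zero-shot + step-by-step instruction
    two examples and step_by_step=False -> two-shot chain of thought
    two examples and step_by_step=True  -> both combined
    """
    parts = []
    for ex in examples:
        # Each in-context example shows its reasoning before the final answer.
        parts.append(
            f"Scenario: {ex.scenario}\nQ: {ex.question}\n"
            f"A: {ex.reasoning} So the answer is: {ex.answer}"
        )
    suffix = f" {STEP_BY_STEP}" if step_by_step else ""
    parts.append(f"Scenario: {scenario}\nQ: {question}\nA:{suffix}")
    return "\n\n".join(parts)


def accuracy(correct_flags: Sequence[bool]) -> float:
    """Fraction of test items judged correct."""
    if not correct_flags:
        raise ValueError("no items to score")
    return sum(correct_flags) / len(correct_flags)
```

A call such as `build_prompt(scenario, question, examples=two_examples, step_by_step=True)` would produce the combined condition; the resulting string would then be sent to whichever model is being evaluated, and the per-item correctness judgments passed to `accuracy` for comparison against the 87% human figure.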