Abstract: Large language models (LLMs) excel in many tasks in 2023, but they still face challenges in complex reasoning. Theory-of-mind (ToM) tasks, which require understanding agents' beliefs, goals, and mental states, are essential for common-sense reasoning involving humans, making it crucial to enhance LLM performance in this area. This study measures the ToM performance of GPT-4 and three GPT-3.5 variants (Davinci-2, Davinci-3, GPT-3.5-Turbo), and investigates the effectiveness of in-context learning in improving their ToM comprehension. We evaluated prompts featuring two-shot chain of thought reasoning and step-by-step thinking instructions. We found that LLMs trained with Reinforcement Learning from Human Feedback (RLHF) (all models excluding Davinci-2) improved their ToM accuracy via in-context learning. GPT-4 performed best in zero-shot settings, reaching nearly 80% ToM accuracy, but still fell short of the 87% human accuracy on the test set. However, when supplied with prompts for in-context learning, all RLHF-trained LLMs exceeded 80% ToM accuracy, with GPT-4 reaching 100%. These results demonstrate that appropriate prompting enhances LLM ToM reasoning, and they underscore the context-dependent nature of LLM cognitive capacities.

What problem does this paper attempt to address?

The problem this paper addresses is that large language models (LLMs) still struggle with tasks that require understanding an individual's mental state, known as theory-of-mind (ToM) tasks. These tasks require a model to understand the beliefs, goals, and mental states of agents, which is crucial for common-sense reasoning involving humans. The research therefore aims to evaluate and improve LLM performance on ToM tasks, in particular through appropriate prompting methods. Specifically, the paper approaches the problem in four ways:

1. **Evaluating the ToM performance of different LLMs**: The researchers selected GPT-4 and three GPT-3.5 variants (Davinci-2, Davinci-3, GPT-3.5-Turbo) and tested their performance on ToM tasks in a zero-shot setting.
2. **Exploring the effectiveness of in-context learning**: The researchers designed different prompting methods, including two-shot chain-of-thought (CoT) reasoning and step-by-step (SS) thinking instructions, to evaluate whether these methods improve the ToM comprehension of LLMs.
3. **Analyzing the effects of prompting methods**: Using the experimental data, the researchers analyzed the impact of the different prompting methods on ToM performance and explored why these methods are effective.
4. **Comparing the performance of humans and models**: The research also measured human performance on the same tasks (87% accuracy on the test set) as a reference point for LLM ToM accuracy.

Together, these steps aim to show how appropriate prompting strategies can improve LLM performance on complex reasoning tasks, especially those that involve understanding an individual's mental state. This helps improve the application of LLMs to such tasks and also provides insight into the context-dependent nature of LLM cognitive capacities.
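To make the prompting methods above concrete, here is a minimal Python sketch of how zero-shot, step-by-step, and two-shot chain-of-thought prompts might be assembled. It is an illustration under stated assumptions, not the authors' code: the `Scenario:`/`Question:`/`Answer:` layout, the wording of `STEP_BY_STEP`, the two demonstrations in `DEMOS`, and the helper `build_prompt` are all hypothetical and were written for this sketch, since the summary does not reproduce the paper's actual prompts.

```python
from dataclasses import dataclass

# Assumed wording: the summary only says "step-by-step thinking instructions".
STEP_BY_STEP = "Let's think step by step."


@dataclass(frozen=True)
class WorkedExample:
    """One in-context demonstration with its reasoning spelled out."""
    scenario: str
    question: str
    reasoning: str
    answer: str


# Hypothetical demonstrations invented for this sketch, not taken from the paper.
DEMOS = (
    WorkedExample(
        scenario="Anna puts her keys in the drawer and leaves. "
                 "Ben moves the keys to the shelf while she is away.",
        question="Where will Anna look for her keys?",
        reasoning="Anna did not see Ben move the keys, so she still "
                  "believes they are in the drawer.",
        answer="In the drawer.",
    ),
    WorkedExample(
        scenario="Tom orders tea. The waiter brings coffee in an opaque mug.",
        question="What does Tom think is in the mug before he tastes it?",
        reasoning="Tom ordered tea and cannot see inside the mug, so he "
                  "has no reason to doubt his order.",
        answer="Tea.",
    ),
)


def build_prompt(scenario, question, *, demos=(), step_by_step=False):
    """Assemble a prompt: optional worked examples, then the test item."""
    parts = [
        f"Scenario: {d.scenario}\nQuestion: {d.question}\n"
        f"Answer: {d.reasoning} So the answer is: {d.answer}"
        for d in demos
    ]
    # The step-by-step instruction is appended after the final "Answer:" cue.
    suffix = f" {STEP_BY_STEP}" if step_by_step else ""
    parts.append(f"Scenario: {scenario}\nQuestion: {question}\nAnswer:{suffix}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    item = ("Mia hides a coin under a cup and leaves. Leo takes the coin.",
            "Where does Mia think the coin is?")
    print(build_prompt(*item))                                  # zero-shot
    print(build_prompt(*item, step_by_step=True))               # step-by-step
    print(build_prompt(*item, demos=DEMOS, step_by_step=True))  # two-shot CoT + SS
```

Keeping the test item identical across conditions and varying only the prefix and suffix means any accuracy difference can be attributed to the prompt rather than to the question itself.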

Boosting Theory-of-Mind Performance in Large Language Models via Prompting

How FaR Are Large Language Models From Agents with Theory-of-Mind?

Evaluating Large Language Models in Theory of Mind Tasks

Metacognitive Prompting Improves Understanding in Large Language Models

Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models

PHAnToM: Persona-based Prompting Has An Effect on Theory-of-Mind Reasoning in Large Language Models

Think Twice: Perspective-Taking Improves Large Language Models' Theory-of-Mind Capabilities

Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests

Progressive-Hint Prompting Improves Reasoning in Large Language Models

Stress Testing Chain-of-Thought Prompting for Large Language Models

An automatically discovered chain-of-thought prompt generalizes to novel models and datasets

Constrained Reasoning Chains for Enhancing Theory-of-Mind in Large Language Models

R$^3$ Prompting: Review, Rephrase and Resolve for Chain-of-Thought Reasoning in Large Language Models under Noisy Context

LLMs achieve adult human performance on higher-order theory of mind tasks

GPT-3.5, GPT-4, or BARD? Evaluating LLMs Reasoning Ability in Zero-Shot Setting and Performance Boosting Through Prompts

Large Language Models are Contrastive Reasoners

Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models

ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind

Think Beyond Size: Adaptive Prompting for More Effective Reasoning

Instances Need More Care: Rewriting Prompts for Instances with LLMs in the Loop Yields Better Zero-Shot Performance

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models