Theory of Mind abilities of Large Language Models in Human-Robot Interaction : An Illusion?

Mudit Verma, Siddhant Bhambri, Subbarao Kambhampati
DOI: https://doi.org/10.1145/3610978.3640767
2024-01-18
Abstract: Large Language Models have shown exceptional generative abilities in various natural language and generation tasks. However, possible anthropomorphization and leniency towards failure cases have propelled discussions on emergent abilities of Large Language Models, especially Theory of Mind (ToM) abilities. While several false-belief tests exist to verify the ability to infer and maintain mental models of another entity, we study a special application of ToM abilities that has higher stakes and possibly irreversible consequences: Human-Robot Interaction. In this work, we explore the task of Perceived Behavior Recognition, where a robot employs a Large Language Model (LLM) to assess the robot's generated behavior in a manner similar to a human observer. We focus on four behavior types, namely explicable, legible, predictable, and obfuscatory behavior, which have been extensively used to synthesize interpretable robot behaviors. The LLM's goal is, therefore, to be a human proxy to the agent, and to answer how a certain agent behavior would be perceived by the human in the loop, for example "Given a robot's behavior X, would the human observer find it explicable?". We conduct a human subject study to verify that the users are able to correctly answer such a question in the curated situations (robot setting and plan) across five domains. A first analysis of the belief test yields extremely positive results, inflating one's expectations of LLMs possessing ToM abilities. We then propose and perform a suite of perturbation tests that break this illusion, i.e., Inconsistent Belief, Uninformative Context, and Conviction Test. We conclude that the high score of LLMs on vanilla prompts showcases their potential use in HRI settings; however, possessing ToM demands invariance to trivial or irrelevant perturbations in the context, which LLMs lack.
Robotics, Artificial Intelligence, Human-Computer Interaction
What problem does this paper attempt to address?
This paper evaluates whether large language models (LLMs) have theory-of-mind (ToM) ability in human-robot interaction (HRI), in particular whether they can infer human mental states and, based on those states, judge whether a robot's behavior meets human expectations. Specifically, the paper tests LLMs on the Perceived Behavior Recognition task (PROBE) through a series of experiments to explore whether these models can emulate human ToM reasoning in human-robot interaction scenarios.

### Main research questions:

1. **How do human participants perform in ToM reasoning tasks?**
   - Examine how human participants perceive and judge robot behavior in different human-robot interaction scenarios.
2. **How does the performance of LLMs align with that of human participants and the performance in an ideal situation?**
   - Compare the performance of LLMs on the same tasks with that of human participants and with ideal performance, and evaluate the ToM ability of LLMs.
3. **How robust are LLMs in the perceived mental-model reasoning task?**
   - Test the stability of LLMs on ToM tasks by introducing different perturbation strategies (such as uninformative context and inconsistent beliefs).

### Research methods:

1. **Construct ToM tasks**:
   - Design five different domains (Fetch Robot, Passage Gridworld, Environment Design, Urban Search and Rescue, Package Delivery), each containing four behavior types (explicable, legible, predictable, and obfuscatory).
   - Construct 20 different scenarios, each describing a specific robot behavior and the perspective of a human observer.
2. **Human participant research**:
   - Recruit 120 participants through an online survey platform, and use multiple-choice questions and Likert-scale questionnaires to evaluate their perception and judgment of robot behavior.
3. **Robustness test of LLMs**:
   - Design two perturbation strategies (uninformative context and inconsistent belief) and a conviction test to evaluate the performance of LLMs under different conditions.

### Research contributions:

1. **Systematically studies, for the first time, the ToM ability of LLMs in human-robot interaction**.
2. **Constructs rich test scenarios**, covering multiple behavior types in multiple domains.
3. **Conducts a large-scale human participant study**, validating the test scenarios.
4. **Proposes new test methods**, revealing the vulnerability of LLMs in ToM tasks.

### Conclusion:

Preliminary results show that LLMs perform well on the standard (unperturbed) tasks, but their performance drops significantly once perturbations are introduced. This indicates that the ToM ability of LLMs may be an "illusion" and that it lacks robustness in complex real-world scenarios. The finding matters for any future use of LLMs as human proxies in human-robot interaction.
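
The scenario grid and the invariance requirement described under the research methods can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the prompt template, the `add_uninformative_context` perturbation, and the two toy "models" are hypothetical stand-ins, and only the domain and behavior names come from the text above.

```python
from itertools import product
from typing import Callable, Iterable

# Domains and behavior types as named in the summary and abstract.
DOMAINS = [
    "Fetch Robot",
    "Passage Gridworld",
    "Environment Design",
    "Urban Search and Rescue",
    "Package Delivery",
]
BEHAVIORS = ["explicable", "legible", "predictable", "obfuscatory"]

# One scenario per (domain, behavior type) pair: 5 x 4 = 20.
SCENARIOS = list(product(DOMAINS, BEHAVIORS))


def build_prompt(domain: str, behavior: str) -> str:
    """Hypothetical prompt template; not the wording used in the paper."""
    return (
        f"Domain: {domain}. Given the robot's behavior, "
        f"would the human observer find it {behavior}? Answer yes or no."
    )


def add_uninformative_context(prompt: str) -> str:
    """Hypothetical perturbation: prepend a sentence irrelevant to the task."""
    return "The weather outside is sunny. " + prompt


def is_invariant(
    answer: Callable[[str], str],
    prompt: str,
    perturbations: Iterable[Callable[[str], str]],
) -> bool:
    """Return True if the answer is unchanged under every perturbation."""
    baseline = answer(prompt)
    return all(answer(perturb(prompt)) == baseline for perturb in perturbations)


def brittle_model(prompt: str) -> str:
    """Toy stand-in for an LLM call that is distracted by irrelevant context."""
    return "no" if "weather" in prompt else "yes"


def stable_model(prompt: str) -> str:
    """Toy stand-in that ignores irrelevant context."""
    return "yes"
```

Calling `is_invariant(brittle_model, build_prompt(*SCENARIOS[0]), [add_uninformative_context])` returns `False`, while the same call with `stable_model` returns `True`. In a real evaluation the toy functions would be replaced by calls to an actual LLM, and the perturbations by the paper's own test prompts.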