Abstract: Can large multimodal models have a human-like ability for emotional and social reasoning, and if so, how does it work? Recent research has discovered emergent theory-of-mind (ToM) reasoning capabilities in large language models (LLMs). LLMs can reason about people's mental states by solving various text-based ToM tasks that ask questions about the actors' ToM (e.g., human belief, desire, intention). However, human reasoning in the wild is often grounded in dynamic scenes across time. Thus, we consider videos a new medium for examining spatio-temporal ToM reasoning ability. Specifically, we ask explicit probing questions about videos with abundant social and emotional reasoning content. We develop a pipeline for multimodal LLM for ToM reasoning using video and text. We also enable explicit ToM reasoning by retrieving key frames for answering a ToM question, which reveals how multimodal LLMs reason about ToM.

What problem does this paper attempt to address?

The core issue this paper attempts to address is improving the Theory of Mind (ToM) reasoning ability of large language models (LLMs) in dynamic scenes. Current research on the ToM capabilities of LLMs focuses mainly on text-based tasks, while human ToM reasoning is usually grounded in dynamic scenes that unfold over time. The paper therefore uses video as a medium to examine and enhance the ToM reasoning ability of LLMs along the time dimension.

### Main Research Questions:

1. **ToM Reasoning Ability in Videos**: Investigate whether large multimodal models can possess human-like emotional and social reasoning abilities, and if so, how these abilities work.
2. **Video Frame Retrieval Task**: Design a video frame retrieval task that locates the video segments relevant to ToM, to better understand the psychological states of the characters in the video.
3. **Model Architecture Innovation**: Propose a new architecture, Video Theory of Mind (VToM), to model the evolution of psychological states in videos.

### Research Background:

- **Text ToM Reasoning**: Existing research has found that large language models exhibit some ToM reasoning ability in text tasks, but these tasks usually lack a time dimension.
- **Necessity of Video ToM Reasoning**: Human ToM reasoning is typically based on dynamic scenes, so introducing video as a new medium is crucial for assessing and enhancing ToM abilities more realistically.

### Methods and Experiments:

- **ToM Localization (ToMLoc) Task**: A new task that retrieves the frames of a video most relevant to answering a ToM-related question.
- **Model Architecture**: The VToM architecture combines textual and visual features to model how the psychological states of characters change over the course of a video.
- **Experimental Setup**: The model is fine-tuned and evaluated on the Social-IQ 2.0 dataset, testing its performance on frame localization and question answering.

### Experimental Results:

- **Frame Localization Task**: After fine-tuning on the ToMLoc dataset, the model's frame localization accuracy improved significantly.
- **Question-Answering Task**: The model's performance also improved on the Social-IQ 2.0 question-answering task, especially on ToM-related questions.

### Discussion and Future Work:

- **Importance of Fine-Grained Training Data**: High-quality, diverse video datasets are crucial for enhancing the model's ToM reasoning ability.
- **Future Directions**: Create and curate richer datasets, and explore new model architectures to further improve performance and generalization.

Overall, this paper introduces an approach to evaluating and enhancing the ToM reasoning ability of large multimodal models in dynamic scenes by using video as a new medium, providing a foundation for future research.
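The frame retrieval idea behind the ToM localization task can be illustrated with a toy example. The sketch below is not the paper's VToM model or its ToMLoc pipeline; it is a minimal stand-in that assumes the question and each video frame have already been embedded into a shared vector space by some encoder (not shown), and it simply ranks frames by cosine similarity to the question. The function names `cosine_similarity` and `retrieve_key_frames`, the parameter `k`, and the toy vectors are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_key_frames(
    question_emb: Sequence[float],
    frame_embs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[int]:
    """Return indices of the k frames most similar to the question.

    Assumption: question and frame embeddings live in the same space.
    The indices are returned in temporal order so that a downstream
    model would see the selected frames as they occur in the video.
    """
    scores = [cosine_similarity(question_emb, f) for f in frame_embs]
    ranked = sorted(range(len(frame_embs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: four "frames" and one "question", all hypothetical 2-D embeddings.
toy_question = [1.0, 0.0]
toy_frames = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
selected = retrieve_key_frames(toy_question, toy_frames, k=2)
print(selected)
```

In a real system the retrieved frames, together with the question text, would be passed to a multimodal model for answering; inspecting which frames were selected is what makes the reasoning step explicit.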

Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models
