Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

Zhawnen Chen, Tianchun Wang, Yizhou Wang, Michal Kosinski, Xiang Zhang, Yun Fu, Sheng Li
2024-06-20
Abstract: Can large multimodal models have a human-like ability for emotional and social reasoning, and if so, how does it work? Recent research has discovered emergent theory-of-mind (ToM) reasoning capabilities in large language models (LLMs). LLMs can reason about people's mental states by solving various text-based ToM tasks that ask questions about the actors' ToM (e.g., human belief, desire, intention). However, human reasoning in the wild is often grounded in dynamic scenes across time. Thus, we consider videos a new medium for examining spatio-temporal ToM reasoning ability. Specifically, we ask explicit probing questions about videos with abundant social and emotional reasoning content. We develop a pipeline for multimodal LLM for ToM reasoning using video and text. We also enable explicit ToM reasoning by retrieving key frames for answering a ToM question, which reveals how multimodal LLMs reason about ToM.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper examines, and aims to improve, the Theory of Mind (ToM) reasoning ability of multimodal large language models (LLMs) in dynamic scenes. Existing research on the ToM capabilities of LLMs focuses mainly on text-based tasks, whereas human ToM reasoning is usually grounded in dynamic scenes that unfold over time. The paper therefore uses video as a medium for examining and enhancing spatio-temporal ToM reasoning.

### Main Research Questions:

1. **ToM Reasoning Ability in Videos**: Investigate whether large multimodal models can possess human-like emotional and social reasoning abilities, and if so, how these abilities work.
2. **Video Frame Retrieval Task**: Design a video frame retrieval task that locates the video segments relevant to a ToM question, to better understand the mental states of the characters in the video.
3. **Model Architecture Innovation**: Propose a new architecture, Video Theory of Mind (VToM), to model how mental states evolve over the course of a video.

### Research Background:

- **Text ToM Reasoning**: Existing research has found that large language models exhibit some ToM reasoning ability on text tasks, but these tasks usually lack a time dimension.
- **Necessity of Video ToM Reasoning**: Human ToM reasoning is typically grounded in dynamic scenes, so video is a more realistic medium for assessing and enhancing ToM abilities.

### Methods and Experiments:

- **ToM Localization (ToMLoc) Task**: A new task that retrieves the frames of a video most relevant to answering a ToM-related question.
- **Model Architecture**: The proposed VToM architecture combines textual and visual features to model changes in the mental states of characters in videos.
- **Experimental Setup**: The model is fine-tuned and evaluated on the Social-IQ 2.0 dataset, with performance measured on frame localization and question answering.
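
To make the frame retrieval idea concrete, here is a minimal sketch of one common way to pick question-relevant frames: rank frames by cosine similarity between a question embedding and per-frame embeddings, then return the top matches in temporal order. This is an illustration under stated assumptions, not the paper's ToMLoc method or the VToM architecture. The function names (`cosine_similarity`, `retrieve_key_frames`) and the toy three-dimensional vectors are hypothetical; in practice the embeddings would come from some vision-language encoder, which is not specified here.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_key_frames(
    question_emb: Sequence[float],
    frame_embs: Sequence[Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (frame_index, score) for the top_k most similar frames.

    Frames are ranked by similarity to the question, then the selected
    frames are re-sorted by index so they stay in temporal order.
    """
    scored = [
        (index, cosine_similarity(question_emb, frame))
        for index, frame in enumerate(frame_embs)
    ]
    best = sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
    return sorted(best, key=lambda pair: pair[0])


# Hypothetical toy embeddings: one question vector and four frame vectors.
question = [1.0, 0.0, 0.0]
frames = [
    [0.0, 1.0, 0.0],  # frame 0: unrelated to the question
    [1.0, 0.0, 0.0],  # frame 1: closely matches the question
    [0.6, 0.8, 0.0],  # frame 2: partially matches
    [0.0, 0.0, 1.0],  # frame 3: unrelated
]

key_frames = retrieve_key_frames(question, frames, top_k=2)
print([index for index, _ in key_frames])
```

Returning the selected frames in temporal order is a design choice for this sketch: a downstream question-answering model that reasons about how mental states change would likely need the frames in the order they occur.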
### Experimental Results:

- **Frame Localization Task**: After fine-tuning with the ToMLoc dataset, the model's frame localization accuracy significantly improved.
- **Question-Answering Task**: The model's question-answering performance on the Social-IQ 2.0 dataset also improved, especially on ToM-related questions.

### Discussion and Future Work:

- **Importance of Fine-Grained Training Data**: High-quality, diverse video datasets are crucial for enhancing the ToM reasoning ability of the model.
- **Future Directions**: Create and curate richer datasets, and explore new model architectures to further improve performance and generalization.

Overall, this paper introduces an approach for evaluating and enhancing the ToM reasoning ability of large multimodal models in dynamic scenes by using video as a new medium, providing a foundation for future research.