Evaluating and Modeling Social Intelligence: A Comparative Study of Human and AI Capabilities

Junqi Wang, Chunhui Zhang, Jiapeng Li, Yuxi Ma, Lixing Niu, Jiaheng Han, Yujia Peng, Yixin Zhu, Lifeng Fan
2024-05-20
Abstract: Facing the current debate on whether Large Language Models (LLMs) attain near-human intelligence levels (Mitchell & Krakauer, 2023; Bubeck et al., 2023; Kosinski, 2023; Shiffrin & Mitchell, 2023; Ullman, 2023), the current study introduces a benchmark for evaluating social intelligence, one of the most distinctive aspects of human cognition. We developed a comprehensive theoretical framework for social dynamics and introduced two evaluation tasks: Inverse Reasoning (IR) and Inverse Inverse Planning (IIP). Our approach also encompassed a computational model based on recursive Bayesian inference, adept at elucidating diverse human behavioral patterns. Extensive experiments and detailed analyses revealed that humans surpassed the latest GPT models in overall performance, zero-shot learning, one-shot generalization, and adaptability to multi-modalities. Notably, GPT models demonstrated social intelligence only at the most basic order (order = 0), in stark contrast to human social intelligence (order >= 2). Further examination indicated a propensity of LLMs to rely on pattern recognition for shortcuts, casting doubt on their possession of authentic human-level social intelligence. Our codes, dataset, appendix and human data are released at
Artificial Intelligence
What problem does this paper attempt to address?
This paper asks whether large language models (LLMs) can reach human-level social intelligence. The researchers developed a benchmark framework for evaluating social intelligence, a distinctive and important aspect of human cognition. By introducing two evaluation tasks, Inverse Reasoning (IR) and Inverse Inverse Planning (IIP), together with a computational model based on recursive Bayesian inference, the paper systematically compares how humans and AI perform on social intelligence.

### Main Problems

1. **Evaluating Social Intelligence**: In the current debate over whether LLMs approach human-level intelligence, social intelligence is a crucial but underexplored area. The paper aims to fill this gap by designing specific tasks and models.
2. **Comparing Humans and AI**: Through experiments and analysis, the paper sets out to identify where humans and AI differ in social intelligence, particularly in zero-shot learning, one-shot generalization, and multimodal adaptation.
3. **Understanding the Limitations of AI**: The research found that although LLMs perform well on some tasks, their performance remains limited on more complex social intelligence tasks, where they lack higher-order social cognitive abilities.

### Specific Tasks

1. **Inverse Reasoning (IR)**:
   - **Task Description**: On a 5x5 grid campus, agent A needs to find its most preferred food truck. Observer B infers A's preference order by analyzing A's action trajectory.
   - **Task Types**:
     - **Intermediate**: The agent stops without exploring all trucks, and the selected truck is considered its most preferred.
     - **Last**: The agent visits all trucks and then selects the last-seen truck.
     - **Previsited**: After viewing all trucks, the agent returns to a previously visited truck, indicating its preference order.
2. **Inverse Inverse Planning (IIP)**:
   - **Task Description**: On a 5x5 grid campus, agent A knows the locations of two restaurants and wishes to show observer B which one it prefers. A needs to choose a path that conveys this preference effectively.
   - **Task Types**:
     - **Type I**: A circular path, revisiting one location and passing through the other restaurant.
     - **Type II**: A circular path that does not pass through the other restaurant.
     - **Type III**: A non-circular path passing through the other restaurant.
     - **Type IV**: A non-circular path that avoids passing through the other restaurant.

### Computational Model

The paper proposes a computational framework based on recursive Bayesian inference that models the IR and IIP tasks in a unified way. The framework simulates different levels of social interaction by recursively reasoning about the mental states of agents and observers.

### Experimental Results

- **Human Participants**: 75 participants completed the IR and IIP tasks. Humans significantly outperformed LLMs in overall performance, zero-shot learning, one-shot generalization, and multimodal adaptation.
- **LLMs**: GPT-3.5-Turbo and GPT-4 performed poorly on these tasks, especially on unseen scenarios and higher-order social cognitive tasks.

### Conclusion

Through systematic evaluation and analysis, the paper shows significant differences between humans and AI in social intelligence, especially the limitations of LLMs on higher-order social cognitive tasks. This provides a reference point and direction for future research on artificial social intelligence (ASI).
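
To make the recursive Bayesian idea behind the computational model more concrete, here is a minimal Python sketch of an agent and an observer reasoning about each other at increasing orders in an IIP-like setting. It is an illustration only, not the paper's model: the path names, the `BASE` likelihood numbers, the `alpha` parameter, and the way orders are indexed are all assumptions made for this example.

```python
# Toy recursive Bayesian inference for an IIP-like setting.
# Everything below (path names, probabilities, alpha) is hypothetical.

PREFS = ["R1", "R2"]  # which restaurant agent A prefers
PATHS = ["direct_to_R1", "pass_R2_then_R1", "direct_to_R2", "pass_R1_then_R2"]

# Assumed order-0 agent: P(path | pref) when A plans without
# thinking about the observer at all.
BASE = {
    "R1": {"direct_to_R1": 0.60, "pass_R2_then_R1": 0.20,
           "direct_to_R2": 0.18, "pass_R1_then_R2": 0.02},
    "R2": {"direct_to_R2": 0.60, "pass_R1_then_R2": 0.20,
           "direct_to_R1": 0.18, "pass_R2_then_R1": 0.02},
}
PRIOR = {"R1": 0.5, "R2": 0.5}  # observer's prior over preferences


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def agent(order, alpha=2.0):
    """Return P(path | pref) for an agent of the given order.

    An order-k agent (k >= 1) reweights its base plan by how strongly
    an order-(k-1) observer would infer the true preference from it.
    """
    if order == 0:
        return BASE
    obs = observer(order - 1, alpha)
    return {
        pref: normalize(
            {p: BASE[pref][p] * obs[p][pref] ** alpha for p in PATHS}
        )
        for pref in PREFS
    }


def observer(order, alpha=2.0):
    """Return P(pref | path) by Bayes' rule over an agent of the same order."""
    ag = agent(order, alpha)
    return {
        p: normalize({pref: ag[pref][p] * PRIOR[pref] for pref in PREFS})
        for p in PATHS
    }


if __name__ == "__main__":
    for k in range(3):
        belief = observer(k)["direct_to_R1"]["R1"]
        print(f"order {k}: P(prefers R1 | direct_to_R1) = {belief:.3f}")
```

With these made-up numbers, the observer's belief that A prefers R1 after seeing `direct_to_R1` grows as the order increases, and the order-1 agent puts more weight on the more informative `pass_R2_then_R1` path than the order-0 agent does. Both effects follow from the chosen numbers rather than from any result reported in the paper.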