Abstract: Facing the current debate on whether Large Language Models (LLMs) attain near-human intelligence levels (Mitchell & Krakauer, 2023; Bubeck et al., 2023; Kosinski, 2023; Shiffrin & Mitchell, 2023; Ullman, 2023), the current study introduces a benchmark for evaluating social intelligence, one of the most distinctive aspects of human cognition. We developed a comprehensive theoretical framework for social dynamics and introduced two evaluation tasks: Inverse Reasoning (IR) and Inverse Inverse Planning (IIP). Our approach also encompassed a computational model based on recursive Bayesian inference, adept at elucidating diverse human behavioral patterns. Extensive experiments and detailed analyses revealed that humans surpassed the latest GPT models in overall performance, zero-shot learning, one-shot generalization, and adaptability to multi-modalities. Notably, GPT models demonstrated social intelligence only at the most basic order (order = 0), in stark contrast to human social intelligence (order >= 2). Further examination indicated a propensity of LLMs to rely on pattern recognition for shortcuts, casting doubt on their possession of authentic human-level social intelligence. Our codes, dataset, appendix and human data are released at

What problem does this paper attempt to address?

This paper asks whether large language models (LLMs) can reach human-level social intelligence. The researchers build a benchmark for social intelligence, a distinctive and important aspect of human cognition. Using two evaluation tasks, Inverse Reasoning (IR) and Inverse Inverse Planning (IIP), together with a computational model based on recursive Bayesian inference, the paper systematically compares how humans and AI perform on social intelligence.

### Main Problems

1. **Evaluating Social Intelligence**: In the current debate over whether LLMs approach human-level intelligence, social intelligence is a crucial but not yet fully explored area. The paper aims to fill this gap by designing specific tasks and models.
2. **Comparing Humans and AI**: Through experiments and analysis, the paper identifies specific differences between humans and AI in social intelligence, particularly in zero-shot learning, one-shot generalization, and multimodal adaptation.
3. **Understanding the Limitations of AI**: Although LLMs perform well on some tasks, their performance remains limited on more complex social intelligence tasks, and they lack higher-order social cognitive abilities.

### Specific Tasks

1. **Inverse Reasoning (IR)**:
   - **Task Description**: On a 5x5 grid campus, agent A searches for its most preferred food truck. Observer B infers A's preference order from A's action trajectory.
   - **Task Types**:
     - **Intermediate**: The agent stops without exploring all trucks, and the selected truck is taken to be its most preferred.
     - **Last**: The agent visits all trucks and then selects the last-seen truck.
     - **Previsited**: After viewing all trucks, the agent returns to a previously visited truck, indicating its preference order.
2. **Inverse Inverse Planning (IIP)**:
   - **Task Description**: On a 5x5 grid campus, agent A knows the locations of two restaurants and wants to show observer B which one it prefers. A must choose a path that conveys this preference effectively.
   - **Task Types**:
     - **Type I**: A circular path that revisits one location and passes through the other restaurant.
     - **Type II**: A circular path that does not pass through the other restaurant.
     - **Type III**: A non-circular path that passes through the other restaurant.
     - **Type IV**: A non-circular path that avoids the other restaurant.

### Computational Model

The paper proposes a computational framework based on recursive Bayesian inference that models the IR and IIP tasks in a unified way. The framework simulates different orders of social interaction by recursively reasoning about the mental states of agents and observers.

### Experimental Results

- **Human Participants**: 75 participants completed the IR and IIP tasks. Humans significantly outperformed LLMs in overall performance, zero-shot learning, one-shot generalization, and multimodal adaptation.
- **LLMs**: GPT-3.5-Turbo and GPT-4 performed poorly on these tasks, especially on unseen scenarios and higher-order social cognitive tasks.

### Conclusion

Through systematic evaluation and analysis, the paper shows significant differences between humans and AI in social intelligence, especially the limitations of LLMs on higher-order social cognitive tasks. This provides a reference point and direction for future research on artificial social intelligence (ASI).
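The recursive reasoning described under Computational Model can be illustrated with a small toy sketch. This is not the paper's model: the restaurant positions, the start cell, the three available moves, the Boltzmann-rational (softmax) policy, the `BETA` temperature, and the function names `agent` and `observer` are all assumptions made for illustration. The sketch treats an order-0 agent as a plain planner that moves toward its goal, an order-0 observer as inverting that planner with Bayes' rule (the IR setting), and an order-1 agent as additionally preferring moves that make its goal obvious to the order-0 observer (the IIP setting).

```python
import math

# Hypothetical layout on a 5x5 grid: two restaurants and a start cell (assumed).
GOALS = {"R0": (0, 4), "R1": (4, 4)}
START = (2, 0)
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1)}
BETA = 1.0  # assumed softmax rationality parameter


def manhattan(a, b):
    """Grid distance between two cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def normalize(scores):
    """Turn non-negative scores into a probability distribution."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def agent(order, goal):
    """P(first move | goal) for an agent of the given reasoning order."""
    scores = {}
    for name, (dx, dy) in MOVES.items():
        nxt = (START[0] + dx, START[1] + dy)
        # Order 0: only the remaining distance to the goal matters.
        utility = -manhattan(nxt, GOALS[goal])
        if order > 0:
            # Higher orders: bonus for moves that a lower-order observer
            # would read as evidence for the true goal.
            utility += math.log(observer(order - 1, name)[goal])
        scores[name] = math.exp(BETA * utility)
    return normalize(scores)


def observer(order, move):
    """P(goal | first move), assuming an agent of the same order and a uniform prior."""
    prior = 1.0 / len(GOALS)
    return normalize({g: agent(order, g)[move] * prior for g in GOALS})


if __name__ == "__main__":
    for order in (0, 1):
        print("order", order, "agent heading to R0:", agent(order, "R0"))
        print("order", order, "observer seeing 'left':", observer(order, "left"))
```

In this toy setup, moving `up` is equally consistent with both restaurants, so it tells the observer nothing, while moving `left` is efficient for `R0` and also informative. The order-1 agent therefore shifts probability toward `left`, and an observer that expects such deliberate signaling becomes more confident after seeing it. The recursion always terminates because each call lowers `order` until it reaches the order-0 planner.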

Evaluating and Modeling Social Intelligence: A Comparative Study of Human and AI Capabilities
