Abstract: Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants. The broader integration of LLMs into society has sparked interest in whether they manifest psychological attributes, and whether these attributes are stable, inquiries that could deepen the understanding of their behaviors. Inspired by psychometrics, this paper presents a framework for investigating psychology in LLMs, including psychological dimension identification, assessment dataset curation, and assessment with results validation. Following this framework, we introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence. This benchmark includes thirteen datasets featuring diverse scenarios and item types. Our findings indicate that LLMs manifest a broad spectrum of psychological attributes. We also uncover discrepancies between LLMs' self-reported traits and their behaviors in real-world scenarios. This paper demonstrates a thorough psychometric assessment of LLMs, providing insights into reliable evaluation and potential applications in AI and social sciences.

What problem does this paper attempt to address?

The paper asks whether large language models (LLMs) exhibit human-like psychological attributes, and how stable and reliably measurable those attributes are. Specifically, the researchers pose four key questions:

1. **Existence of Psychological Attributes**: Do LLMs exhibit human-like psychological attributes (such as personality, values, emotions, theory of mind, motivation, and intelligence), and can these attributes be quantified through a systematic evaluation framework?
2. **Stability of Attributes**: Are these psychological attributes stable across situations? For example, do LLMs exhibit consistent behavior patterns when facing similar situations?
3. **Consistency between Self-Report and Actual Behavior**: Do LLMs' self-reports differ from their behavior in real-life scenarios? That is, do they perform consistently on closed-ended and open-ended questions?
4. **Reliability of Evaluation Methods**: Are existing psychological measurement tools applicable to LLMs, and can they reliably evaluate LLMs' psychological attributes?

To answer these questions, the authors propose a comprehensive psychometric benchmark covering six psychological dimensions (personality, values, emotions, theory of mind, motivation, and intelligence) and evaluate models on 13 datasets. The framework aims to provide a systematic method for understanding the behavior patterns of LLMs, yielding insights for AI and the social sciences.

### Main Findings

- **Consistency**: LLMs behave consistently in tasks that require reasoning, but show significant variability on preference-type questions that have no clear answer.
- **Differences between Closed-ended and Open-ended Responses**: Some LLMs score low on a trait in closed-ended evaluations but show different traits in open-ended responses. For example, a model may have a low extraversion score in closed-ended evaluations yet show extroverted traits in open-ended responses.
- **Position Bias and Prompt Sensitivity**: Models differ in their sensitivity to option positions and prompt changes. Some models are particularly vulnerable to prompt perturbations when facing challenging questions.
- **Reliability of LLMs as Raters**: Open-ended responses rated by GPT-4 and Llama3-70b show high consistency between the two raters, indicating that this method may be applicable in similar evaluation scenarios.

Together, these findings reveal the complexity of LLMs' psychological attributes and point toward more reliable psychometric tools, with the goal of making LLM behavior more predictable and controllable across application scenarios.
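The rater-consistency finding can be made concrete with a small agreement calculation. The summary does not say which agreement statistic the authors used, so the sketch below assumes Cohen's kappa, a standard chance-corrected measure for two raters. The function name `cohens_kappa` and the sample ratings are hypothetical and purely illustrative; they are not data from the paper.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same items.

    Assumption: ratings are categorical labels (for example, 1-5 scores
    assigned by two LLM raters to the same open-ended responses).
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty item set")

    n = len(ratings_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Expected agreement if each rater labeled independently according
    # to their own label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings from two raters on four responses (illustrative only).
rater_one = [1, 1, 0, 0]
rater_two = [1, 1, 0, 1]
kappa = cohens_kappa(rater_one, rater_two)
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 means the raters agree no more often than their label frequencies alone would predict. For ordinal scores, a weighted variant may be a better fit than this unweighted form.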

Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models
