Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models

Yuan Li, Yue Huang, Hongyi Wang, Xiangliang Zhang, James Zou, Lichao Sun
2024-06-26
Abstract: Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants. The broader integration of LLMs into society has sparked interest in whether they manifest psychological attributes, and whether these attributes are stable, inquiries that could deepen the understanding of their behaviors. Inspired by psychometrics, this paper presents a framework for investigating psychology in LLMs, including psychological dimension identification, assessment dataset curation, and assessment with results validation. Following this framework, we introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence. This benchmark includes thirteen datasets featuring diverse scenarios and item types. Our findings indicate that LLMs manifest a broad spectrum of psychological attributes. We also uncover discrepancies between LLMs' self-reported traits and their behaviors in real-world scenarios. This paper demonstrates a thorough psychometric assessment of LLMs, providing insights into reliable evaluation and potential applications in AI and social sciences.
Computation and Language
What problem does this paper attempt to address?
This paper evaluates whether large language models (LLMs) exhibit human-like psychological attributes and how stable and reliably measurable those attributes are. Specifically, the researchers aim to answer the following key questions:

1. **Existence of Psychological Attributes**: Do LLMs exhibit human-like psychological attributes (such as personality, values, emotions, theory of mind, motivation, and intelligence), and can these attributes be quantified through a systematic evaluation framework?
2. **Stability of Attributes**: Are these psychological attributes stable across situations? For example, do LLMs exhibit consistent behavior patterns when facing similar situations?
3. **Consistency between Self-Report and Actual Behavior**: Do the self-reports of LLMs differ from their behaviors in real-world scenarios? That is, are their responses to closed-ended and open-ended questions consistent?
4. **Reliability of Evaluation Methods**: Are existing psychological measurement tools applicable to LLMs, and can they reliably evaluate the psychological attributes of LLMs?

To answer these questions, the authors propose a comprehensive psychometric benchmark covering six psychological dimensions (personality, values, emotions, theory of mind, motivation, and intelligence) and evaluate LLMs on 13 datasets. The framework aims to provide a systematic method for understanding the behavior patterns of LLMs, offering insights for both AI and the social sciences.

### Main Findings

- **Consistency**: LLMs behave consistently in tasks requiring reasoning, but show significant variability on preference-type questions that have no clear answer.
- **Differences between Closed-ended and Open-ended Responses**: For some LLMs, the traits measured in closed-ended evaluations do not match the traits displayed in open-ended responses. For example, a model may score low on extraversion in closed-ended evaluations yet show extroverted traits in open-ended responses.
- **Position Bias and Prompt Sensitivity**: Models differ in how sensitive they are to option positions and prompt changes. Some models are particularly vulnerable to prompt perturbations when facing challenging questions.
- **Reliability of LLMs as Raters**: Open-ended responses rated by GPT-4 and Llama3-70b show high consistency between the two raters, indicating that this method may be applicable in similar evaluation scenarios.

Together, these findings reveal the complexity of the psychological attributes of LLMs and point toward more reliable psychometric tools, with the goal of making LLM behavior more predictable and controllable across application scenarios.
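The rater-consistency finding can be made concrete with a small agreement calculation. The sketch below is a minimal illustration, not the paper's procedure: it assumes the two raters assign categorical scores to the same responses and uses Cohen's kappa as the agreement statistic, which is an assumption on our part since the summary does not name the metric. The function name `cohen_kappa` and the score lists are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labeling the same items.

    Assumption: kappa is used here for illustration only; the source
    does not state which agreement metric the authors computed.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical 1-5 scores from two LLM raters on eight open-ended responses.
rater_one = [5, 4, 4, 2, 1, 3, 5, 2]
rater_two = [5, 4, 3, 2, 1, 3, 5, 2]
kappa = cohen_kappa(rater_one, rater_two)
print(f"kappa = {kappa:.3f}")
```

Values near 1 indicate agreement well beyond chance, while values near 0 indicate agreement no better than chance. For ordinal scores such as rating scales, a weighted kappa or a correlation coefficient might be a better fit, since plain kappa treats a one-point disagreement the same as a four-point one.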