Revisiting the Reliability of Psychological Scales on Large Language Models

Jen-tse Huang, Wenxiang Jiao, Man Ho Lam, Eric John Li, Wenxuan Wang, Michael R. Lyu

2024-10-04

Abstract: Recent research has focused on examining Large Language Models' (LLMs) characteristics from a psychological standpoint, acknowledging the necessity of understanding their behavioral characteristics. The administration of personality tests to LLMs has emerged as a noteworthy area in this context. However, the suitability of employing psychological scales, initially devised for humans, on LLMs is a matter of ongoing debate. Our study aims to determine the reliability of applying personality assessments to LLMs, explicitly investigating whether LLMs demonstrate consistent personality traits. Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory, indicating a satisfactory level of reliability. Furthermore, our research explores the potential of GPT-3.5 to emulate diverse personalities and represent various groups, a capability increasingly sought after in social sciences for substituting human participants with LLMs to reduce costs. Our findings reveal that LLMs have the potential to represent different personalities with specific prompt instructions.

Computation and Language

What problem does this paper attempt to address?

This paper evaluates whether psychological scales, originally designed for humans, can be reliably applied to large language models (LLMs), that is, whether these models exhibit consistent personality traits when assessed with such scales. The authors systematically analyze how five factors (instruction templates, item rephrasings, language, choice labels, and choice order) affect the stability of LLM responses. The study also examines whether LLMs can imitate different personality characteristics under specific prompts or scenarios, a capability with potential value in social science research, particularly for replacing human participants.

The main contributions of the paper include:
- A comprehensive analysis, described as the first of its kind, of the reliability of psychological scales applied to LLMs across five factors, showing that GPT-3.5-Turbo exhibits stable and distinctive personality traits.
- A demonstration of the potential of LLMs to simulate diverse populations, offering a new tool for social science research.
- A framework for evaluating the reliability of psychological scales on LLMs, laying the foundation for future work that validates more types of scales on different LLMs.
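The "settings" in the abstract can be read as combinations of the five factors named above. The following is a minimal sketch of how such a grid might be enumerated. The per-factor level counts are an assumption chosen only so that their product equals the 2,500 settings per model stated in the abstract; this page does not give the actual counts, and the names `FACTOR_LEVELS` and `enumerate_settings` are hypothetical.

```python
from itertools import product

# ASSUMPTION: illustrative level counts per factor. They are picked so the
# product is 2,500 (the figure in the abstract); the real counts are not
# stated on this page.
FACTOR_LEVELS = {
    "instruction_template": 5,
    "item_rephrasing": 5,
    "language": 10,
    "choice_label": 5,
    "choice_order": 2,
}


def enumerate_settings(factor_levels):
    """Return every combination of factor levels as a list of dicts.

    Each dict maps a factor name to the index of the chosen level.
    """
    names = list(factor_levels)
    level_ranges = [range(factor_levels[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_ranges)]


settings = enumerate_settings(FACTOR_LEVELS)
print(len(settings))  # 2500
```

Under this reading, each setting would correspond to one administration of the scale to a model, and reliability would be judged by how much the resulting trait scores vary across settings.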

PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits

Challenging the Validity of Personality Tests for Large Language Models

Personality testing of Large Language Models: Limited temporal stability, but highlighted prosociality

Identifying Multiple Personalities in Large Language Models with External Evaluation

Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis

Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench

Personality Traits in Large Language Models

You don't need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments

LMLPA: Language Model Linguistic Personality Assessment

PersonaLLM: Investigating the Ability of GPT-3.5 to Express Personality Traits and Gender Differences

Can Large Language Models Assess Personality from Asynchronous Video Interviews? A Comprehensive Evaluation of Validity, Reliability, Fairness, and Rating Patterns

Have Large Language Models Developed a Personality?: Applicability of Self-Assessment Tests in Measuring Personality in LLMs

Illuminating the Black Box: A Psychometric Investigation into the Multifaceted Nature of Large Language Models

Self-assessment, Exhibition, and Recognition: a Review of Personality in Large Language Models

Humanity in AI: Detecting the Personality of Large Language Models

Evaluating Psychological Safety of Large Language Models

Perils and opportunities in using large language models in psychological research

Eliciting Big Five Personality Traits in Large Language Models: A Textual Analysis with Classifier-Driven Approach

Using large language models in psychology