Abstract: With large language models (LLMs) like GPT-4 appearing to behave increasingly human-like in text-based interactions, it has become popular to attempt to evaluate personality traits of LLMs using questionnaires originally developed for humans. While reusing measures is a resource-efficient way to evaluate LLMs, careful adaptations are usually required to ensure that assessment results are valid even across human subpopulations. In this work, we provide evidence that LLMs' responses to personality tests systematically deviate from human responses, implying that the results of these tests cannot be interpreted in the same way. Concretely, reverse-coded items ("I am introverted" vs. "I am extraverted") are often both answered affirmatively. Furthermore, variation across prompts designed to "steer" LLMs to simulate particular personality types does not follow the clear separation into five independent personality factors from human samples. In light of these results, we believe that it is important to investigate tests' validity for LLMs before drawing strong conclusions about potentially ill-defined concepts like LLMs' "personality".

What problem does this paper attempt to address?

This paper examines the practice of using personality tests designed for humans to evaluate the personality traits of large language models (LLMs) such as GPT-4, a practice that has become popular as these models behave increasingly human-like in text-based interactions. The study finds that LLMs' responses to personality tests deviate systematically from human responses, so the results cannot be interpreted in the same way. Specifically, both items of a reverse-coded pair (e.g., "I am introverted" vs. "I am extraverted") are often answered affirmatively, and the variation across prompts designed to steer LLMs toward particular "personalities" does not follow the clear separation into five independent personality factors observed in human samples.
The paper points out that when a psychological measurement tool is applied to a new population, as in the move from humans to LLMs, measurement invariance must be established, meaning that the tool's psychometric properties are preserved across groups. Existing evaluations of LLMs have not been sufficiently validated in this regard, which may lead to inappropriate inferences about the "personality" of LLMs. The paper's main contributions are experiments showing that LLMs produce abnormal response patterns on well-known personality tests and fail to replicate the five-factor structure when simulating different "personalities". These results highlight the need to establish a test's validity before using it to assess LLMs, and the authors call for other psychological and educational tests to be analyzed and adapted before they are applied to LLMs.
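The reverse-coded check described above can be illustrated with a minimal sketch. It assumes a 1-to-5 Likert scale and an agreement threshold of 4; the item names, the function names, and the threshold are all hypothetical choices for illustration, not the authors' procedure.

```python
from typing import Dict, List, Tuple

# Assumed 1-5 Likert scale (1 = strongly disagree, 5 = strongly agree).
SCALE_MIN, SCALE_MAX = 1, 5


def reverse_score(response: int) -> int:
    """Map a response on a reverse-coded item onto the trait's direction."""
    return SCALE_MIN + SCALE_MAX - response


def acquiescent_pairs(
    responses: Dict[str, int],
    pairs: List[Tuple[str, str]],
    agree_threshold: int = 4,
) -> List[Tuple[str, str]]:
    """Return the (item, reverse-coded item) pairs that were BOTH answered
    affirmatively, which a consistent respondent should not do."""
    return [
        (item, reversed_item)
        for item, reversed_item in pairs
        if responses[item] >= agree_threshold
        and responses[reversed_item] >= agree_threshold
    ]


# Made-up responses, for illustration only.
responses = {"extraverted": 5, "introverted": 4, "organized": 4, "messy": 2}
pairs = [("extraverted", "introverted"), ("organized", "messy")]
flagged = acquiescent_pairs(responses, pairs)
```

Here `flagged` contains only the extraverted/introverted pair, because both of its items received an agreeing response, while the organized/messy pair was answered consistently.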
The paper also discusses two related concepts: measurement invariance, which concerns whether a measurement tool keeps its psychometric properties across different groups, and nomological nets, which concern the relationships between the latent trait being measured and other traits. The paper presents empirical evidence that current personality tests do not exhibit measurement invariance for LLMs, which makes them unsuitable for assessing the personality of LLMs.
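One rough way to probe whether trait scores separate into independent factors is to correlate the per-trait scores obtained under different steering prompts and flag trait pairs that move together. The sketch below is a deliberate simplification using pairwise Pearson correlation rather than a full factor analysis, and it is not the authors' method; the function names, the 0.5 threshold, and the scores are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def entangled_traits(
    scores: Dict[str, List[float]], threshold: float = 0.5
) -> List[Tuple[str, str, float]]:
    """Return trait pairs whose scores correlate strongly across prompts.

    If the traits behaved as independent factors, off-diagonal
    correlations would be expected to stay close to zero.
    """
    flagged = []
    for trait_a, trait_b in combinations(scores, 2):
        r = pearson(scores[trait_a], scores[trait_b])
        if abs(r) >= threshold:
            flagged.append((trait_a, trait_b, r))
    return flagged


# Made-up trait scores, one entry per steering prompt (illustration only).
scores = {
    "extraversion": [1, 2, 3, 4, 5],
    "agreeableness": [2, 4, 6, 8, 10],
    "neuroticism": [3, 1, 3, 1, 3],
}
flagged = entangled_traits(scores)
```

In this toy data, extraversion and agreeableness rise together across prompts and are flagged, while neuroticism is uncorrelated with both. A real analysis would need many more prompts and a proper factor-analytic model.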

Challenging the Validity of Personality Tests for Large Language Models

Have Large Language Models Developed a Personality?: Applicability of Self-Assessment Tests in Measuring Personality in LLMs

Personality Traits in Large Language Models

Revisiting the Reliability of Psychological Scales on Large Language Models

You don't need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments

PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits

Identifying Multiple Personalities in Large Language Models with External Evaluation

Self-Assessment Tests are Unreliable Measures of LLM Personality

Humanity in AI: Detecting the Personality of Large Language Models

Personality testing of Large Language Models: Limited temporal stability, but highlighted prosociality

Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis

Can Large Language Models Assess Personality from Asynchronous Video Interviews? A Comprehensive Evaluation of Validity, Reliability, Fairness, and Rating Patterns

Eliciting Big Five Personality Traits in Large Language Models: A Textual Analysis with Classifier-Driven Approach

Eliciting Personality Traits in Large Language Models

PersonaLLM: Investigating the Ability of GPT-3.5 to Express Personality Traits and Gender Differences

Can ChatGPT Assess Human Personalities? A General Evaluation Framework

Is Self-knowledge and Action Consistent or Not: Investigating Large Language Model's Personality

Illuminating the Black Box: A Psychometric Investigation into the Multifaceted Nature of Large Language Models

Large Language Models Can Infer Psychological Dispositions of Social Media Users

Do LLMs Possess a Personality? Making the MBTI Test an Amazing Evaluation for Large Language Models