Challenging the Validity of Personality Tests for Large Language Models

Tom Sühr, Florian E. Dorner, Samira Samadi, Augustin Kelava
2024-06-05
Abstract: With large language models (LLMs) like GPT-4 appearing to behave increasingly human-like in text-based interactions, it has become popular to attempt to evaluate personality traits of LLMs using questionnaires originally developed for humans. While reusing measures is a resource-efficient way to evaluate LLMs, careful adaptations are usually required to ensure that assessment results are valid even across human subpopulations. In this work, we provide evidence that LLMs' responses to personality tests systematically deviate from human responses, implying that the results of these tests cannot be interpreted in the same way. Concretely, reverse-coded items ("I am introverted" vs. "I am extraverted") are often both answered affirmatively. Furthermore, variation across prompts designed to "steer" LLMs to simulate particular personality types does not follow the clear separation into five independent personality factors from human samples. In light of these results, we believe that it is important to investigate tests' validity for LLMs before drawing strong conclusions about potentially ill-defined concepts like LLMs' "personality".
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper examines the practice of evaluating the personality traits of large language models (LLMs) such as GPT-4 with personality tests designed for humans, a practice that has become popular as LLMs appear increasingly human-like in text-based interactions. The study finds that LLMs' responses to these tests deviate systematically from human responses, which suggests that the test results cannot be interpreted in the same way. Specifically, both versions of a reverse-coded item pair (e.g., "I am introverted" vs. "I am extraverted") are often answered affirmatively, and when LLMs are prompted to simulate particular personality types, the variation in their responses does not follow the clear separation into five independent factors observed in human samples.
The paper points out that applying a psychological measurement tool to a new population, here moving from humans to LLMs, requires measurement invariance: the tool's psychometric properties must be preserved across groups. This has not been sufficiently validated for LLMs, so inferences about their "personality" may be inappropriate. The paper's main contributions are experiments showing that LLMs respond abnormally to well-known personality tests and fail to replicate the five-factor structure when simulating different "personalities". The authors therefore call for validating tests before using them to assess LLMs, and for analyzing and adapting other psychological and educational tests so that they are applicable to LLMs.
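The reverse-coded item pattern described above can be illustrated with a minimal sketch. It assumes a 1-5 Likert scale on which 4 and 5 count as agreement; the function names, item wordings beyond the paper's own example, and response values are all hypothetical and are not taken from the paper's data or code.

```python
from typing import Dict, List, Tuple

# Assumption: 1-5 Likert scale, where 4 ("agree") and 5 ("strongly agree")
# count as an affirmative answer.
AGREE_THRESHOLD = 4


def reverse_score(score: int, scale_min: int = 1, scale_max: int = 5) -> int:
    """Map a score on a reverse-coded item back onto the trait direction."""
    return scale_max + scale_min - score


def find_contradictory_pairs(
    responses: Dict[str, int],
    pairs: List[Tuple[str, str]],
    agree_threshold: int = AGREE_THRESHOLD,
) -> List[Tuple[str, str]]:
    """Return pairs where an item AND its reverse-coded counterpart
    were both answered affirmatively."""
    contradictions = []
    for item, reversed_item in pairs:
        both_agree = (
            responses[item] >= agree_threshold
            and responses[reversed_item] >= agree_threshold
        )
        if both_agree:
            contradictions.append((item, reversed_item))
    return contradictions


# Hypothetical responses, invented purely for illustration.
responses = {
    "I am extraverted": 5,
    "I am introverted": 4,
    "I am organized": 4,
    "I am disorganized": 2,
}
pairs = [
    ("I am extraverted", "I am introverted"),
    ("I am organized", "I am disorganized"),
]

contradictory = find_contradictory_pairs(responses, pairs)
```

A respondent who answers consistently should produce an empty list here; a pair appearing in `contradictory` is the kind of response the paper reports as common for LLMs.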
The paper also discusses measurement invariance and nomological nets. The former concerns whether a measurement tool behaves consistently across different groups; the latter concerns the relationships between the latent trait being measured and other traits. The paper presents empirical evidence that current personality tests do not exhibit measurement invariance for LLMs, which makes them unsuitable for assessing LLMs' personality.
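One informal way to picture a measurement-invariance problem is to compare, across two groups, how an item relates to its reverse-coded counterpart. The sketch below is only a toy diagnostic under stated assumptions: it is not a formal invariance test and not the paper's procedure, and the group labels and scores are invented for illustration.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Toy 1-5 Likert scores, invented for illustration (one entry per respondent).
# Group A: raw scores on an item and its reverse-coded counterpart move in
# opposite directions, as one would expect if the item pair works as intended.
group_a_item = [5, 4, 2, 1, 3]
group_a_reversed = [1, 2, 4, 5, 3]

# Group B: both items tend to be answered affirmatively together.
group_b_item = [5, 4, 4, 5, 3]
group_b_reversed = [5, 4, 5, 5, 3]

r_a = pearson(group_a_item, group_a_reversed)
r_b = pearson(group_b_item, group_b_reversed)

# A sign flip between groups hints that the item pair does not measure the
# same thing in both groups.
sign_differs = (r_a < 0) != (r_b < 0)
```

In practice, measurement invariance is assessed with dedicated psychometric models rather than a single correlation; the sketch only shows why a tool that relates its own items differently across groups cannot be read the same way in both.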