Systematic Evaluation of GPT-3 for Zero-Shot Personality Estimation

Adithya V Ganesan, Yash Kumar Lal, August Håkan Nilsson, H. Andrew Schwartz
2023-06-02
Abstract: Very large language models (LLMs) perform extremely well on a spectrum of NLP tasks in a zero-shot setting. However, little is known about their performance on human-level NLP problems which rely on understanding psychological concepts, such as assessing personality traits. In this work, we investigate the zero-shot ability of GPT-3 to estimate the Big 5 personality traits from users' social media posts. Through a set of systematic experiments, we find that zero-shot GPT-3 performance is somewhat close to an existing pre-trained SotA for broad classification upon injecting knowledge about the trait in the prompts. However, when prompted to provide fine-grained classification, its performance drops to close to a simple most frequent class (MFC) baseline. We further analyze where GPT-3 performs better, as well as worse, than a pretrained lexical model, illustrating systematic errors that suggest ways to improve LLMs on human-level NLP tasks.
Computation and Language
What problem does this paper attempt to address?
This paper evaluates how well large language models (LLMs), specifically GPT-3, can estimate personality traits in a zero-shot setting. The researchers ask whether GPT-3 can estimate the "Big Five" personality traits (openness, conscientiousness, extraversion, agreeableness, and neuroticism) from users' social media posts. Through a series of systematic experiments, they examine how injecting different types of knowledge about a trait into the prompt (definitions, lists of related words, and questionnaire item descriptions) affects GPT-3's performance.
The main contributions of the paper are:
1. It explores which information about personality is useful to GPT-3.
2. It compares GPT-3 with the current state-of-the-art methods (such as dictionary-based methods) for estimating personality traits.
3. It analyzes how the ordering of the outcome labels relates to model performance.
4. It examines whether GPT-3's predictions remain consistent when similar external knowledge is provided.
The study finds that GPT-3 performs relatively well when the task is simplified to binary classification, but that its performance drops significantly on the more fine-grained three-class version of the task. It also reports that GPT-3 performs better on some specific traits, especially when questionnaire item descriptions (ITEMDESC) are included in the prompt. Overall, however, the average zero-shot performance of GPT-3 remains below that of trained supervised models (such as WT-LEX). These findings clarify the capabilities and limitations of LLMs on human-level natural language processing tasks and suggest directions for future work.
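To make the setup concrete, the following is a minimal sketch of how a zero-shot prompt with injected trait knowledge might be assembled, alongside the most frequent class (MFC) baseline mentioned in the abstract. Everything in it is an assumption for illustration: the prompt wording, the `KNOWLEDGE` snippets (written here for extraversion only), and the function names `build_prompt` and `mfc_accuracy` are hypothetical and are not taken from the paper. No model call is made; the sketch only builds the prompt text.

```python
from collections import Counter

# Hypothetical knowledge snippets for one trait (extraversion).
# The wording is illustrative only and is not the paper's prompt text.
KNOWLEDGE = {
    "definition": "Extraversion describes how outgoing and energetic a person tends to be.",
    "word_list": "Related words: sociable, talkative, assertive, energetic.",
    "item_desc": "Questionnaire item: 'I see myself as someone who is talkative.'",
}


def build_prompt(posts, trait, knowledge_type=None, labels=("low", "high")):
    """Assemble a zero-shot prompt, optionally prefixed with trait knowledge.

    Two labels correspond to the broad (binary) task; passing three labels,
    e.g. ("low", "medium", "high"), gives the fine-grained variant.
    """
    parts = []
    if knowledge_type is not None:
        parts.append(KNOWLEDGE[knowledge_type])
    parts.append("Posts:")
    parts.extend(f"- {post}" for post in posts)
    options = " or ".join(labels)
    parts.append(f"Is this user's {trait} {options}? Answer with one word.")
    return "\n".join(parts)


def mfc_accuracy(gold_labels):
    """Accuracy obtained by always predicting the most frequent gold label."""
    counts = Counter(gold_labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(gold_labels)


if __name__ == "__main__":
    prompt = build_prompt(
        ["Had a great time at the party last night!"],
        trait="extraversion",
        knowledge_type="item_desc",
    )
    print(prompt)
    print(mfc_accuracy(["high", "high", "low", "low", "high"]))
```

Comparing a model's accuracy against `mfc_accuracy` on the same gold labels shows whether its predictions carry any signal beyond the label distribution, which is the comparison behind the abstract's remark about fine-grained classification.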