A Data Set of Synthetic Utterances for Computational Personality Analysis

Yair Neuman, Yochai Cohen
DOI: https://doi.org/10.1038/s41597-024-03488-6
2024-06-13
Scientific Data
Abstract: The computational analysis of human personality has mainly focused on the Big Five personality theory, and the psychodynamic approach is almost nonexistent despite its rich theoretical grounding and relevance to various tasks. Here, we provide a data set of 4972 synthetic utterances corresponding with five personality dimensions described by the psychodynamic approach: depressive, obsessive, paranoid, narcissistic, and anti-social psychopathic. The utterances have been generated through AI with a deep theoretical orientation that motivated the design of prompts for GPT-4. The dataset has been validated through 14 tests, and it may be relevant for the computational study of human personality and the design of authentic persona in digital domains, from gaming to the artistic generation of movie characters.
multidisciplinary sciences
What problem does this paper attempt to address?
The paper addresses the lack of a large-scale data set of utterances for computational personality analysis grounded in the psychodynamic approach. It builds such a data set by generating synthetic utterances that correspond to five personality dimensions: depressive, obsessive, paranoid, narcissistic, and anti-social psychopathic. The data set is intended to support research in areas such as computational psychology, dialogue agent development, and character design in the gaming industry. The main objectives of the paper are:
1. Constructing and validating a data set of 4972 synthetic utterances corresponding to the five personality dimensions described by the psychodynamic approach.
2. Ensuring the quality of the data set through multiple validation methods, including human annotator evaluation, computational tool validation (such as LangChain and GPT-4), machine learning model classification, and ecological validity testing.
3. Demonstrating the potential of the data set in application scenarios such as generating game characters with specific personality traits and analyzing conversations of elderly individuals to identify symptoms of depression.