The Better Angels of Machine Personality: How Personality Relates to LLM Safety

Jie Zhang, Dongrui Liu, Chen Qian, Ziyue Gan, Yong Liu, Yu Qiao, Jing Shao
2024-07-17
Abstract: Personality psychologists have analyzed the relationship between personality and safety behaviors in human society. Although Large Language Models (LLMs) demonstrate personality traits, the relationship between personality traits and safety abilities in LLMs still remains a mystery. In this paper, we discover that LLMs' personality traits are closely related to their safety abilities, i.e., toxicity, privacy, and fairness, based on the reliable MBTI-M scale. Meanwhile, the safety alignment generally increases various LLMs' Extraversion, Sensing, and Judging traits. According to such findings, we can edit LLMs' personality traits and improve their safety performance, e.g., inducing personality from ISTJ to ISTP resulted in a relative improvement of approximately 43% and 10% in privacy and fairness performance, respectively. Additionally, we find that LLMs with different personality traits are differentially susceptible to jailbreak. This study pioneers the investigation of LLM safety from a personality perspective, providing new insights into LLM safety enhancement.
Computation and Language, Computers and Society
What problem does this paper attempt to address?
This paper explores the relationship between the personality traits of large language models (LLMs) and their safety, and attempts to improve safety by editing those traits. Specifically:

1. **Research Background and Motivation**: Psychologists have analyzed the relationship between personality and safety behaviors in human society. Although LLMs exhibit personality traits, the relationship between these traits and the models' safety capabilities (toxicity, privacy, and fairness) remains unknown.
2. **Main Findings**:
   - The personality traits of LLMs, measured with the MBTI-M scale, are closely related to their safety, and adjusting specific traits can improve safety performance.
   - Changing an LLM's personality from ISTJ to ISTP yielded relative improvements of approximately 43% in privacy and 10% in fairness performance.
   - LLMs with different personality traits show varying susceptibility to jailbreak attacks; models exhibiting stronger Extraversion, Intuition, and Feeling traits are more easily compromised.
3. **Methodology**: To assess the personality traits of LLMs, the researchers selected the latest MBTI-M scale and ran multiple evaluations to ensure the results were reliable. Comparing personality traits before and after safety alignment showed that alignment generally increases the Extraversion, Sensing, and Judging traits of LLMs.
4. **Experimental Validation**: The authors controllably edited LLMs' personality traits using the steering vector technique. The results show that changing a specific personality dimension can enhance the model's safety performance while leaving the other personality dimensions relatively unchanged.
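The summary names the steering vector technique without describing how it works. As a rough illustration only, the sketch below shows one common way such a vector is built: take the difference between the mean hidden activations collected under two contrasting traits, then add a scaled copy of that difference to a hidden state at inference time. The difference-of-means construction, the function names (`mean_vector`, `steering_vector`, `steer`), the scale `alpha`, and the toy numbers are all assumptions made for this example; none of them is taken from the paper.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*vectors)]


def steering_vector(target_acts, source_acts):
    """Difference of mean activations, pointing from the source trait to the target trait."""
    target_mean = mean_vector(target_acts)
    source_mean = mean_vector(source_acts)
    return [t - s for t, s in zip(target_mean, source_mean)]


def steer(hidden_state, vector, alpha=1.0):
    """Shift a single hidden state along the steering vector, scaled by alpha."""
    return [h + alpha * v for h, v in zip(hidden_state, vector)]


# Toy 3-dimensional "activations" (made-up numbers, not data from the paper).
judging_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
perceiving_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

# Hypothetical Judging -> Perceiving direction, as in an ISTJ -> ISTP edit.
j_to_p = steering_vector(perceiving_acts, judging_acts)
steered = steer([1.0, 1.0, 1.0], j_to_p, alpha=0.5)
print(j_to_p, steered)
```

In a real model the activations would come from a chosen transformer layer and the shift would be applied during the forward pass; the choice of layer and the value of `alpha` are tuning decisions that this sketch leaves open.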
In summary, this paper aims to reveal the relationship between the personality traits of LLMs and their safety, and proposes a new method to enhance the safety of LLMs by editing their personality traits. This provides new perspectives and ideas for future research on the safety and alignment of LLMs.