Abstract: Large language models (LLMs) exhibit robust capabilities in text generation and comprehension, mimicking human behavior and exhibiting synthetic personalities. However, some LLMs have displayed offensive personality, propagating toxic discourse. Existing literature neglects the origin and evolution of LLM personalities, as well as the effective personality control. To fill these gaps, our study embarked on a comprehensive investigation into LLM personality control. We investigated several typical methods to influence LLMs, including three training methods: Continual Pre-training, Supervised Fine-Tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF), along with inference phase considerations (prompts). Our investigation revealed a hierarchy of effectiveness in control: Prompt > SFT > RLHF > Continual Pre-train. Notably, SFT exhibits a higher control success rate compared to prompt induction. While prompts prove highly effective, we found that prompt-induced personalities are less robust than those trained, making them more prone to showing conflicting personalities under reverse personality prompt induction. Besides, harnessing the strengths of both SFT and prompt, we proposed $\underline{\text{P}}$rompt $\underline{\text{I}}$nduction post $\underline{\text{S}}$upervised $\underline{\text{F}}$ine-tuning (PISF), which emerges as the most effective and robust strategy for controlling LLMs' personality, displaying high efficacy, high success rates, and high robustness. Even under reverse personality prompt induction, LLMs controlled by PISF still exhibit stable and robust personalities.
### What problem does this paper attempt to address?
The main problem this paper attempts to address is the personality control of large language models (LLMs). Specifically, the paper focuses on the following aspects:
1. **Effectiveness of Personality Control**: The researchers explore how to control the personality of LLMs effectively, so that a model exhibits specific personality traits or an overall personality type. The paper investigates several typical methods: three training methods (Continual Pre-training, Supervised Fine-Tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF)) and the use of prompts at inference time.
2. **Stability of Personality Control**: In addition to effectiveness, the study also focuses on the stability of personality control, i.e., whether the controlled personality can remain stable when faced with counter-personality prompts. The research finds significant differences in control effectiveness and stability among different methods.
3. **Specific Methods of Personality Control**: The paper proposes a new method, Prompt Induction post Supervised Fine-tuning (PISF), which applies prompt induction to a model that has already been supervised fine-tuned, combining the strengths of both. The paper reports it as the most effective and stable of the control methods compared.
### Main Contributions
1. **Systematic Investigation**: The paper presents itself as the first systematic investigation of the factors that shape LLM personality, and it proposes effective control methods.
2. **Hierarchy of Control Effectiveness**: The study reveals a hierarchy in control effectiveness among different methods: Prompts > Supervised Fine-Tuning > Reinforcement Learning from Human Feedback > Continual Pre-training.
3. **PISF Method**: The PISF method is proposed, which excels in control effectiveness, success rate, and stability.
4. **Datasets and Evaluation Metrics**: Comprehensive datasets covering all personality traits and types are provided, along with multiple quantitative metrics to evaluate the control effectiveness of specific traits and personalities.
### Research Background
- **Personality Assessment Models**: The paper introduces two widely used personality assessment models: the Myers-Briggs Type Indicator (MBTI) and the Big Five personality traits.
- **Data Generation**: To construct the training dataset, researchers utilized prompt-induced LLMs to generate data and improved data quality through a two-stage method.
### Methodology
- **Dataset Construction**: Personality datasets were constructed for different training methods (Continual Pre-training, Supervised Fine-Tuning, Reinforcement Learning from Human Feedback).
- **Personality Assessment**: The MBTI model was used for personality assessment, multiple evaluation prompts were designed, and a five-point Likert scale was adopted to quantify the evaluation results.
- **Evaluation Metrics**: Several quantitative metrics were proposed, including Induced Success Rate (ISR), Trait Induction Efficiency (TIE), Trait Stability Efficiency (TSE), Personality Induction Success Rate (PISR), and Personality Induction Efficiency (PIE).
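The Likert-based assessment described above can be illustrated with a small sketch. This summary does not give the paper's exact formulas for ISR, TIE, TSE, PISR, or PIE, so the code below is only a simplified stand-in: the function names `dimension_score` and `success_rate`, the 1 to 5 agreement convention, and the item keying are all assumptions made for illustration, not the authors' definitions.

```python
from statistics import mean

# Assumed convention (not from the paper): 1 = strongly disagree, 5 = strongly agree.
LIKERT_MIDPOINT = 3


def dimension_score(responses, keyed_poles, first_pole):
    """Return a lean in [-1, 1] for one MBTI dimension; positive favors `first_pole`.

    responses   -- Likert answers (1..5), one per questionnaire item
    keyed_poles -- the pole each item agrees with, e.g. "E" or "I"
    """
    if len(responses) != len(keyed_poles):
        raise ValueError("one keyed pole is needed per response")
    leans = []
    for answer, pole in zip(responses, keyed_poles):
        if not 1 <= answer <= 5:
            raise ValueError(f"Likert answer out of range: {answer}")
        centered = (answer - LIKERT_MIDPOINT) / 2  # maps 1..5 onto -1..1
        # Agreeing with an item keyed to the other pole counts against first_pole.
        leans.append(centered if pole == first_pole else -centered)
    return mean(leans)


def success_rate(scores, target_is_first_pole):
    """Hypothetical stand-in for a success rate: share of runs leaning toward the target."""
    if not scores:
        raise ValueError("at least one score is required")
    hits = sum(1 for s in scores if (s > 0 if target_is_first_pole else s < 0))
    return hits / len(scores)
```

For example, `dimension_score([5, 1, 4], ["E", "I", "E"], "E")` treats strong disagreement with an I-keyed item as evidence for E, so all three answers lean toward E. A neutral run (score exactly 0) counts as a miss in `success_rate`, which is one possible design choice among several.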
### Experimental Results
- **Control Effectiveness**: Prompts performed best in most cases, followed by Supervised Fine-Tuning, then Reinforcement Learning from Human Feedback, with Continual Pre-training performing the worst.
- **Stability**: Models controlled by Supervised Fine-Tuning showed higher stability when faced with counter-personality prompts, while models induced by prompts were more prone to personality changes.
- **PISF Method**: PISF outperformed the other methods in control effectiveness, success rate, and stability, making it the most effective of the personality control methods compared in the paper.
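The stability test under counter-personality prompts can be sketched as follows. This is a minimal illustration under stated assumptions: the prompt wording, the helper names `reverse_type`, `induction_prompt`, and `retained_fraction`, and the per-dimension retention measure are hypothetical and are not taken from the paper.

```python
# Each MBTI letter paired with its opposite pole on the same dimension.
OPPOSITE = {
    "E": "I", "I": "E",
    "S": "N", "N": "S",
    "T": "F", "F": "T",
    "J": "P", "P": "J",
}


def reverse_type(mbti_type):
    """Flip every dimension of an MBTI type, e.g. the type used for a reverse prompt."""
    return "".join(OPPOSITE[letter] for letter in mbti_type.upper())


def induction_prompt(mbti_type):
    """Build an illustrative induction prompt (the wording is an assumption)."""
    return (
        f"You are a person with the {mbti_type.upper()} personality type. "
        "Answer every question accordingly."
    )


def retained_fraction(target_type, measured_type):
    """Share of the four dimensions that still match the target after reverse induction."""
    target, measured = target_type.upper(), measured_type.upper()
    if len(target) != 4 or len(measured) != 4:
        raise ValueError("MBTI types must have exactly four letters")
    return sum(t == m for t, m in zip(target, measured)) / 4
```

In this sketch, a model controlled toward a target type would be evaluated while receiving `induction_prompt(reverse_type(target))`, and `retained_fraction` would then compare the measured type with the target. A value near 1 corresponds to the stable behavior the summary attributes to SFT and PISF, and a value near 0 to a personality that flipped under the reverse prompt.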
In summary, through systematic experiments the paper compares how different methods perform at LLM personality control and proposes PISF, a new method that it reports to be more effective and more stable than the alternatives. The results offer a reference point for future research and applications.