Abstract: Value alignment, which aims to ensure that large language models (LLMs) and other AI agents behave in accordance with human values, is critical for ensuring safety and trustworthiness of these systems. A key component of value alignment is the modeling of human preferences as a representation of human values. In this paper, we investigate the robustness of value alignment by examining the sensitivity of preference models. Specifically, we ask: how do changes in the probabilities of some preferences affect the predictions of these models for other preferences? To answer this question, we theoretically analyze the robustness of widely used preference models by examining their sensitivities to minor changes in preferences they model. Our findings reveal that, in the Bradley-Terry and the Plackett-Luce models, the probability of a preference can change significantly as other preferences change, especially when these preferences are dominant (i.e., with probabilities near 0 or 1). We identify specific conditions where this sensitivity becomes significant for these models and discuss the practical implications for the robustness and safety of value alignment in AI systems.

What problem does this paper attempt to address?

The problem this paper addresses is the robustness of preference models in value alignment. Specifically, the authors explore how changes in the probabilities of certain preferences affect the predicted probabilities of other preferences. By analyzing widely used preference models (such as the Bradley-Terry model and the Plackett-Luce model), they show that under certain conditions these models are highly sensitive to small changes, especially when certain preferences are dominant (i.e., their probabilities are close to 0 or 1). This sensitivity may significantly affect the probabilities of other preferences, thereby threatening the robustness and safety of value alignment in AI systems.

### Main research questions

- **Core question**: In common preference models, how do changes in the probabilities of certain preferences affect the model's predictions of other preferences?
- **Specific objectives**:
  - Analyze the probability sensitivity of preference models theoretically.
  - Explore the practical impact of these sensitivities on robustness and safety in the value-alignment process.
  - Compare the sensitivity of different preference models (such as the Bradley-Terry model and the K-tuple Plackett-Luce model).

### Research methods

1. **Definitions and assumptions**:
   - Define K-ary preferences and their probability models.
   - Assume that preference models depend only on the differences in option scores.
2. **Analysis of general pairwise preference models**:
   - Derive that the probability of any given preference can be expressed as a function of the probabilities of other preferences.
   - Analyze the sensitivities of these functions, especially how they behave when certain preferences are close to being dominant.
3. **Analysis of specific models**:
   - **Bradley-Terry model**: Analyze its sensitivity in detail and give specific M-sensitive regions.
   - **K-tuple Plackett-Luce model**: Extend the analysis and prove that it is more robust than the Bradley-Terry model.
4. **Experimental verification**:
   - Use synthetic data sets to train AI agents (such as LLMs) to verify the results of the theoretical analysis, especially the sensitivity of the model in the presence of dominant preferences.

### Key findings

- In the Bradley-Terry model and the Plackett-Luce model, changes in the probabilities of certain preferences may significantly affect the predictions of other preferences, especially when these preferences are close to being dominant.
- The K-tuple Plackett-Luce model (K > 2) is more robust than the pairwise Bradley-Terry model, but using longer preference tuples increases the cost of data collection.
- The experimental results show that the trained model exhibits significant sensitivity in the presence of dominant preferences, even when the training sample distribution changes only slightly.

### Practical significance

- **Robustness and safety**: Understanding the sensitivity of preference models is crucial for ensuring the robustness and safety of AI systems, especially in dynamic or uncertain environments.
- **Trade-offs**: When dealing with dominant preferences, a trade-off must be made between the robustness and the expressiveness of the model. For example, a model that suppresses unsafe behaviors in a limited domain can choose to handle dominant preferences, while a general-purpose model in a wide domain may need to weaken these preferences to improve robustness.

Through these studies, the authors provide a theoretical basis and practical guidance for improving value alignment in AI systems.
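The sensitivity to dominant preferences summarized above can be illustrated numerically. The following is a minimal sketch, not the authors' code and not their formal definition of M-sensitivity. It assumes the standard Bradley-Terry form, in which the probability that option i is preferred to option j is a sigmoid of the score difference s_i - s_j. Under that assumption the log-odds add along a chain, so the probability of (i over j) is determined by the probabilities of (i over k) and (k over j). The function names `bt_implied_prob` and `shift` are hypothetical helpers introduced only for this illustration.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def bt_implied_prob(p_ik: float, p_kj: float) -> float:
    """P(i over j) implied by P(i over k) and P(k over j) under Bradley-Terry.

    Since P(a over b) = sigmoid(s_a - s_b), we have
    logit(P(i over j)) = logit(P(i over k)) + logit(P(k over j)).
    """
    return sigmoid(logit(p_ik) + logit(p_kj))


def shift(p_ik: float, p_kj: float, delta: float) -> float:
    """Absolute change in P(i over j) when P(i over k) drops by delta."""
    return abs(bt_implied_prob(p_ik - delta, p_kj) - bt_implied_prob(p_ik, p_kj))


# Both settings imply P(i over j) = 0.5 before the perturbation.
moderate = shift(0.60, 0.40, 0.01)  # no dominant preferences
dominant = shift(0.99, 0.01, 0.01)  # both conditioning preferences near 0 or 1

print(f"shift with moderate preferences: {moderate:.4f}")
print(f"shift with dominant preferences: {dominant:.4f}")
```

In this sketch the same 0.01 perturbation moves the implied probability by roughly 0.01 when the conditioning preferences are moderate, and by roughly 0.17 when they are dominant, which matches the qualitative claim in the summary.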

Strong Preferences Affect the Robustness of Value Alignment
