Abstract: Aligning large language models (LLMs) with human preferences has been recognized as the key to improving LLMs' interaction quality. However, in this pluralistic world, human preferences can be diversified due to annotators' different tastes, which hinders the effectiveness of LLM alignment methods. This paper presents the first quantitative analysis of the experimental scaling law for reward models with varying sizes, from 1.3 billion to 7 billion parameters, trained with human feedback exhibiting diverse preferences. Our analysis reveals that the impact of diversified human preferences depends on both model size and data size. Larger models with sufficient capacity mitigate the negative effects of diverse preferences, while smaller models struggle to accommodate them. To mitigate the impact of diverse preferences, we introduce a new metric, Expected Calibration Error (ECE), to evaluate RMs and show their obvious positive correlation with the alignment performance of LLMs. Furthermore, we propose a Multi-Objective Reward learning method (MORE) to enhance the calibration performance of RMs on shared preferences. Through experiments on four models and five human preference datasets, we find the calibration error can be adopted as a key metric for evaluating RMs and MORE can obtain superior alignment performance.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the difficulties of aligning large language models (LLMs) with human preferences, particularly those that arise when reward models (RMs) are trained on diverse human preference datasets. It focuses on the following aspects:

1. **Diverse Preferences**: In the real world, different people may hold different preferences on the same topic, influenced by personal experience, educational background, religion, and culture. This diversity can negatively impact the alignment of LLMs.
2. **Impact of Model Capacity and Data Volume**: The paper examines how models of different sizes (from 1.3 billion to 7 billion parameters) handle diverse preferences. Larger models generally cope better with diverse preferences, while smaller models struggle to adapt.
3. **Calibration Performance**: The paper introduces Expected Calibration Error (ECE) as a metric for evaluating reward models and finds a positive correlation between the calibration performance of RMs (that is, lower ECE) and the alignment performance of LLMs.
4. **Multi-Objective Reward Learning Method (MORE)**: To mitigate the impact of diverse preferences on model performance, the paper proposes a multi-objective reward learning method (MORE). The method improves the calibration performance of reward models by minimizing reward drift through reweighting techniques.

### Main Contributions

1. **Revealing the Relationship Between Calibration Performance and Alignment Performance**: The paper is the first to demonstrate a positive correlation between the calibration performance of reward models and the alignment performance of large language models. It also shows that learning reward models on diverse preference datasets typically leads to higher calibration errors, which indicates unreliable reward values.
2. **Proposing the Multi-Objective Reward Training Scheme (MORE)**: MORE alleviates reward drift by adaptively adjusting the learning gradients of reward models, significantly improving their calibration performance, especially on shared preferences.
3. **Experimental Validation**: The paper conducts experiments on multiple widely recognized and diverse preference datasets to validate the effectiveness of MORE. The results show that MORE significantly reduces reward drift and achieves lower Expected Calibration Error (ECE) values.

### Experimental Results

- **Reward Accuracy**: On mixed diverse preference datasets, larger LLMs (e.g., LLaMa2-7B) can maintain high reward accuracy, whereas smaller models (e.g., Pythia-1.4B) perform poorly on some datasets.
- **Calibration Performance**: Although larger models can maintain high reward accuracy, mixed diverse preference datasets affect the reward distribution, leading to a decline in calibration performance. The paper further validates this through the ECE metric.
- **Effectiveness of MORE**: MORE shows significant improvements in calibration performance across all preference datasets, especially on the shared Helpful&Harmless preferences. This indicates that MORE captures shared preferences more accurately and reduces calibration errors.

### Conclusion

By introducing the ECE metric and the MORE method, this paper addresses the impact of diverse preferences on the alignment performance of large language models and provides new insights and tools for future LLM alignment research.
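The calibration metric discussed in this summary can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that the reward model's confidence on a preference pair is the Bradley-Terry probability `sigmoid(r_chosen - r_rejected)`, and that ECE is computed with equal-width confidence bins over [0.5, 1]. The function names, the bin count, and the sample rewards are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def expected_calibration_error(reward_pairs, num_bins=10):
    """Estimate ECE for a reward model on preference pairs.

    reward_pairs: list of (reward_chosen, reward_rejected), where "chosen"
    is the response the human annotator preferred.
    Assumption: confidence is the Bradley-Terry probability of the side
    the model favors, so it always lies in [0.5, 1].
    """
    if not reward_pairs:
        raise ValueError("reward_pairs must not be empty")

    bins = [[] for _ in range(num_bins)]
    for reward_chosen, reward_rejected in reward_pairs:
        p_chosen = sigmoid(reward_chosen - reward_rejected)
        confidence = max(p_chosen, 1.0 - p_chosen)
        # The model is correct when it ranks the human-preferred response
        # higher; exact ties are counted as incorrect in this sketch.
        correct = 1.0 if p_chosen > 0.5 else 0.0
        # Equal-width bins over [0.5, 1.0]; clamp confidence == 1.0.
        index = min(int((confidence - 0.5) / 0.5 * num_bins), num_bins - 1)
        bins[index].append((confidence, correct))

    total = len(reward_pairs)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(k for _, k in items) / len(items)
        # Weight each bin's |accuracy - confidence| gap by its share of pairs.
        ece += (len(items) / total) * abs(accuracy - mean_confidence)
    return ece


if __name__ == "__main__":
    # Hypothetical reward values, for illustration only.
    sample_pairs = [(2.1, 0.3), (0.4, 0.9), (1.5, 1.4), (3.0, -1.0)]
    print(f"ECE: {expected_calibration_error(sample_pairs):.4f}")
```

Under these assumptions, a lower value means the reward gaps translate into preference probabilities that match how often the model actually ranks the human-preferred response first, which is the property the summary links to better alignment performance.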

On Diversified Preferences of Large Language Model Alignment
