On Diversified Preferences of Large Language Model Alignment

Dun Zeng, Yong Dai, Pengyu Cheng, Longyue Wang, Tianhao Hu, Wanshun Chen, Nan Du, Zenglin Xu
2024-10-05
Abstract: Aligning large language models (LLMs) with human preferences has been recognized as the key to improving LLMs' interaction quality. However, in this pluralistic world, human preferences can be diversified due to annotators' different tastes, which hinders the effectiveness of LLM alignment methods. This paper presents the first quantitative analysis of the experimental scaling law for reward models with varying sizes, from 1.3 billion to 7 billion parameters, trained with human feedback exhibiting diverse preferences. Our analysis reveals that the impact of diversified human preferences depends on both model size and data size. Larger models with sufficient capacity mitigate the negative effects of diverse preferences, while smaller models struggle to accommodate them. To mitigate the impact of diverse preferences, we introduce a new metric, Expected Calibration Error (ECE), to evaluate RMs and show their obvious positive correlation with the alignment performance of LLMs. Furthermore, we propose a Multi-Objective Reward learning method (MORE) to enhance the calibration performance of RMs on shared preferences. Through experiments on four models and five human preference datasets, we find the calibration error can be adopted as a key metric for evaluating RMs and MORE can obtain superior alignment performance.
Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the difficulties of aligning large language models (LLMs) with human preferences, in particular the challenges of training reward models (RMs) on diverse human preference datasets. It focuses on the following aspects:

1. **Diverse Preferences**: In the real world, different people may hold different preferences on the same topic, shaped by personal experience, educational background, religion, and culture. This diversity can negatively impact the alignment of LLMs.
2. **Impact of Model Capacity and Data Volume**: The paper examines how models of different sizes (from 1.3 billion to 7 billion parameters) handle diverse preferences. Larger models generally cope better with diverse preferences, while smaller models struggle to adapt.
3. **Calibration Performance**: The paper introduces Expected Calibration Error (ECE) as a new metric for evaluating reward models and finds that calibration performance, as measured by ECE, correlates positively with the alignment performance of LLMs.
4. **Multi-Objective Reward Learning Method (MORE)**: To mitigate the impact of diverse preferences on model performance, the paper proposes a multi-objective reward learning method (MORE). This method improves the calibration performance of reward models by minimizing reward drift through reweighting techniques.

### Main Contributions

1. **Revealing the Relationship Between Calibration Performance and Alignment Performance**: The paper is the first to demonstrate a positive correlation between the calibration performance of reward models and the alignment performance of large language models. It also shows that learning reward models on diverse preference datasets typically leads to higher calibration errors, indicating unreliable reward values.
2. **Proposing the Multi-Objective Reward Training Scheme (MORE)**: MORE alleviates reward drift by adaptively adjusting the learning gradients of reward models, significantly improving their calibration performance, especially on shared preferences.
3. **Experimental Validation**: The paper conducts experiments on multiple widely recognized and diverse preference datasets to validate the effectiveness of MORE. The results show that MORE significantly reduces reward drift and achieves lower Expected Calibration Error (ECE) values.

### Experimental Results

- **Reward Accuracy**: On mixed diverse preference datasets, larger LLMs (e.g., LLaMa2-7B) can maintain high reward accuracy. However, smaller models (e.g., Pythia-1.4B) perform poorly on some datasets.
- **Calibration Performance**: Although larger models can maintain high reward accuracy, mixed diverse preference datasets shift the reward distribution, leading to a decline in calibration performance. The paper confirms this with the ECE metric.
- **Effectiveness of MORE**: MORE shows significant improvements in calibration performance across all preference datasets, especially on shared Helpful&Harmless preferences. This indicates that MORE captures shared preferences more accurately and reduces calibration errors.

### Conclusion

By introducing the ECE metric and the MORE method, this paper addresses the impact of diverse preferences on the alignment performance of large language models, providing new insights and tools for future LLM alignment research.
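
To make the calibration metric concrete, the following is a minimal sketch of how ECE might be computed for a reward model on pairwise preference data. It rests on assumptions not stated in this summary: the model's confidence is taken to be the Bradley-Terry probability `sigmoid(r_chosen - r_rejected)`, confidences are grouped into equal-width bins on [0.5, 1], and ties count as incorrect. The function names and the default bin count are illustrative only and are not taken from the paper.

```python
import math


def preference_confidence(r_chosen, r_rejected):
    """Assumed Bradley-Terry probability that the annotated 'chosen' response wins."""
    x = r_chosen - r_rejected
    # Numerically stable sigmoid.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def expected_calibration_error(reward_pairs, n_bins=10):
    """ECE over (r_chosen, r_rejected) reward pairs.

    Confidence is the probability assigned to the predicted winner, so it
    lies in [0.5, 1]. A prediction is correct when the reward model ranks
    the annotated 'chosen' response strictly higher.
    """
    if not reward_pairs:
        raise ValueError("reward_pairs must not be empty")

    bins = [[] for _ in range(n_bins)]
    for r_chosen, r_rejected in reward_pairs:
        p = preference_confidence(r_chosen, r_rejected)
        confidence = max(p, 1.0 - p)
        correct = 1.0 if p > 0.5 else 0.0  # ties are counted as incorrect
        # Map confidence in [0.5, 1] onto one of n_bins equal-width bins.
        idx = min(int((confidence - 0.5) / 0.5 * n_bins), n_bins - 1)
        bins[idx].append((confidence, correct))

    total = len(reward_pairs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(k for _, k in bucket) / len(bucket)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (len(bucket) / total) * abs(accuracy - avg_confidence)
    return ece
```

Under these assumptions, a reward model that assigns large margins to pairs it ranks correctly yields an ECE near 0, while one that is confidently wrong yields an ECE near 1. A lower value therefore means the reward gaps are more trustworthy as probabilities, which is the sense in which the summary above treats lower ECE as better calibration.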