Post-hoc Reward Calibration: A Case Study on Length Bias

Zeyu Huang, Zihan Qiu, Zili Wang, Edoardo M. Ponti, Ivan Titov
2024-09-26
Abstract: Reinforcement Learning from Human Feedback aligns the outputs of Large Language Models with human values and preferences. Central to this process is the reward model (RM), which translates human feedback into training signals for optimising LLM behaviour. However, RMs can develop biases by exploiting spurious correlations in their training data, such as favouring outputs based on length or style rather than true quality. These biases can lead to incorrect output rankings, sub-optimal model evaluations, and the amplification of undesirable behaviours in LLM alignment. This paper addresses the challenge of correcting such biases without additional data or training, introducing the concept of Post-hoc Reward Calibration. We first propose an intuitive approach to estimate the bias term and thus remove it to approximate the underlying true reward. We then extend the approach to a more general and robust form using Locally Weighted Regression. Focusing on the prevalent length bias, we validate our proposed approaches across three experimental settings, demonstrating consistent improvements: (1) a 3.11 average performance gain across 33 reward models on the RewardBench dataset; (2) enhanced alignment of RM rankings with GPT-4 evaluations and human preferences based on the AlpacaEval benchmark; and (3) improved Length-Controlled win rate of the RLHF process in multiple LLM-RM combinations. Our method is computationally efficient and generalisable to other types of bias and RMs, offering a scalable and robust solution for mitigating biases in LLM alignment. Our code and results are available at https://github.com/ZeroYuHuang/Reward-Calibration.
Artificial Intelligence, Computation and Language
### What problem does this paper attempt to solve?

This paper addresses **bias in the reward model (RM)** used in Reinforcement Learning from Human Feedback (RLHF). Specifically, the authors focus on **length bias**: the reward model may score an output according to its length rather than its true quality.

#### Main problem description

1. **Bias in the reward model**:
   - The RM may exploit spurious correlations in its training data, such as favouring outputs for their length or style instead of reflecting their actual quality.
   - These biases can lead to incorrect output rankings and sub-optimal model evaluations, and can amplify undesirable behaviors when aligning large language models (LLMs).
2. **Specific impacts of length bias**:
   - Length bias means that the reward model tends to give higher scores to longer outputs, even when those outputs are not of higher quality.
   - A policy trained against such a reward may learn to generate overly long responses without improving the quality of the content.
3. **Limitations of existing methods**:
   - Existing methods usually require additional data collection, retraining of the reward model, or modification of the reinforcement learning algorithm, all of which increase complexity and cost.

#### Solutions proposed in the paper

To address these problems, the authors introduce **Post-hoc Reward Calibration**, which corrects the scores of an already-trained RM without additional data or training. It has two parts:

- **Estimating the bias term**: The RM score is assumed to decompose into two parts, the true reward and a bias term. Estimating the bias term and subtracting it gives a closer approximation of the underlying true reward.
- **Locally Weighted Regression (LWR)**: To estimate the bias term in a more general and robust way, the authors extend the simple mean-estimation approach and use LWR for bias estimation.
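The two ideas above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the observed reward equals the true reward plus a bias that depends only on response length, estimates that bias with a locally weighted linear fit of reward on length, and subtracts it. The function names `estimate_length_bias` and `calibrate_rewards`, the Gaussian kernel, the `bandwidth` value, and the toy data are all assumptions made for this example.

```python
import math


def estimate_length_bias(lengths, rewards, x0, bandwidth):
    """Locally weighted linear fit of reward on length, evaluated at length x0.

    Hypothetical helper: uses Gaussian kernel weights centred on x0 (an assumed choice).
    """
    sw = sx = sy = sxx = sxy = 0.0
    for length, reward in zip(lengths, rewards):
        dx = length - x0  # centre on x0 so the fitted intercept is the prediction
        w = math.exp(-0.5 * (dx / bandwidth) ** 2)
        sw += w
        sx += w * dx
        sy += w * reward
        sxx += w * dx * dx
        sxy += w * dx * reward
    if sw == 0.0:
        raise ValueError("x0 is too far from every observed length")
    denom = sw * sxx - sx * sx
    if sxx == 0.0 or denom <= 1e-12 * sw * sxx:
        # All weighted neighbours share one length: fall back to a weighted mean.
        return sy / sw
    # Intercept of the weighted least-squares line, i.e. the fitted value at x0.
    return (sxx * sy - sx * sxy) / denom


def calibrate_rewards(lengths, rewards, bandwidth=100.0):
    """Subtract the estimated length-dependent bias from each reward."""
    return [
        reward - estimate_length_bias(lengths, rewards, length, bandwidth)
        for length, reward in zip(lengths, rewards)
    ]


# Toy data (made up): quality alternates +1/-1, plus a bias of 0.01 per token.
lengths = [100, 100, 200, 200, 300, 300, 400, 400]
quality = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
rewards = [q + 0.01 * n for q, n in zip(quality, lengths)]

calibrated = calibrate_rewards(lengths, rewards)
print([round(c, 3) for c in calibrated])
```

On this toy data the calibrated values recover the alternating quality signal, because the injected bias is linear in length and a local linear fit reproduces it. Note that the sketch removes any average dependence of reward on length, including a genuine correlation between quality and length if one exists; that is a simplifying assumption of the example.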
#### Experimental verification

The authors validate the proposed method in three experimental settings:

1. **Benchmark performance test**: Across 33 reward models evaluated on the RewardBench dataset, calibration yields an average performance gain of 3.11.
2. **LLM evaluation**: On the AlpacaEval benchmark, 184 LLMs were ranked using 8 open-source reward models; the calibrated rankings agree more closely with GPT-4 evaluations and human preferences.
3. **LLM alignment**: Across multiple LLM-RM combinations, using calibrated rewards in the RLHF process improves the Length-Controlled win rate.

According to the authors, these experiments show that post-hoc reward calibration is computationally efficient and generalisable to other types of bias and reward models, offering a scalable and robust way to mitigate bias in LLM alignment.
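The benchmark setting above scores a reward model on preference pairs: the model is counted as correct on a pair when it scores the chosen response above the rejected one. A minimal sketch of that metric follows, so the effect of a calibration step can be measured. The helper name `pairwise_accuracy`, the toy pairs, and the fixed per-token penalty `ASSUMED_SLOPE` (a stand-in for a real calibration step, not the paper's method) are all hypothetical, and the numbers are invented for illustration only.

```python
def pairwise_accuracy(pairs, score):
    """Fraction of (chosen, rejected) pairs where score(chosen) > score(rejected)."""
    correct = sum(1 for chosen, rejected in pairs if score(chosen) > score(rejected))
    return correct / len(pairs)


# Toy data (made up): each response is (raw_reward, length_in_tokens).
pairs = [
    ((1.0, 50), (1.5, 400)),   # long rejected response outscores a short chosen one
    ((2.0, 300), (0.5, 100)),
    ((0.8, 80), (0.9, 120)),
    ((1.2, 200), (1.0, 210)),
]

ASSUMED_SLOPE = 0.004  # hypothetical bias per token, standing in for an estimated bias


def raw_score(response):
    """Uncalibrated reward, as produced by the reward model."""
    return response[0]


def calibrated_score(response):
    """Raw reward minus the assumed length-dependent bias."""
    reward, length = response
    return reward - ASSUMED_SLOPE * length


print("raw accuracy:", pairwise_accuracy(pairs, raw_score))
print("calibrated accuracy:", pairwise_accuracy(pairs, calibrated_score))
```

Passing the scoring function as an argument keeps the metric independent of how the calibration is done, so the same helper can compare any two scoring rules on the same pairs.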