Abstract: Reinforcement Learning from Human Feedback (RLHF) aligns the outputs of Large Language Models (LLMs) with human values and preferences. Central to this process is the reward model (RM), which translates human feedback into training signals for optimising LLM behaviour. However, RMs can develop biases by exploiting spurious correlations in their training data, such as favouring outputs based on length or style rather than true quality. These biases can lead to incorrect output rankings, sub-optimal model evaluations, and the amplification of undesirable behaviours in LLM alignment. This paper addresses the challenge of correcting such biases without additional data or training, introducing the concept of Post-hoc Reward Calibration. We first propose an intuitive approach that estimates the bias term and removes it to approximate the underlying true reward. We then extend the approach to a more general and robust form using Locally Weighted Regression. Focusing on the prevalent length bias, we validate our proposed approaches across three experimental settings, demonstrating consistent improvements: (1) a 3.11 average performance gain across 33 reward models on the RewardBench dataset; (2) enhanced alignment of RM rankings with GPT-4 evaluations and human preferences based on the AlpacaEval benchmark; and (3) improved Length-Controlled win rate of the RLHF process in multiple LLM-RM combinations. Our method is computationally efficient and generalisable to other types of bias and RMs, offering a scalable and robust solution for mitigating biases in LLM alignment. Our code and results are available at https://github.com/ZeroYuHuang/Reward-Calibration.

What problem does this paper attempt to address?

This paper addresses **bias in the reward model (RM)** used in Reinforcement Learning from Human Feedback (RLHF). Specifically, the authors focus on **length bias**: the reward model may score an output based on its length rather than its true quality.

#### Main problem description

1. **Bias in the reward model**:
   - The RM may exploit spurious correlations in its training data, such as favouring outputs based on length or style rather than true quality.
   - These biases can lead to incorrect output rankings, sub-optimal model evaluations, and the amplification of undesirable behaviours when aligning large language models (LLMs).
2. **Specific impacts of length bias**:
   - Length bias means that the reward model tends to give higher scores to longer outputs, even when those outputs are not of higher quality.
   - This bias may cause the aligned model to generate overly long responses without improving the quality of the content.
3. **Limitations of existing methods**:
   - Existing methods usually require additional data collection, retraining of the reward model, or modification of the reinforcement learning algorithm, all of which increase complexity and cost.

#### Solutions proposed in the paper

To address these problems, the authors introduce **Post-hoc Reward Calibration**, which has two components:

- **Estimating the bias term**: The score of the reward model is assumed to decompose into two parts, the true reward and a bias term. Estimating the bias term and subtracting it yields a score that more accurately reflects the true quality of the output.
- **Locally Weighted Regression (LWR)**: To estimate the bias term in a more general and robust way, the authors extend the simple mean-estimation method and use LWR for bias estimation.
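
The two components above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation (their code is in the linked repository): it assumes the observed reward is the true reward plus a bias that depends only on output length, and that the true reward is, on average, independent of length, so the local average of the observed reward at a given length approximates the bias. The function names `lwr_fit` and `calibrate_rewards`, the tricube kernel, and the `frac` bandwidth parameter are choices made here for illustration.

```python
import numpy as np


def lwr_fit(x, y, frac=0.5):
    """Locally weighted linear regression of y on x, evaluated at every x[i].

    Hypothetical helper: `frac` is the fraction of points used in each local
    fit, and a tricube kernel down-weights points far from x[i].
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))  # number of neighbours per local fit
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        h = np.sort(dist)[k - 1]  # bandwidth: distance to the k-th nearest point
        if h == 0:
            h = 1.0  # guard against many identical lengths
        w = np.maximum(1.0 - (dist / h) ** 3, 0.0) ** 3  # tricube weights
        sw = np.sqrt(w)
        # Weighted least squares for a local line, centred at x[i] so that
        # the intercept is the fitted value at x[i].
        design = np.stack([np.ones(n), x - x[i]], axis=1) * sw[:, None]
        coef, *_ = np.linalg.lstsq(design, y * sw, rcond=None)
        fitted[i] = coef[0]
    return fitted


def calibrate_rewards(rewards, lengths, frac=0.5):
    """Subtract the estimated length-dependent bias from the raw RM scores."""
    rewards = np.asarray(rewards, dtype=float)
    bias = lwr_fit(lengths, rewards, frac=frac)  # estimated bias per output
    return rewards - bias


if __name__ == "__main__":
    # Synthetic example: quality is independent of length, and the raw score
    # adds a length-dependent bias (all values here are made up).
    rng = np.random.default_rng(0)
    lengths = rng.uniform(50, 500, size=200)
    quality = rng.normal(size=200)
    raw = quality + 0.01 * lengths
    calibrated = calibrate_rewards(raw, lengths)
    print("corr(raw, length):       ", np.corrcoef(raw, lengths)[0, 1])
    print("corr(calibrated, length):", np.corrcoef(calibrated, lengths)[0, 1])
```

Because the estimate is computed from the scores already in hand, no extra data or retraining is needed. The calibrated scores are centred around zero at every length, so only their relative order is meaningful.
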
#### Experimental verification

The authors validate the proposed method in three experimental settings:

1. **Benchmark performance**: Across 33 reward models evaluated on the RewardBench dataset, calibration yields an average performance gain of 3.11.
2. **LLM evaluation**: On the AlpacaEval benchmark, 184 LLMs were ranked using 8 open-source reward models; the calibrated rankings align more closely with GPT-4 evaluations and human preferences.
3. **LLM alignment**: Across multiple LLM-RM combinations, using calibrated rewards in the RLHF process improves the Length-Controlled win rate.

These experiments indicate that post-hoc reward calibration is computationally efficient and can be generalised to other types of bias and to other reward models, providing a scalable and robust way to mitigate bias in LLM alignment.
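
The first two settings rely on two kinds of measurement: how often an RM scores the preferred response above the rejected one, and how closely an RM-induced ranking of models matches a reference ranking. The sketch below shows how such measurements might be computed before and after calibration. It is an illustration only: the function names are hypothetical, the toy numbers are made up, and the use of Spearman rank correlation for ranking agreement is an assumption made here, not a detail stated in this summary.

```python
import numpy as np


def pairwise_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of preference pairs where the chosen response scores higher."""
    chosen = np.asarray(chosen_rewards, dtype=float)
    rejected = np.asarray(rejected_rewards, dtype=float)
    return float(np.mean(chosen > rejected))


def spearman_rho(a, b):
    """Spearman rank correlation between two score lists (assumes no ties)."""
    rank_a = np.argsort(np.argsort(a))  # rank of each entry within a
    rank_b = np.argsort(np.argsort(b))  # rank of each entry within b
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


if __name__ == "__main__":
    # Toy values for illustration only.
    raw_chosen, raw_rejected = [0.2, 1.1, 0.7], [0.5, 0.9, 0.1]
    print("pairwise accuracy:", pairwise_accuracy(raw_chosen, raw_rejected))

    rm_model_scores = [0.31, 0.12, 0.58, 0.44]   # mean RM score per LLM
    reference_scores = [0.40, 0.10, 0.70, 0.35]  # e.g. a reference win rate
    print("rank agreement:", spearman_rho(rm_model_scores, reference_scores))
```

Running the same functions on raw and calibrated scores gives a direct before/after comparison for a given reward model.
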

Post-hoc Reward Calibration: A Case Study on Length Bias

Taming Overconfidence in LLMs: Reward Calibration in RLHF

RMB: Comprehensively Benchmarking Reward Models in LLM Alignment

The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models

Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards

Rethinking the Role of Proxy Rewards in Language Model Alignment

Confronting Reward Model Overoptimization with Constrained RLHF

How to Evaluate Reward Models for RLHF

Elephant in the Room: Unveiling the Impact of Reward Model Quality in Alignment

On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization

Optimizing Language Models with Fair and Stable Reward Composition in Reinforcement Learning

RewardBench: Evaluating Reward Models for Language Modeling

Prior Constraints-based Reward Model Training for Aligning Large Language Models

On Diversified Preferences of Large Language Model Alignment

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

Loose Lips Sink Ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback

The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback

Language Model Alignment with Elastic Reset