Abstract: Learning from human feedback via proxy reward modeling has been studied to align Large Language Models (LLMs) with human values. However, achieving reliable training through that proxy reward model (RM) is not a trivial problem, and its behavior remained as a black-box. In this paper, we study the role of proxy rewards in the LLM alignment via 'reverse reward engineering' by composing interpretable features as a white-box reward function. We aim to replicate the ground truth (gold) reward signal by achieving a monotonic relationship between the proxy and gold reward signals after training the model using the proxy reward in reinforcement learning (RL). Our findings indicate that successfully emulating the gold reward requires generating responses that are relevant with enough length to open-ended questions, while also ensuring response consistency in closed-ended questions. Furthermore, resulting models optimizing our devised white-box reward show competitive performances with strong open-source RMs in alignment benchmarks. We highlight its potential usage as a simple but strong reward baseline for the LLM alignment, not requiring explicit human feedback dataset and RM training. Our code is available at https://github.com/naver-ai/rethinking-proxy-reward.
What problem does this paper attempt to address?
The paper asks how to align large language models (LLMs) with human values more reliably when reinforcement learning from human feedback (RLHF) is driven by a proxy reward model (RM). Specifically, it applies "reverse reward engineering": interpretable features are composed into a white-box reward function so that the proxy reward signal keeps a monotonic relationship with the true (gold) reward signal. If that relationship holds, maximizing the proxy reward during training also increases the true reward at test time, which improves the alignment of the model with human values.
### Paper Background
1. **Limitations of Proxy Reward Models**:
- Proxy reward models (RMs) trained on human feedback are not fully reliable in practice, and they behave like black boxes that are difficult to interpret.
- Optimizing against a proxy reward model can lead to over-optimization: the proxy reward keeps increasing while the true reward saturates or even decreases.
2. **Shortcomings of Existing Research**:
- Human feedback itself has limitations. For example, preference judgments depend on the annotator's prior knowledge and are easily swayed by surface features (such as simplicity and concreteness).
- Policy models are likely to find unwanted shortcuts in imperfect black-box RMs, a behavior known as "reward hacking".
### Paper Methods
1. **Reverse Reward Engineering**:
- Replace traditional black-box RMs with white-box reward functions built from interpretable features, such as response length, relevance, and a repetition penalty.
- The goal is to establish a monotonic relationship between proxy reward signals and true reward signals during RL training.
2. **Experimental Design**:
- Use Proximal Policy Optimization (PPO) for RL training.
- Evaluation metrics include the monotonic relationship between proxy rewards and true rewards, the performance of the model in alignment benchmark tests, etc.
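To make the idea of a white-box reward concrete, the following is a minimal sketch that multiplies the three features named above. It is an illustration under stated assumptions, not the paper's reward: the function names, the linear length cap, the n-gram uniqueness ratio, and the word-overlap relevance score are all hypothetical stand-ins, and the multiplicative combination is only one plausible choice.

```python
def length_incentive(response: str, max_tokens: int = 200) -> float:
    """Assumed form: grows linearly with whitespace-token count, capped at 1.0."""
    n_tokens = len(response.split())
    return min(n_tokens, max_tokens) / max_tokens


def repetition_penalty(response: str, n: int = 3) -> float:
    """Assumed form: ratio of unique n-grams to all n-grams (1.0 = no repetition)."""
    tokens = response.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0  # too short to contain any n-gram, so nothing to penalize
    return len(set(ngrams)) / len(ngrams)


def relevance(query: str, response: str) -> float:
    """Toy lexical stand-in: fraction of query words that appear in the response."""
    query_words = set(query.lower().split())
    response_words = set(response.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & response_words) / len(query_words)


def white_box_reward(query: str, response: str, max_tokens: int = 200) -> float:
    """Hypothetical combination: product of the three interpretable features."""
    return (
        length_incentive(response, max_tokens)
        * repetition_penalty(response)
        * relevance(query, response)
    )
```

Because every factor lies in [0, 1] and can be inspected separately, a low reward can be traced back to the feature that caused it, which is the property a black-box RM lacks. A real system would likely replace the word-overlap score with a learned retriever or embedding similarity.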
### Main Contributions
1. **Effectiveness of Reverse Reward Engineering**:
- The white-box reward function designed through reverse reward engineering improves alignment, especially by producing sufficiently long, relevant responses to open-ended questions and consistent responses to closed-ended questions.
2. **Importance of Reward Branching**:
- Adjusting the reward function according to the query type (open-ended or closed-ended) can further improve the alignment of the model while reducing unnecessary verbosity.
3. **Potential Applications**:
- The designed white-box reward function can serve as a simple yet strong reward baseline that requires neither an explicit human-feedback dataset nor RM training.
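Reward branching, as described in the contributions above, can be sketched as a single function that switches its scoring rule on the query type. This is a rough illustration only: the `is_open_ended` flag, the `reference` answer argument, and the Jaccard word overlap used for both relevance and consistency are assumptions made for the example, not details confirmed by the paper.

```python
from typing import Optional


def token_overlap(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def branched_reward(
    query: str,
    response: str,
    is_open_ended: bool,
    reference: Optional[str] = None,
    max_tokens: int = 200,
) -> float:
    """Hypothetical branching rule keyed on the query type."""
    if is_open_ended:
        # Open-ended branch: reward length and relevance to the query.
        length = min(len(response.split()), max_tokens) / max_tokens
        return length * token_overlap(query, response)
    # Closed-ended branch: no length incentive, only agreement with a
    # reference answer (used here as a stand-in for "consistency").
    if reference is None:
        raise ValueError("closed-ended queries need a reference answer")
    return token_overlap(reference, response)
```

With this split, a padded answer to a closed-ended question scores lower than a short answer that matches the reference, which is one way branching could discourage unnecessary verbosity.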
### Experimental Results
1. **Verification of Monotonic Relationship**:
- The experimental results show that the reward function (RER) that combines length incentives, repetition penalties, and query relevance can maintain a monotonic relationship between proxy rewards and true rewards in multiple evaluations.
2. **Improvement of Alignment Effect**:
- The PPO model optimized for RER performs comparably to, or even better than, models trained with strong open-source RMs on multiple alignment benchmarks.
3. **Minimization of Alignment Tax**:
- Reward branching and relevance features help to minimize the alignment tax: preference scores improve while the performance degradation on other NLP tasks is reduced.
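One common way to quantify whether proxy and gold rewards move together monotonically is Spearman's rank correlation computed across training checkpoints. The sketch below is a dependency-free version of that statistic; treating it as the paper's exact metric is an assumption, and the function names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(proxy: Sequence[float], gold: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(proxy) != len(gold) or len(proxy) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(proxy), average_ranks(gold)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant sequence carries no rank information
    return cov / (std_x * std_y)
```

A value near 1 means the gold reward rises whenever the proxy reward does. Over-optimization, where the gold reward drops at later checkpoints while the proxy keeps climbing, pulls the value below 1.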
### Conclusion
This paper designs a white-box reward function through reverse reward engineering, addresses the reliability and interpretability problems of proxy reward models in RLHF, and offers a simple reward baseline for aligning large language models.