Abstract: Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. A core challenge in employing RLHF lies in training a reliable reward model (RM), which relies on high-quality labels typically provided by human experts or advanced AI systems. These methods can be costly and may introduce biases that affect the language model's responses. As language models improve, human input may become less effective in further enhancing their performance. In this paper, we propose Self-Evolved Reward Learning (SER), a novel approach where the RM generates additional training data to iteratively improve itself. We conducted extensive experiments on multiple datasets such as HH-RLHF and UltraFeedback, using models like Mistral and Llama 3, and compared SER against various baselines. Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance, thereby boosting the capabilities of large language models (LLMs).

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper addresses challenges that large language models (LLMs) face in reinforcement learning from human feedback (RLHF). Specifically, it focuses on the following core issues:

1. **Dependence on a large amount of high-quality human-annotated data**:
   - Traditional RLHF methods require a large amount of high-quality human preference data to train the reward model (RM). This data is usually provided by experts or generated by advanced AI systems. Obtaining it is expensive and time-consuming, and it may introduce biases that degrade the quality of the language model's responses.
   - As the language model improves, human input may become less effective, further limiting the room for improvement.
2. **Reducing the dependence on human-annotated data**:
   - To overcome these problems, the paper proposes a new method, Self-Evolved Reward Learning (SER), in which the RM generates additional training data and uses it to improve itself iteratively.
   - With this method, even when human-annotated data is limited, self-feedback can robustly improve the RM and, in turn, enhance the capabilities of large language models.

### Main contributions of the paper

1. **Proposing a new self-evolving reward learning framework**:
   - Using only 15% of the human-annotated data as seed data, it achieves performance comparable to a model trained on the complete human-annotated dataset, significantly reducing the dependence on human data.
2. **Showing the broad impact of the self-learning paradigm in LLMs**:
   - In particular, it offers new insights into improving reinforcement learning by strengthening the RM.
3. **Extensive experimental verification**:
   - The experimental results show that the self-evolving reward learning framework consistently improves performance across various LLMs, model sizes, and datasets, with an average improvement of 7.88%. After multiple iterations, the converged performance can match or even exceed that of a model trained on the full human-annotated dataset.

### Method overview

1. **Self-annotation**:
   - In the initial stage, the RM is trained on a small amount of human-annotated data and then labels the unannotated data itself.
2. **Identifying the learning state and selecting high-confidence data**:
   - Evaluate whether the RM is currently learning to distinguish good answers from bad ones or to magnify the differences between similar answers, then select high-confidence data according to that learning state.
3. **Retraining the RM**:
   - Retrain the RM on the filtered high-confidence data with a pairwise loss function, gradually improving its understanding of answer quality.
4. **Training the LLM through RL**:
   - Use the self-evolved RM to guide LLM training, optimizing the LLM's policy with a modified PPO algorithm.

### Experimental results

- **Performance improvement**:
  - Across multiple datasets and models of different parameter scales, SER significantly improves performance, with an average improvement of 7.88%.
  - In some cases, SER even exceeds a model trained on the complete human-annotated data.
- **Performance in data-rich scenarios**:
  - In data-rich scenarios (such as Stack Overflow), the improvement from SER is smaller but still present. This suggests that even when data is plentiful, self-evolution exposes the model to a broader data distribution and raises the performance ceiling further.

### Conclusion

By proposing the Self-Evolved Reward Learning (SER) method, the paper reduces RLHF's reliance on large amounts of high-quality human-annotated data and significantly improves the performance of large language models. The method performs well across a variety of datasets and models and has broad application prospects.
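
The self-annotation, filtering, and retraining steps in the method overview can be illustrated with a minimal Python sketch. It is not the paper's implementation. The Bradley-Terry form of the pairwise loss, the fixed 0.8 confidence threshold, and the names `self_label`, `pairwise_loss`, and `toy_reward` are assumptions introduced here for illustration. The summary states only that high-confidence data is selected according to the RM's learning state and that retraining uses a pairwise loss.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class LabeledPair:
    prompt: str
    chosen: str
    rejected: str
    confidence: float  # probability the current RM assigns to its own label


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    # Assumed Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)),
    # computed as softplus(-margin) for numerical stability.
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def self_label(
    reward_fn: Callable[[str, str], float],
    unlabeled: List[Tuple[str, str, str]],
    threshold: float = 0.8,  # hypothetical cutoff, not taken from the paper
) -> List[LabeledPair]:
    # The RM scores both answers and keeps only pairs it labels confidently.
    kept = []
    for prompt, answer_a, answer_b in unlabeled:
        p_a = sigmoid(reward_fn(prompt, answer_a) - reward_fn(prompt, answer_b))
        confidence = max(p_a, 1.0 - p_a)
        if confidence < threshold:
            continue  # too uncertain to be used as a training label
        if p_a >= 0.5:
            kept.append(LabeledPair(prompt, answer_a, answer_b, confidence))
        else:
            kept.append(LabeledPair(prompt, answer_b, answer_a, confidence))
    return kept


def toy_reward(prompt: str, answer: str) -> float:
    # Stand-in for a trained RM: longer answers score higher.
    return 0.5 * len(answer.split())


pairs = [
    ("q", "a detailed helpful answer here", "no"),  # clear preference
    ("q", "yes ok", "no way"),                      # tie, filtered out
]
kept = self_label(toy_reward, pairs)

# Retraining would minimize this loss over the kept pairs.
batch_loss = sum(
    pairwise_loss(toy_reward(p.prompt, p.chosen), toy_reward(p.prompt, p.rejected))
    for p in kept
) / max(len(kept), 1)
```

In a real pipeline `toy_reward` would be the current RM, and the filter, retrain, and relabel steps would repeat until performance converges, after which the RM guides the RL stage.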

Self-Evolved Reward Learning for LLMs
