Fine-Tuning Language Models with Reward Learning on Policy

Hao Lang, Fei Huang, Yongbin Li

2024-03-28

Abstract: Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences. RLHF contains three steps, i.e., human preference collecting, reward learning, and policy optimization, which are usually performed serially. Despite its popularity, however, (fixed) reward models may become inaccurate off-distribution, since policy optimization continuously shifts LLMs' data distribution. Repeatedly collecting new preference data from the latest LLMs may alleviate this issue, which unfortunately makes the resulting system more complicated and difficult to optimize. In this paper, we propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution. Specifically, an unsupervised multi-view learning method is introduced to learn robust representations of policy samples. Meanwhile, a synthetic preference generation approach is developed to simulate high-quality preference data with policy outputs. Extensive experiments on three benchmark datasets show that RLP consistently outperforms the state-of-the-art. Our code is available at \url{

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses a weakness of reinforcement learning from human feedback (RLHF), the standard approach for aligning large language models (LLMs) with human preferences. RLHF consists of three steps, usually performed serially: collecting human preference data, reward learning, and policy optimization. Because policy optimization continuously shifts the distribution of the LLM's outputs, a fixed reward model trained on the original preference data ends up scoring samples unlike those it was trained on and can become inaccurate. To address this issue, the paper proposes Reward Learning on Policy (RLP), an unsupervised framework that refines the reward model using samples from the current policy so that it stays on-distribution. RLP includes two methods: Unsupervised Multi-view Learning (UML), which learns robust representations of policy samples, and Synthetic Preference Generation (SPG), which simulates high-quality preference data from policy outputs. Experiments on three benchmark datasets show that RLP consistently outperforms the compared state-of-the-art methods.
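To make the reward learning step concrete, here is a minimal sketch of a pairwise preference loss. It assumes the Bradley-Terry style logistic loss that is commonly used to train RLHF reward models; the summary above does not state which loss the paper uses, and the function names and example scores are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(r_chosen - r_rejected)).

    This is the negative log-likelihood that the preferred response wins
    under a Bradley-Terry model (an assumption, not taken from the paper).
    """
    margin = reward_chosen - reward_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(score_pairs):
    """Average the pairwise loss over (chosen_score, rejected_score) tuples."""
    return sum(pairwise_preference_loss(c, r) for c, r in score_pairs) / len(score_pairs)


# Hypothetical reward-model scores for two preference pairs.
correct_ranking = pairwise_preference_loss(2.0, 0.0)  # chosen scored higher
wrong_ranking = pairwise_preference_loss(0.0, 2.0)    # chosen scored lower
batch_loss = mean_preference_loss([(2.0, 0.0), (0.0, 2.0)])
```

The loss is small when the reward model ranks the preferred response higher and large when it ranks the pair the wrong way round. A reward model fitted this way is only reliable on responses similar to its training pairs, which is why a shifting policy distribution is a problem for a fixed reward model.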

Optimizing Language Models with Fair and Stable Reward Composition in Reinforcement Learning

Fine-Tuning Language Models with Advantage-Induced Policy Alignment

Personalized Language Modeling from Personalized Human Feedback

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

Direct Alignment of Language Models via Quality-Aware Self-Refinement

Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs

Confronting Reward Model Overoptimization with Constrained RLHF

Fine-Tuning Language Models from Human Preferences

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards

Secrets of RLHF in Large Language Models Part II: Reward Modeling

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both

Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation

R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback

Optimal Design for Reward Modeling in RLHF