Fine-Tuning Language Models with Reward Learning on Policy

Hao Lang, Fei Huang, Yongbin Li
2024-03-28
Abstract: Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences. RLHF contains three steps, i.e., human preference collecting, reward learning, and policy optimization, which are usually performed serially. Despite its popularity, however, (fixed) reward models may become inaccurate off-distribution, since policy optimization continuously shifts LLMs' data distribution. Repeatedly collecting new preference data from the latest LLMs may alleviate this issue, but it unfortunately makes the resulting system more complicated and difficult to optimize. In this paper, we propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution. Specifically, an unsupervised multi-view learning method is introduced to learn robust representations of policy samples. Meanwhile, a synthetic preference generation approach is developed to simulate high-quality preference data with policy outputs. Extensive experiments on three benchmark datasets show that RLP consistently outperforms the state-of-the-art. Our code is available at \url{
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses a weakness of reinforcement learning from human feedback (RLHF), the standard approach to aligning large language models (LLMs) with human preferences. RLHF consists of three steps that are usually performed serially: collecting human preferences, reward learning, and policy optimization. Because policy optimization continuously shifts the distribution of the LLM's outputs, a fixed reward model trained on earlier data can become inaccurate on the samples it is later asked to score. To address this issue, the paper proposes Reward Learning on Policy (RLP), an unsupervised framework that refines the reward model using policy samples to keep it on-distribution. RLP includes two methods: Unsupervised Multi-view Learning (UML), which learns robust representations of policy samples, and Synthetic Preference Generation (SPG), which simulates high-quality preference data from policy outputs. The experiments on three benchmark datasets are reported to show that RLP consistently outperforms existing state-of-the-art methods.
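
For readers unfamiliar with the reward learning step, the sketch below shows the pairwise (Bradley-Terry style) preference loss that is commonly used to train reward models in RLHF pipelines. It is an illustration only: the text above does not state which loss RLP uses, and the function names `pairwise_reward_loss` and `mean_pairwise_loss` are hypothetical, not taken from the paper or its code.

```python
import math


def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Assumption (not from the paper): preferences follow a Bradley-Terry model,
    so P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pairwise_loss(pairs: list[tuple[float, float]]) -> float:
    """Average the loss over (chosen_score, rejected_score) pairs."""
    return sum(pairwise_reward_loss(c, r) for c, r in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical reward scores for three preference pairs.
    scores = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
    print(mean_pairwise_loss(scores))
```

The loss is small when the reward model scores the preferred response well above the rejected one and grows when the ordering is reversed. A reward model trained this way is only reliable on responses resembling its training pairs, which is the off-distribution problem described above.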