Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback

Jiayi Zhou, Jiaming Ji, Juntao Dai, Yaodong Yang

2024-08-31

Abstract: Aligning the behavior of large language models (LLMs) with human intentions and values remains a critical challenge. Reinforcement learning from human feedback (RLHF) aligns LLMs by training a reward model (RM) on human preferences and fine-tuning the LLMs to maximize RM feedback. Despite its effectiveness and popularity, RLHF is prone to biased local optimization: the RM fails to provide feedback that accurately aligns with human preferences, causing LLMs to explore unexpected generalizations and fail to achieve alignment objectives. To mitigate this issue, we propose a novel sequence-to-sequence (seq2seq) reward modeling method. Its key insight is that learning from language feedback rather than scalar feedback improves RLHF without additional annotations. We replace the reward modeling target, binary maximum likelihood estimation (MLE), with sequence MLE. This method enables richer and more fine-grained language feedback without additional annotations, models, or training stages. Our experiments demonstrate its effectiveness, specifically in reducing the refusal-to-response paradigm in single-turn safety dialogues and the long-response bias in text summarization tasks. Further analysis shows that seq2seq RM improves RLHF performance across 2B and 7B LLMs on 3 NLP tasks, achieving an average win rate of 76.9%. We also show that seq2seq RM can still improve the performance of RLHF under out-of-distribution prompts.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses how to provide richer reward feedback in order to improve alignment in Reinforcement Learning from Human Feedback (RLHF). Traditional RLHF uses a scalar reward model (seq2scalar RM), which provides only coarse-grained scalar feedback and can push the policy toward unexpected generalizations during optimization, such as refusing to answer or generating overly long responses. These behaviors impede the alignment of large language models (LLMs) with human intentions and values. To mitigate them, the paper proposes a sequence-to-sequence (seq2seq) reward modeling method, which learns from language feedback rather than relying solely on scalar feedback and requires no additional annotations, models, or training stages.

The main contributions of the paper include:

1. **Proposing seq2seq reward modeling**: The reward modeling target is changed from binary maximum likelihood estimation (MLE) to sequence MLE, so that the RM supplies richer, more fine-grained language feedback in place of a single scalar.
2. **Experimental verification**: seq2seq RM reduces the unexpected behaviors of RLHF (the refusal-to-response paradigm in single-turn safety dialogues and the long-response bias in text summarization) and improves alignment performance. Across 2B and 7B LLMs on three NLP tasks, it achieves an average win rate of 76.9%.
3. **Robustness**: seq2seq RM still improves RLHF performance under out-of-distribution (OOD) prompts.

Through these improvements, the paper aims to provide a more effective method for aligning large language models with human intentions and values, thereby improving the safety and practicality of the models.
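The difference between a binary MLE target and a sequence MLE target can be illustrated with a minimal sketch. The text above does not give the exact objectives, so this assumes the common forms: a pairwise Bradley-Terry style loss on two scalar rewards for the binary case, and a per-token negative log-likelihood for the sequence case. The function names and the toy inputs are hypothetical and are not taken from the paper's code.

```python
import math


def binary_mle_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise negative log-likelihood on two scalar rewards.

    Assumed form: -log(sigmoid(reward_chosen - reward_rejected)).
    One number summarizes the whole response pair.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def per_token_nll(token_probs: list[float]) -> list[float]:
    """Negative log-likelihood of each target token (hypothetical helper).

    token_probs[i] is the probability the model assigns to the i-th
    token of the target sequence.
    """
    return [-math.log(p) for p in token_probs]


def sequence_mle_loss(token_probs: list[float]) -> float:
    """Mean per-token negative log-likelihood of a target sequence."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    losses = per_token_nll(token_probs)
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Scalar case: a single loss value for the preference pair.
    print(binary_mle_loss(reward_chosen=1.2, reward_rejected=0.3))
    # Sequence case: one loss term per token, then their mean.
    print(per_token_nll([0.9, 0.6, 0.1]))
    print(sequence_mle_loss([0.9, 0.6, 0.1]))
```

In this sketch the binary loss yields one value per preference pair, while the sequence loss is built from one term per token, which is the sense in which a sequence target can carry finer-grained signal. How the paper converts the seq2seq RM output into rewards for RLHF is not described above and is not modeled here.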

Self-Evolved Reward Learning for LLMs

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

Secrets of RLHF in Large Language Models Part II: Reward Modeling

R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback

SLiC-HF: Sequence Likelihood Calibration with Human Feedback

Reward-Robust RLHF in LLMs

MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

Personalized Language Modeling from Personalized Human Feedback

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

Learning Goal-Conditioned Representations for Language Reward Models

Prototypical Reward Network for Data-Efficient RLHF

The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback

Navigating Noisy Feedback: Enhancing Reinforcement Learning with Error-Prone Language Models

Data-Efficient Alignment of Large Language Models with Human Feedback Through Natural Language

It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF

LIRE: listwise reward enhancement for preference alignment

Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

Deep Reinforcement Learning For Sequence to Sequence Models