Abstract: Reinforcement learning is used to align language models with human preference signals after the model has first been pre-trained, using likelihood maximization, to predict the next token of text within a large corpus. Before being deployed in a specific domain, models are often further fine-tuned on task-specific data. Since human preferences are often unavailable for this last step, it is typically performed using likelihood maximization, the default method. However, reinforcement learning has advantages beyond facilitating alignment to a human-derived reward function. For one, whereas likelihood maximization is a form of imitation learning in which the model is trained on what to do under ideal conditions, reinforcement learning is not limited to demonstrating actions for optimally reached states; it teaches a model what to do under a range of scenarios as it explores the policy space. In addition, it teaches a model what not to do, suppressing competitive but poor actions. This work develops a framework for last-mile fine-tuning using reinforcement learning and tests whether it yields performance gains. The experiments center on abstractive summarization, but the framework is general and broadly applicable. The procedure produced significantly better results than likelihood maximization when raw predictions were compared. For the specific data tested, the gap could be bridged by post-processing the maximum likelihood outputs. Nonetheless, the framework offers a new avenue for model optimization in situations where post-processing may be less straightforward or effective, and it can be extended to penalize and train against more complex classes of undesirable outputs, such as hallucinations.

Leftover Lunch: Advantage-based Offline Reinforcement Learning for Language Models

Fine-Tuning Language Models with Reward Learning on Policy

SAIL: Self-Improving Efficient Online Alignment of Large Language Models

Dual Active Learning for Reinforcement Learning from Human Feedback

Offline Regularised Reinforcement Learning for Large Language Models Alignment

Self-Play with Adversarial Critic: Provable and Scalable Offline Alignment for Language Models

Fine-Tuning Language Models with Advantage-Induced Policy Alignment

Reward Modeling with Weak Supervision for Language Models

Reinforcement Learning without Human Feedback for Last Mile Fine-Tuning of Large Language Models

Data-Efficient Alignment of Large Language Models with Human Feedback Through Natural Language

Aligning Language Models with Offline Learning from Human Feedback

Reward Difference Optimization For Sample Reweighting In Offline RLHF

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback with Active Queries

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging

Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards

RLHF Workflow: From Reward Modeling to Online RLHF

Online Self-Preferring Language Models

Progressively Label Enhancement for Large Language Model Alignment