Leftover Lunch: Advantage-based Offline Reinforcement Learning for Language Models

Ashutosh Baheti, Ximing Lu, Faeze Brahman, Ronan Le Bras, Maarten Sap, Mark Riedl
2024-04-20
Abstract: Reinforcement Learning with Human Feedback (RLHF) is the most prominent method for Language Model (LM) alignment. However, RLHF is an unstable and data-hungry process that continually requires new high-quality LM-generated data for finetuning. We introduce Advantage-Leftover Lunch RL (A-LoL), a new class of offline policy gradient algorithms that enable RL training on any pre-existing data. By treating the entire LM output sequence as a single action, A-LoL can incorporate sequence-level classifiers or human-designed scoring functions as rewards. Then, using the LM's value estimate, A-LoL trains only on data points with positive advantage (the "leftover" data), which makes it resilient to noise. Overall, A-LoL is an easy-to-implement, sample-efficient, and stable LM training recipe.
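As a rough illustration of the positive-advantage filtering the abstract describes, the sketch below treats each full response as a single action, subtracts a value estimate from its sequence-level reward, and keeps only examples whose advantage is positive. The `Example` class, the field names, and all numeric values are hypothetical, chosen only for illustration; this is not the paper's training objective, only the data-selection idea under those assumptions.

```python
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    response: str
    reward: float  # sequence-level score for the whole response (assumed given)
    value: float   # hypothetical value estimate for the prompt


def advantage(ex: Example) -> float:
    # Advantage of the whole response, treated as one action.
    return ex.reward - ex.value


def keep_positive_advantage(data: list[Example]) -> list[Example]:
    # Discard examples whose reward does not exceed the value estimate.
    return [ex for ex in data if advantage(ex) > 0]


# Made-up data points, not taken from the paper.
data = [
    Example("p1", "good answer", reward=0.9, value=0.5),
    Example("p2", "poor answer", reward=0.2, value=0.6),
    Example("p3", "average answer", reward=0.5, value=0.5),
]

kept = keep_positive_advantage(data)
print([ex.prompt for ex in kept])  # ['p1']
```

In this toy data only `p1` survives: `p2` has a negative advantage and `p3` has an advantage of exactly zero, so neither would contribute to training under this filter.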
Computation and Language