Abstract: We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text generation as a sequential decision-making problem, reinforcement learning (RL) appears to be a natural conceptual framework. However, using RL for LM-based generation faces empirical challenges, including training instability due to the combinatorial action space, as well as a lack of open-source libraries and benchmarks customized for LM alignment. Thus, a question arises in the research community: is RL a practical paradigm for NLP? To help answer this, we first introduce an open-source modular library, RL4LMs (Reinforcement Learning for Language Models), for optimizing language generators with RL. The library consists of on-policy RL algorithms that can be used to train any encoder or encoder-decoder LM in the HuggingFace library (Wolf et al. 2020) with an arbitrary reward function. Next, we present the GRUE (General Reinforced-language Understanding Evaluation) benchmark, a set of 6 language generation tasks that are supervised not by target strings, but by reward functions that capture automated measures of human preference. GRUE is the first leaderboard-style evaluation of RL algorithms for NLP tasks. Finally, we introduce an easy-to-use, performant RL algorithm, NLPO (Natural Language Policy Optimization), that learns to effectively reduce the combinatorial action space in language generation. We show 1) that RL techniques are generally better than supervised methods at aligning LMs to human preferences; and 2) that NLPO exhibits greater stability and performance than previous policy gradient methods (e.g., PPO; Schulman et al. 2017), based on both automatic and human evaluations.

Natural Language Generation Using Reinforcement Learning with External Rewards

Deep Reinforcement Learning for Dialogue Generation

Deep Reinforcement Learning for NLP

Adaptive Natural Language Generation for Task-oriented Dialogue via Reinforcement Learning

Natural Language Reinforcement Learning

Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization

Learning to Generate Better Than Your LLM

Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation

Mapping Language to Programs using Multiple Reward Components with Inverse Reinforcement Learning

Reward Design with Language Models

Interactive Dialogue Agents via Reinforcement Learning on Hindsight Regenerations

ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation

NaRLE: Natural Language Models using Reinforcement Learning with Emotion Feedback

Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback

REvolve: Reward Evolution with Large Language Models using Human Feedback

Deep Reinforcement Learning For Sequence to Sequence Models

Offline RL for Natural Language Generation with Implicit Language Q Learning

Prompt-Based Length Controlled Generation with Reinforcement Learning

Improving a sequence-to-sequence nlp model using a reinforcement learning policy algorithm

Generative Reward Models

Fine-Tuning Language Models from Human Preferences