A Survey of Reinforcement Learning from Human Feedback

Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier

DOI: https://doi.org/10.48550/arXiv.2312.14925

2024-05-01

Abstract: Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of preference-based reinforcement learning (PbRL), it stands at the intersection of artificial intelligence and human-computer interaction. This positioning offers a promising avenue to enhance the performance and adaptability of intelligent systems while also improving the alignment of their objectives with human values. The training of large language models (LLMs) has impressively demonstrated this potential in recent years, where RLHF played a decisive role in directing the model's capabilities toward human objectives. This article provides a comprehensive overview of the fundamentals of RLHF, exploring the intricate dynamics between RL agents and human input. While recent focus has been on RLHF for LLMs, our survey adopts a broader perspective, examining the diverse applications and wide-ranging impact of the technique. We delve into the core principles that underpin RLHF, shedding light on the symbiotic relationship between algorithms and human feedback, and discuss the main research trends in the field. By synthesizing the current landscape of RLHF research, this article aims to provide researchers as well as practitioners with a comprehensive understanding of this rapidly growing field of research.

Machine Learning

What problem does this paper attempt to address?

The paper addresses the difficulty of designing reward functions in traditional reinforcement learning (RL). Traditional RL methods rely on predefined reward functions to guide the agent's learning, but in many practical applications such functions are very hard to specify. In complex settings such as home robot assistance or autonomous vehicle navigation, for example, it is difficult to define an appropriate reward function precisely. Moreover, even a reward function that seems reasonable at first may lead to unexpected behavior, because the agent may over-optimize certain parts of it and produce undesired outcomes, a phenomenon known as "reward hacking". To address these problems, the paper surveys reinforcement learning from human feedback (RLHF). By making human feedback part of the learning process, RLHF lets the agent learn goals that are better aligned with human values. This not only helps overcome the limitations of traditional RL methods, but also improves the alignment between the agent's objectives and human goals and supports the development of ethically sound and socially responsible artificial intelligence systems. The paper also discusses the origin, development history, and theoretical basis of RLHF, as well as its applications in multiple fields, including large language model (LLM) fine-tuning, image generation, continuous control, games, and robotics. In addition, it analyzes the main trends and technical challenges in current RLHF research and presents an outlook on future research directions.
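To make the idea of replacing an engineered reward with human feedback concrete, here is a minimal sketch of learning a reward model from pairwise preferences. The summary above does not say how feedback is turned into a reward; this sketch assumes a Bradley-Terry preference model, which is a common choice in preference-based RL, together with a linear reward over hypothetical feature vectors. All names (`reward`, `fit_reward`, `make_synthetic_preferences`) and the synthetic "human" labels are illustrative assumptions, not the paper's method.

```python
import math
import random


def reward(w, x):
    """Linear reward model r(x) = w . x over a hypothetical feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_reward(preferences, dim, lr=0.1, epochs=200):
    """Fit w by stochastic gradient ascent on the Bradley-Terry log-likelihood.

    preferences: list of (preferred, rejected) feature-vector pairs.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for win, lose in preferences:
            # Assumed model: P(win preferred) = sigmoid(r(win) - r(lose))
            p = sigmoid(reward(w, win) - reward(w, lose))
            # Gradient of log p with respect to w is (1 - p) * (win - lose)
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (win[i] - lose[i])
    return w


def make_synthetic_preferences(true_w, n, seed=0):
    """Simulate a noiseless annotator who prefers the higher true reward."""
    rng = random.Random(seed)
    prefs = []
    for _ in range(n):
        a = [rng.uniform(-1, 1) for _ in true_w]
        b = [rng.uniform(-1, 1) for _ in true_w]
        if reward(true_w, a) >= reward(true_w, b):
            prefs.append((a, b))
        else:
            prefs.append((b, a))
    return prefs


# Hidden "human" reward, used only to generate labels for this illustration.
true_w = [1.0, -2.0]
prefs = make_synthetic_preferences(true_w, 200)
w = fit_reward(prefs, dim=2)

# Fraction of training pairs that the learned reward ranks the same way.
accuracy = sum(reward(w, a) > reward(w, b) for a, b in prefs) / len(prefs)
```

In a full RLHF pipeline the learned reward would then be handed to a standard RL algorithm in place of a hand-written reward function. Because only comparisons are observed, the learned weights are identified up to scale at best, so `w` should be judged by how it ranks options rather than by its absolute values.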

The History and Risks of Reinforcement Learning and Human Feedback

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Mapping out the Space of Human Feedback for Reinforcement Learning: A Conceptual Framework

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation

RLHF Workflow: From Reward Modeling to Online RLHF

Human-in-the-Loop Reinforcement Learning: A Survey and Position on Requirements, Challenges, and Opportunities

RLHF-Blender: A Configurable Interactive Interface for Learning from Diverse Human Feedback

Sample Efficient Reinforcement Learning from Human Feedback via Active Exploration

Personalized Language Modeling from Personalized Human Feedback

Self-Evolved Reward Learning for LLMs

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons

Parameter Efficient Reinforcement Learning from Human Feedback

Multi-turn Reinforcement Learning from Preference Human Feedback

A Survey on Explainable Reinforcement Learning: Concepts, Algorithms, Challenges

Reinforcement Learning from AI Feedback A Review

Prototypical Reward Network for Data-Efficient RLHF

Secrets of RLHF in Large Language Models Part II: Reward Modeling