A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More

Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu, Xiang-Bo Mao, Sitaram Asur, Cheng

2024-07-23

Abstract: With advancements in self-supervised learning, the availability of trillions of tokens in a pre-training corpus, instruction fine-tuning, and the development of large Transformers with billions of parameters, large language models (LLMs) are now capable of generating factual and coherent responses to human queries. However, the mixed quality of training data can lead to the generation of undesired responses, presenting a significant challenge. Over the past two years, various methods have been proposed from different perspectives to enhance LLMs, particularly in aligning them with human expectations. Despite these efforts, there has not been a comprehensive survey paper that categorizes and details these approaches. In this work, we aim to address this gap by categorizing these papers into distinct topics and providing detailed explanations of each alignment method, thereby helping readers gain a thorough understanding of the current state of the field.

Computation and Language

What problem does this paper attempt to address?

The paper addresses the problem that content generated by large language models (LLMs) does not always align with human expectations. Although advances in self-supervised learning let LLMs generate coherent and factual responses, the mixed quality of training data can lead these models to produce content that conflicts with human values, such as instructions for illegal activities. Many alignment methods have been proposed, but no comprehensive survey has categorized and explained them. The paper aims to fill that gap by classifying existing alignment methods and explaining each in detail, giving readers a thorough view of the current state of LLM alignment. Specifically, it groups alignment techniques into four main categories: reward models, feedback mechanisms, reinforcement learning, and optimization methods. Each category is further divided into subcategories. In the reward models section, for example, the paper covers explicit versus implicit reward models, point reward models versus preference models, response-level versus token-level reward models, and methods for negative preference optimization. In this way, the paper systematically summarizes LLM alignment techniques and serves as a reference for researchers and practitioners.
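The distinction between explicit and implicit reward models mentioned above can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: it assumes the standard Bradley-Terry preference model for an explicit (point) reward and the usual DPO formulation, in which the implicit reward of a response is `beta` times the log-probability ratio between the policy and a frozen reference model. All function names and the default `beta` value are hypothetical choices made for this example.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred,
    given scalar scores from an explicit (point) reward model."""
    return sigmoid(reward_chosen - reward_rejected)


def implicit_reward(policy_logp: float, reference_logp: float, beta: float = 0.1) -> float:
    """DPO-style implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x)).
    Inputs are summed token log-probabilities of the whole response.
    (Assumed formulation; the prompt-dependent constant is omitted because
    it cancels when two responses to the same prompt are compared.)"""
    return beta * (policy_logp - reference_logp)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Negative log-likelihood of the observed preference under the
    Bradley-Terry model, with implicit rewards in place of a trained
    reward model."""
    r_chosen = implicit_reward(policy_logp_chosen, ref_logp_chosen, beta)
    r_rejected = implicit_reward(policy_logp_rejected, ref_logp_rejected, beta)
    return -math.log(preference_probability(r_chosen, r_rejected))
```

In this sketch an explicit reward model would supply `reward_chosen` and `reward_rejected` directly, whereas the implicit variant derives them from policy and reference log-probabilities, so no separate reward model needs to be trained.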

Aligning Large Language Models with Human: A Survey

Large Language Model Alignment: A Survey

Towards Scalable Automated Alignment of LLMs: A Survey

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

Human-Instruction-Free LLM Self-Alignment with Limited Samples

Aligning Large Language Models via Fine-grained Supervision

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

A Survey on Human-Centric LLMs

Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates

Aligners: Decoupling LLMs and Alignment

Towards a Unified View of Preference Learning for Large Language Models: A Survey

Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges

Exploring the Nexus of Large Language Models and Legal Systems: A Short Survey

A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness

Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods

Alignment at Pre-training! Towards Native Alignment for Arabic LLMs

A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers

AlignBench: Benchmarking Chinese Alignment of Large Language Models

PURE: Aligning LLM Via Pluggable Query Reformulation for Enhanced Helpfulness