A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More

Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu, Xiang-Bo Mao, Sitaram Asur, Cheng
2024-07-23
Abstract: With advancements in self-supervised learning, the availability of trillions of tokens in a pre-training corpus, instruction fine-tuning, and the development of large Transformers with billions of parameters, large language models (LLMs) are now capable of generating factual and coherent responses to human queries. However, the mixed quality of training data can lead to the generation of undesired responses, presenting a significant challenge. Over the past two years, various methods have been proposed from different perspectives to enhance LLMs, particularly in aligning them with human expectations. Despite these efforts, there has not been a comprehensive survey paper that categorizes and details these approaches. In this work, we aim to address this gap by categorizing these papers into distinct topics and providing detailed explanations of each alignment method, thereby helping readers gain a thorough understanding of the current state of the field.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the problem that content generated by large language models (LLMs) does not always align with human expectations. Although advances in self-supervised learning let LLMs produce coherent and factual responses, the mixed quality of training data can lead them to generate content that conflicts with human values, for example instructions for illegal activities. The paper aims to fill the gap left by the absence of a comprehensive survey: it classifies existing alignment methods and explains each in detail, so that readers can understand the current state of LLM alignment techniques. Specifically, it groups alignment techniques into four main categories (reward models, feedback mechanisms, reinforcement learning, and optimization methods) and divides each category into subcategories. In the reward models section, for example, it compares explicit and implicit reward models, point reward models and preference models, and response-level and token-level reward models, and it covers methods for negative preference optimization. In this way, the paper systematically summarizes LLM alignment techniques and serves as a reference for researchers and practitioners.
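The distinctions named in the reward models category can be made concrete with a small sketch. The summary above gives no formulas, so the code below is an illustration under stated assumptions rather than the survey's own formulation: it assumes the commonly used Bradley-Terry model for turning two point-wise rewards into a preference probability, and the DPO-style implicit reward, where the reward is read off the policy's log-probability ratio against a reference model instead of coming from a separately trained reward model. All function names and the default `beta` value are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, used to map a reward margin to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Preference view built on point rewards (assumed Bradley-Terry model).

    An explicit point reward model scores each response independently;
    the probability that the first response is preferred depends only
    on the difference between the two scores.
    """
    return sigmoid(reward_chosen - reward_rejected)


def implicit_reward(logp_policy: float, logp_reference: float, beta: float = 0.1) -> float:
    """DPO-style implicit reward: beta * log(pi(y|x) / pi_ref(y|x)).

    No separate reward model is trained; the reward is implied by how far
    the policy's log-probability moved away from the reference model's.
    """
    return beta * (logp_policy - logp_reference)


def dpo_loss(
    logp_policy_chosen: float,
    logp_reference_chosen: float,
    logp_policy_rejected: float,
    logp_reference_rejected: float,
    beta: float = 0.1,
) -> float:
    """Negative log-likelihood of the observed preference under implicit rewards."""
    margin = implicit_reward(logp_policy_chosen, logp_reference_chosen, beta) - implicit_reward(
        logp_policy_rejected, logp_reference_rejected, beta
    )
    return -math.log(sigmoid(margin))


if __name__ == "__main__":
    # Explicit point rewards: a higher score for the chosen response
    # gives a preference probability above one half.
    print(preference_probability(2.0, 0.0))

    # Implicit rewards: made-up sequence log-probabilities for one prompt.
    print(dpo_loss(-4.0, -5.0, -7.0, -7.0))
```

When the policy equals the reference model, both implicit rewards are zero and the loss equals log 2, the value for an uninformative preference; the loss falls as the policy assigns relatively more probability to the chosen response.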