Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint

Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, Tong Zhang
2024-05-01
Abstract: This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We first identify the primary challenges of existing popular methods like offline PPO and offline DPO as lacking in strategical exploration of the environment. Then, to understand the mathematical principle of RLHF, we consider a standard mathematical formulation, the reverse-KL regularized contextual bandit for RLHF. Despite its widespread practical application, a rigorous theoretical analysis of this formulation remains open. We investigate its behavior in three distinct settings -- offline, online, and hybrid -- and propose efficient algorithms with finite-sample theoretical guarantees. Moving towards practical applications, our framework, with a robust approximation of the information-theoretical policy improvement oracle, naturally gives rise to several novel RLHF algorithms. This includes an iterative version of the Direct Preference Optimization (DPO) algorithm for online settings, and a multi-step rejection sampling strategy for offline scenarios. Our empirical evaluations on real-world alignment experiment of large language model demonstrate that these proposed methods significantly surpass existing strong baselines, such as DPO and Rejection Sampling Optimization (RSO), showcasing the connections between solid theoretical foundations and their potent practical implementations.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper studies how to align generative models through Reinforcement Learning from Human Feedback (RLHF). It argues that popular existing methods such as offline PPO and offline DPO lack strategic exploration of the environment. To understand the mathematical principles of RLHF, the authors adopt a standard formulation, the reverse-KL regularized contextual bandit. Although this formulation is widely used in practice, a rigorous theoretical analysis of it has remained open. The paper analyzes three settings (offline, online, and hybrid) and proposes efficient algorithms with finite-sample theoretical guarantees. For offline learning, the authors use pessimistic estimation to improve efficiency. For online iterative learning, they propose a strategy based on batch exploration. On the practical side, the framework gives rise to an iterative version of Direct Preference Optimization (DPO) for online settings and a multi-step rejection sampling strategy for offline scenarios. In alignment experiments on large language models, the proposed methods significantly outperform strong baselines such as DPO and Rejection Sampling Optimization (RSO), which illustrates the connection between the theoretical foundation and its practical implementation. In summary, the paper addresses how to align generative models more effectively with RLHF, particularly when exploration is limited and the learned reward function is imperfect. It also explores how theoretical analysis can guide practical algorithm design, with the aim of reducing overfitting and reward hacking and improving model performance.
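To make the reverse-KL regularized contextual bandit named in the abstract more concrete, the following is a minimal sketch for a single prompt with a small discrete set of responses. It is not the authors' implementation. The names `kl_regularized_value`, `gibbs_policy`, `dpo_loss`, the KL coefficient `eta`, and the toy numbers are all assumptions chosen for illustration. The sketch relies on two standard facts: the KL-regularized objective is maximized by a policy proportional to `pi0 * exp(reward / eta)`, and the DPO loss is a logistic loss on the scaled difference of log-probability ratios between the preferred and the rejected response.

```python
import numpy as np

def kl_regularized_value(pi, pi0, reward, eta):
    """Expected reward minus eta * KL(pi || pi0) for one prompt."""
    pi = np.asarray(pi, dtype=float)
    pi0 = np.asarray(pi0, dtype=float)
    reward = np.asarray(reward, dtype=float)
    mask = pi > 0  # terms with pi(a) = 0 contribute nothing to the KL
    kl = np.sum(pi[mask] * np.log(pi[mask] / pi0[mask]))
    return float(np.dot(pi, reward) - eta * kl)

def gibbs_policy(pi0, reward, eta):
    """Closed-form maximizer: pi(a) proportional to pi0(a) * exp(reward(a) / eta)."""
    pi0 = np.asarray(pi0, dtype=float)
    reward = np.asarray(reward, dtype=float)
    logits = np.log(pi0) + reward / eta
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, eta):
    """DPO logistic loss for one pair in which response w is preferred over l."""
    margin = eta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(np.logaddexp(0.0, -margin))  # equals -log(sigmoid(margin))

# Toy example (hypothetical numbers): three candidate responses to one prompt.
pi0 = np.array([0.5, 0.3, 0.2])     # reference policy
reward = np.array([0.0, 1.0, 2.0])  # assumed true reward
eta = 1.0                           # assumed KL coefficient

pi_star = gibbs_policy(pi0, reward, eta)
best_value = kl_regularized_value(pi_star, pi0, reward, eta)
```

A smaller `eta` lets `pi_star` concentrate on the highest-reward response, while a larger `eta` keeps it close to `pi0`. In an iterative scheme of the kind the abstract describes, a loss like `dpo_loss` would be minimized repeatedly on newly collected preference pairs; the data collection and exploration steps are not modeled here.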