Direct Preference Optimization With Unobserved Preference Heterogeneity

Keertana Chidambaram, Karthik Vinay Seetharaman, Vasilis Syrgkanis
2024-05-24
Abstract: RLHF has emerged as a pivotal step in aligning language models with human objectives and values. It typically involves learning a reward model from human preference data and then using reinforcement learning to update the generative model accordingly. Conversely, Direct Preference Optimization (DPO) directly optimizes the generative model with preference data, skipping reinforcement learning. However, both RLHF and DPO assume uniform preferences, overlooking the reality of diverse human annotators. This paper presents a new method to align generative models with varied human preferences. We propose an Expectation-Maximization adaptation to DPO, generating a mixture of models based on latent preference types of the annotators. We then introduce a min-max regret ensemble learning model to produce a single generative method to minimize worst-case regret among annotator subgroups with similar latent factors. Our algorithms leverage the simplicity of DPO while accommodating diverse preferences. Experimental results validate the effectiveness of our approach in producing equitable generative policies.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to align language models when human preference data contains unobserved preference heterogeneity. Standard reinforcement learning from human feedback (RLHF) and Direct Preference Optimization (DPO) assume that all annotators share the same preferences, but in practice preferences vary, for example with demographic and cultural factors. The paper proposes two algorithms, Expectation-Maximization Direct Preference Optimization (EM-DPO) and MinMax Direct Preference Optimization (MinMax-DPO), that accommodate the diverse preferences of different population groups without relying on reinforcement learning. EM-DPO uses the expectation-maximization algorithm to jointly learn the distribution of annotator preference types and a policy for each type. MinMax-DPO then combines these per-type policies into a single model that minimizes the maximum regret across subgroups of annotators with similar latent factors. Together, the algorithms target a limitation of RLHF and DPO: because they assume uniform preferences, they can overlook minority preferences or favor the majority, which is unfair to minority groups. The goal is a fair policy that better represents the full annotator population. According to the reported experiments, the proposed algorithms outperform standard DPO at producing equitable policies and reduce the neglect of underrepresented groups, indicating that they handle heterogeneous preference data effectively.
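The two steps summarized above can be illustrated with a small, self-contained sketch. It is not the authors' implementation. It assumes the per-annotator log-likelihoods under each type's policy have already been computed, that the type shares are strictly positive, that there are only two policies, and that the regret of a mixture is linear in the mixture weights. The names `type_posteriors`, `update_prior`, and `minmax_weights` are hypothetical and chosen only for illustration.

```python
import math


def type_posteriors(log_liks, prior):
    """E-step sketch: posterior over latent types for each annotator.

    log_liks[i][k] is the log-likelihood of annotator i's preference data
    under the policy for type k (assumed precomputed). prior[k] is the
    current estimate of the share of type k (assumed strictly positive).
    """
    posteriors = []
    for row in log_liks:
        scores = [ll + math.log(p) for ll, p in zip(row, prior)]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        posteriors.append([w / total for w in weights])
    return posteriors


def update_prior(posteriors):
    """M-step sketch for the type shares: average posteriors over annotators."""
    n = len(posteriors)
    return [sum(col) / n for col in zip(*posteriors)]


def minmax_weights(regret, steps=100):
    """Grid search over mixtures of two policies for the smallest worst-case regret.

    regret[g][k] is a hypothetical regret of group g under policy k. The
    regret of a mixture is assumed to be linear in the mixture weights.
    """
    best_w, best_val = None, float("inf")
    for j in range(steps + 1):
        w = j / steps
        worst = max(w * r[0] + (1 - w) * r[1] for r in regret)
        if worst < best_val:
            best_w, best_val = (w, 1 - w), worst
    return best_w, best_val
```

In a full EM loop, the posteriors would also weight each annotator's contribution to a per-type DPO loss in the M-step; that part requires training a language model and is omitted here. The grid search stands in for whatever optimizer is used in practice, and with more than two policies a linear program or a gradient method over the simplex would be a more natural choice.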