Direct Preference Optimization With Unobserved Preference Heterogeneity

Keertana Chidambaram, Karthik Vinay Seetharaman, Vasilis Syrgkanis

2024-05-24

Abstract: RLHF has emerged as a pivotal step in aligning language models with human objectives and values. It typically involves learning a reward model from human preference data and then using reinforcement learning to update the generative model accordingly. Conversely, Direct Preference Optimization (DPO) directly optimizes the generative model with preference data, skipping reinforcement learning. However, both RLHF and DPO assume uniform preferences, overlooking the reality of diverse human annotators. This paper presents a new method to align generative models with varied human preferences. We propose an Expectation-Maximization adaptation to DPO, generating a mixture of models based on latent preference types of the annotators. We then introduce a min-max regret ensemble learning model to produce a single generative method to minimize worst-case regret among annotator subgroups with similar latent factors. Our algorithms leverage the simplicity of DPO while accommodating diverse preferences. Experimental results validate the effectiveness of our approach in producing equitable generative policies.

Machine Learning

What problem does this paper attempt to address?

This paper addresses how to align language models when human preference data contains unobserved preference heterogeneity. Standard reinforcement learning from human feedback (RLHF) and Direct Preference Optimization (DPO) assume that all annotators share the same preferences, but in practice preferences vary, for example with demographic and cultural factors. As a result, these methods can overlook minority preferences or favor the majority, which is unfair to underrepresented groups.

The paper proposes two algorithms, Expectation-Maximization Direct Preference Optimization (EM-DPO) and MinMax Direct Preference Optimization (MinMax-DPO), that accommodate diverse preferences without relying on reinforcement learning. EM-DPO uses the expectation-maximization algorithm to jointly learn the distribution of latent annotator preference types and a policy for each type. MinMax-DPO then combines these per-type policies into a single model that minimizes the maximum regret across subgroups of annotators with similar latent factors.

The goal is an equitable generative policy that better represents all annotator groups. The reported experiments indicate that the proposed algorithms produce fairer policies than standard DPO and reduce the neglect of underrepresented groups.
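The two steps described above can be illustrated with a small sketch. This is not the authors' implementation. It assumes that each annotator's preference pairs have already been scored by every per-type policy as a DPO margin (beta times the difference in policy-to-reference log-ratios between the chosen and rejected responses), and that an ensemble's regret on a group is linear in the ensemble weights. The function names (`e_step`, `update_mixing_weights`, `worst_case_regret`, `minmax_weights_two_policies`) are hypothetical, and the policy update itself (a posterior-weighted DPO step) is omitted.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def e_step(margins, mix):
    """Posterior over latent preference types for each annotator.

    margins[i][k] is the list of DPO margins for annotator i's preference
    pairs under the type-k policy (assumed precomputed). mix[k] is the
    current prior weight of type k. Returns posteriors[i][k].
    """
    posteriors = []
    for per_type in margins:
        # log prior + sum of per-pair DPO log-likelihoods for each type.
        log_post = [
            math.log(mix[k]) + sum(log_sigmoid(m) for m in per_type[k])
            for k in range(len(mix))
        ]
        top = max(log_post)  # subtract the max before exponentiating
        weights = [math.exp(lp - top) for lp in log_post]
        total = sum(weights)
        posteriors.append([w / total for w in weights])
    return posteriors


def update_mixing_weights(posteriors):
    # M-step for the type distribution: average posterior mass per type.
    n = len(posteriors)
    return [sum(p[k] for p in posteriors) / n for k in range(len(posteriors[0]))]


def worst_case_regret(weights, regret):
    # regret[g][m]: regret of policy m on group g (assumed given).
    # Assumption: the ensemble's regret is linear in its weights.
    return max(sum(w * r for w, r in zip(weights, row)) for row in regret)


def minmax_weights_two_policies(regret, steps=100):
    # Grid search over two-policy ensembles; a stand-in for a proper solver.
    candidates = [(s / steps, 1 - s / steps) for s in range(steps + 1)]
    return min(candidates, key=lambda w: worst_case_regret(w, regret))


# Two annotators with opposite preferences, two latent types.
margins = [[[2.0, 2.0], [-2.0, -2.0]], [[-2.0, -2.0], [2.0, 2.0]]]
posteriors = e_step(margins, [0.5, 0.5])
mixing = update_mixing_weights(posteriors)
ensemble = minmax_weights_two_policies([[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup each annotator is assigned almost entirely to the type whose policy agrees with their choices, and the min-max step balances the two policies so that neither group bears all of the regret. A grid search only works for two policies; with more latent types, a linear program or a no-regret solver would be the natural replacement.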

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Uncertainty-Penalized Direct Preference Optimization

New Desiderata for Direct Preference Optimization

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives

Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both

Policy Optimization in RLHF: The Impact of Out-of-preference Data

$α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences

Robust Preference Optimization through Reward Model Distillation

A General Theoretical Paradigm to Understand Learning from Human Preferences

Direct Preference Optimization with an Offset

Preference as Reward, Maximum Preference Optimization with Importance Sampling

Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF

Active Preference Optimization for Sample Efficient RLHF

Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective

Direct Preference-based Policy Optimization without Reward Modeling

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization