PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences

Daiwei Chen, Yi Chen, Aniket Rege, Ramya Korlakai Vinayak

2024-06-13

Abstract: Large foundation models pretrained on raw web-scale data are not readily deployable without an additional step of extensive alignment to human preferences. Such alignment is typically done by collecting large amounts of pairwise comparisons from humans ("Do you prefer output A or B?") and learning a reward model or a policy with the Bradley-Terry-Luce (BTL) model as a proxy for a human's underlying implicit preferences. These methods generally assume a universal preference shared by all humans and therefore lack the flexibility to adapt to a plurality of opinions and preferences. In this work, we propose PAL, a framework to model human preference complementary to existing pretraining strategies, which incorporates plurality from the ground up. We propose using the ideal point model as a lens to view alignment using preference comparisons. Together with our novel reformulation and mixture modeling, our framework captures the plurality of population preferences while simultaneously learning a common preference latent space across different preferences, which can few-shot generalize to new, unseen users. Our approach enables us to use the penultimate-layer representation of large foundation models and simple MLP layers to learn reward functions that are on par with existing large state-of-the-art reward models, thereby significantly improving the efficiency of reward modeling. We show that PAL achieves competitive reward model accuracy compared to strong baselines on 1) language models with the Summary dataset; 2) image generative models with the Pick-a-Pic dataset; 3) a new semisynthetic heterogeneous dataset generated using Anthropic Personas. Finally, our experiments also highlight a shortcoming of current preference datasets, which are created using rigid rubrics that wash away heterogeneity, and call for more nuanced data collection approaches.
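The contrast between the BTL model and the ideal point view can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: the logistic link, the squared Euclidean distance, and all names (`btl_prob`, `ideal_point_prob`, `user_1`, `user_2`) are assumptions chosen for the example.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def btl_prob(reward_a, reward_b):
    """BTL: P(A preferred over B) from scalar rewards shared by every user."""
    return sigmoid(reward_a - reward_b)

def sq_dist(x, y):
    """Squared Euclidean distance (an assumed choice of distance)."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def ideal_point_prob(z_a, z_b, user_point):
    """Ideal point view: A is more likely preferred when it lies closer
    to the user's ideal point in the latent space than B does."""
    return sigmoid(sq_dist(z_b, user_point) - sq_dist(z_a, user_point))

# Two items in a hypothetical 2-D latent space, and two users whose
# ideal points sit at opposite corners.
z_a, z_b = (1.0, 0.0), (0.0, 1.0)
user_1 = (1.0, 0.0)
user_2 = (0.0, 1.0)

p1 = ideal_point_prob(z_a, z_b, user_1)  # user_1 leans toward A
p2 = ideal_point_prob(z_a, z_b, user_2)  # user_2 leans toward B
```

Under BTL, a single pair of rewards yields one preference probability for everyone. In the ideal point sketch, the same two items receive opposite preferences from the two users, which is the kind of heterogeneity a universal reward cannot express.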

Machine Learning

What problem does this paper attempt to address?

This paper addresses how to learn from diverse user preferences and build an alignment framework that adapts to them. Large pretrained models cannot be deployed directly in real-world applications without extensive alignment to human preferences. That alignment typically relies on collecting large amounts of pairwise comparison data (e.g., "Do you prefer output A or B?") and then learning a reward model or policy with the Bradley-Terry-Luce (BTL) model. This approach usually assumes that all humans share a single universal preference, so it lacks the flexibility to accommodate differing opinions and preferences. The paper therefore proposes PAL (Pluralistic Alignment Framework), which incorporates plurality from the ground up and reframes alignment from preference comparisons through the lens of the ideal point model. Using mixture modeling, PAL captures the diversity of preferences within a population while learning a common latent preference space, and it can generalize to new, unseen users from only a few examples. According to the abstract, PAL achieves reward model accuracy competitive with strong baselines on language and image generation tasks while using only penultimate-layer representations of foundation models and simple MLP layers, which makes reward modeling considerably more efficient.
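One way to picture mixture modeling with few-shot generalization is sketched below. It rests on assumptions not stated in the text above: each user's ideal point is treated as a convex combination of a small set of shared prototype points, the prototypes are frozen for a new user, and only the mixture weight is fitted, here by a simple grid search over two prototypes. All names and data are hypothetical, and this is not the paper's training procedure.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sq_dist(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def mix(prototypes, weights):
    """Convex combination of prototype points -> one user's ideal point."""
    dim = len(prototypes[0])
    return tuple(
        sum(w * p[d] for w, p in zip(weights, prototypes)) for d in range(dim)
    )

def log_likelihood(user_point, comparisons):
    """comparisons: list of (preferred_item, rejected_item) latent vectors."""
    return sum(
        math.log(sigmoid(sq_dist(lose, user_point) - sq_dist(win, user_point)))
        for win, lose in comparisons
    )

def fit_new_user(prototypes, comparisons, steps=20):
    """Few-shot fit: prototypes stay frozen; only the weight is searched."""
    best_w, best_ll = None, -math.inf
    for i in range(steps + 1):
        w = i / steps
        ll = log_likelihood(mix(prototypes, (w, 1.0 - w)), comparisons)
        if ll > best_ll:
            best_w, best_ll = (w, 1.0 - w), ll
    return best_w

# Two shared prototypes (assumed already learned from the population).
prototypes = [(1.0, 0.0), (0.0, 1.0)]
# A new user's two comparisons, each favoring the item nearer prototype 0.
new_user_data = [((0.9, 0.1), (0.1, 0.9)), ((0.8, 0.0), (0.0, 0.8))]
weights = fit_new_user(prototypes, new_user_data)
```

Because only a low-dimensional weight vector is estimated for the new user, a handful of comparisons is enough to place that user in the shared latent space. In this toy example the fitted weight leans toward the first prototype; a practical implementation would likely use gradient-based optimization rather than grid search.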
