Compositional preference models for aligning LMs

Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Marc Dymetman

2024-03-15

Abstract: As language models (LMs) become more capable, it is increasingly important to align them with human preferences. However, the dominant paradigm for training Preference Models (PMs) for that purpose suffers from fundamental limitations, such as lack of transparency and scalability, along with susceptibility to overfitting the preference dataset. We propose Compositional Preference Models (CPMs), a novel PM framework that decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features from a prompted LM, and aggregates these scores using a logistic regression classifier. Through these simple steps, CPMs allow one to control which properties of the preference data are used to train the preference model and to build it based on features that are believed to underlie the human preference judgment. Our experiments show that CPMs not only improve generalization and are more robust to overoptimization than standard PMs, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs. Overall, our approach demonstrates the benefits of endowing PMs with priors about which features determine human preferences while relying on LM capabilities to extract those features in a scalable and robust way.

Computation and Language, Machine Learning

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper addresses the problem of aligning language models (LMs) with human preferences, which becomes more important as LMs grow more capable. Existing methods for training preference models (PMs) have fundamental limitations: lack of transparency, poor scalability, and a tendency to overfit the preference dataset.

The paper proposes **Compositional Preference Models (CPMs)**, a new PM framework. CPMs decompose a global preference judgment into several interpretable features, obtain a scalar score for each feature by prompting a language model (e.g., GPT-3.5), and aggregate the scores with a logistic regression classifier. This lets the designer control which properties of the preference data the PM is trained on, and it makes the model more transparent and interpretable.

Experiments show that CPMs generalize better than standard PMs and are more robust to overoptimization. In addition, best-of-n samples selected with CPMs tend to be preferred over samples selected with conventional PMs. Overall, the approach demonstrates the benefit of endowing PMs with priors about which features determine human preferences, while relying on LM capabilities to extract those features in a scalable and robust way.
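To make the three steps concrete (per-feature scores, logistic-regression aggregation, best-of-n selection), here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the feature names, the toy scores, and the choice to fit the classifier on differences of feature scores between two responses are all hypothetical. In a real CPM the scores would come from a prompted LM; here they are hard-coded.

```python
import math

# Hypothetical feature names; the summary above does not list the real ones.
FEATURES = ["helpfulness", "specificity", "readability"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def fit_logistic(diffs, labels, lr=0.5, epochs=500):
    """Plain gradient descent on the logistic loss (no intercept).

    sigmoid(w . diff) models the probability that the first response
    of a pair is the preferred one.
    """
    w = [0.0] * len(diffs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(diffs, labels):
            p = sigmoid(dot(w, x))
            for j, xj in enumerate(x):
                grad[j] += (p - y) * xj
        w = [wi - lr * g / len(diffs) for wi, g in zip(w, grad)]
    return w

def best_of_n(w, candidates):
    """Return the candidate whose weighted feature score is highest."""
    return max(candidates, key=lambda c: dot(w, c["scores"]))

# Toy preference pairs: (scores of response A, scores of response B, label),
# where label 1 means A was preferred. Scores stand in for LM-produced
# feature ratings rescaled to [0, 1] (an assumption for this sketch).
pairs = [
    ([0.9, 0.5, 0.6], [0.3, 0.6, 0.6], 1),
    ([0.2, 0.7, 0.5], [0.8, 0.4, 0.5], 0),
    ([0.7, 0.8, 0.4], [0.4, 0.3, 0.5], 1),
    ([0.3, 0.2, 0.7], [0.6, 0.6, 0.6], 0),
]
diffs = [[a - b for a, b in zip(sa, sb)] for sa, sb, _ in pairs]
labels = [y for _, _, y in pairs]

weights = fit_logistic(diffs, labels)
predictions = [1 if sigmoid(dot(weights, d)) > 0.5 else 0 for d in diffs]

# Best-of-n: score n sampled responses and keep the top one.
candidates = [
    {"text": "answer A", "scores": [0.9, 0.5, 0.5]},
    {"text": "answer B", "scores": [0.3, 0.5, 0.5]},
    {"text": "answer C", "scores": [0.5, 0.5, 0.5]},
]
best = best_of_n(weights, candidates)

for name, w in zip(FEATURES, weights):
    print(f"{name}: {w:.3f}")
print("selected:", best["text"])
```

Because the aggregate is a linear combination of named features, the fitted weights can be read directly as an indication of how much each feature contributes to the preference judgment, which is the interpretability benefit the summary describes.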

General Preference Modeling with Preference Representations for Aligning Language Models

Panacea: Pareto Alignment via Preference Adaptation for LLMs

Dissecting Human and LLM Preferences

ComPO: Community Preferences for Language Model Personalization

Orchestrating LLMs with Different Personalizations

Uncovering Factor Level Preferences to Improve Human-Model Alignment

A Survey on Human Preference Learning for Large Language Models

Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Towards a Unified View of Preference Learning for Large Language Models: A Survey

Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game

MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time

SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling

MallowsPO: Fine-Tune Your LLM with Preference Dispersions

Aligning Large Language Model with Direct Multi-Preference Optimization for Recommendation

LLM-augmented Preference Learning from Natural Language

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

Decompose and Leverage Preferences from Expert Models for Improving Trustworthiness of MLLMs

Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

Modality-Fair Preference Optimization for Trustworthy MLLM Alignment