Annotation-Efficient Preference Optimization for Language Model Alignment

Yuu Jinnai, Ukyo Honda

2024-05-22

Abstract: Preference optimization is a standard approach to fine-tuning large language models to align with human preferences. The quality, diversity, and quantity of the preference dataset are critical to the effectiveness of preference optimization. However, obtaining a large amount of high-quality and diverse preference annotations is difficult in many applications. This raises the question of how to use the limited annotation budget to create an effective preference dataset. To this end, we propose Annotation-Efficient Preference Optimization (AEPO). Instead of exhaustively annotating preference over all available response texts, AEPO selects a subset of responses that maximizes quality and diversity from the available responses, and then annotates preference over the selected ones. In this way, AEPO focuses the annotation budget on labeling preference over a smaller subset of responses with diversity and of high quality. We evaluate the performance of Direct Preference Optimization (DPO) using AEPO and show that it outperforms models trained using a standard DPO with the same annotation budget. Our code is available at

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses how to build an effective preference dataset under a limited annotation budget when aligning large language models (LLMs). Specifically, it asks how to reduce the annotation workload by selecting diverse, high-quality responses while maintaining or improving model performance. Existing methods such as Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) rely on large amounts of high-quality preference-annotated data, which is costly and time-consuming to acquire. The paper therefore proposes Annotation-Efficient Preference Optimization (AEPO), a preference optimization method that reduces annotation requirements through a sub-sampling strategy, yielding a more diverse and higher-quality preference dataset under a limited budget.

The main contributions of AEPO are as follows:

1. **Reducing annotation costs**: Compared with the traditional West-of-N (WoN) strategy, AEPO significantly reduces the required number of annotations by selecting a subset of diverse, high-quality responses for annotation instead of annotating all responses.
2. **Improving model performance**: The experimental results show that AEPO outperforms the traditional DPO method on multiple datasets (such as AlpacaFarm and Anthropic's Helpfulness and Harmlessness datasets), especially when the number of responses is large.
3. **Applicable to multiple tasks**: AEPO performs well not only on language model alignment tasks but also shows good generalization on other benchmarks (such as ARC, HellaSwag, TruthfulQA, and WinoGrande).

Through these improvements, AEPO provides a feasible way to train and optimize large language models efficiently in resource-constrained settings.
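To make the "select a diverse, high-quality subset, then annotate only that subset" idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the greedy rule, the token-level Jaccard similarity, the `diversity_weight` parameter, and the annotation-free `quality` proxy scores are all assumptions introduced for illustration, since the text above does not specify how quality and diversity are measured.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity in [0, 1] (assumed stand-in similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def select_subset(responses, quality, k=2, diversity_weight=1.0):
    """Greedily pick k response indices, trading quality against redundancy.

    `quality` holds hypothetical proxy scores that need no human annotation.
    Each step adds the response whose score, minus its weighted similarity
    to the responses already chosen, is largest.
    """
    if len(responses) != len(quality):
        raise ValueError("responses and quality must have the same length")
    if not 0 < k <= len(responses):
        raise ValueError("k must be between 1 and the number of responses")

    selected = []
    remaining = list(range(len(responses)))
    while len(selected) < k:
        def gain(i):
            redundancy = sum(jaccard(responses[i], responses[j]) for j in selected)
            return quality[i] - diversity_weight * redundancy

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy example: four candidate responses to one prompt, with made-up proxy scores.
responses = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France's capital city is Paris, on the Seine.",
    "I am not sure.",
]
quality = [0.9, 0.88, 0.8, 0.1]

# Only the selected responses would be sent to annotators for preference labels.
chosen = select_subset(responses, quality, k=2)
print([responses[i] for i in chosen])
```

With these toy inputs the selection keeps response 0 and skips its near-duplicate (response 1) in favour of response 2. Setting `diversity_weight=0.0` reduces the rule to picking the top-k by proxy score alone. Under this sketch, the annotation budget is spent on `k` responses per prompt rather than on all candidates.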

Direct Preference Optimization with an Offset

Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment

$α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

Not All Preference Pairs Are Created Equal: A Recipe for Annotation-Efficient Iterative Preference Learning

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

New Desiderata for Direct Preference Optimization

Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game

Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization

AIPO: Improving Training Objective for Iterative Preference Optimization

Towards Efficient Exact Optimization of Language Model Alignment

Margin-aware Preference Optimization for Aligning Diffusion Models without Reference

Filtered Direct Preference Optimization

Robust Preference Optimization through Reward Model Distillation

ULMA: Unified Language Model Alignment with Human Demonstration and Point-wise Preference

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

Token-level Direct Preference Optimization

Direct Preference Optimization Using Sparse Feature-Level Constraints