Abstract: In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area. It works by using pairs of preferences derived from the same prompts, and it functions without needing an additional reward model. However, DPO does not fully reflect the complex nature of human learning, which often involves understanding contrasting responses to not only identical but also similar questions. To overcome this shortfall, we propose Relative Preference Optimization (RPO). RPO is designed to discern between more and less preferred responses derived from both identical and related prompts. It introduces a contrastive weighting mechanism, enabling the tuning of LLMs using a broader range of preference data, including both paired and unpaired sets. This approach expands the learning capabilities of the model, allowing it to leverage insights from a more varied set of prompts. Through empirical tests, including dialogue and summarization tasks, and evaluations using the AlpacaEval2.0 leaderboard, RPO has demonstrated a superior ability to align LLMs with user preferences and to improve their adaptability during the training process. Our code can be viewed at https://github.com/yinyueqin/relative-preference-optimization

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper attempts to address the challenges faced by large language models (LLMs) in aligning with user preferences. Although existing Direct Preference Optimization (DPO) methods have improved model alignment to some extent, they primarily rely on preference pairs derived from the same prompts, failing to fully reflect the complexity of human learning, especially in understanding similar but not identical prompts. To address this issue, the paper proposes the Relative Preference Optimization (RPO) method.

### Specific Problem Description

1. **Limitations of Preference Alignment**:
   - While existing DPO methods are effective, their training process is limited to preference pairs obtained from the same prompts, restricting the model's learning scope.
   - Human learning often requires understanding the differences between different but related prompts, which DPO fails to fully utilize.
2. **Challenges in Data Acquisition**:
   - Acquiring paired preference data can be challenging and costly, especially in sensitive fields such as healthcare and personalized services, where ethical considerations are paramount.
3. **Enhancing Model Adaptability**:
   - Improving the model's adaptability in different contexts, particularly in the absence of explicit preference pairs, remains an important research direction.

### Solution

The RPO method proposed in the paper addresses the above issues through the following approaches:

1. **Contrastive Weight Mechanism**:
   - RPO introduces a contrastive weight mechanism that can utilize a broader range of preference data during training, including both paired and unpaired datasets.
   - This mechanism allows the model to learn from more diverse prompts, thereby better aligning with user preferences.
2. **Semantic Relevance Analysis**:
   - RPO analyzes the semantic similarity of prompts within each mini-batch, classifying prompt pairs as highly relevant or irrelevant.
   - A contrastive matrix is constructed to guide the model in distinguishing preferred from non-preferred responses, applicable to both identical and semantically related prompts.
3. **Weight Strategies**:
   - Three different weight strategies are proposed to recalibrate the comparison of each contrastive sample pair:
     - Embedding Distance Weighting Strategy: Weights based on the distance of prompt feature embeddings.
     - Uniform Weighting Strategy: Assigns equal weight to each contrastive pair.
     - Diagonal Emphasis Weighting Strategy: In paired data scenarios, gives more weight to diagonal elements (comparisons of the same prompt).

### Experimental Validation

The paper validates the effectiveness of RPO through a series of experiments, including dialogue and summarization tasks, and evaluates using the AlpacaEval2.0 leaderboard. Experimental results show that RPO significantly outperforms existing alignment methods such as DPO, IPO, and KTO across multiple large language models, demonstrating its superior performance in key language processing tasks.

### Conclusion

By introducing a contrastive weight mechanism and semantic relevance analysis, RPO expands the model's learning scope, enhances its adaptability and alignment performance in different contexts, and more closely approximates the human learning process.
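
The three weight strategies and the contrastive matrix described in the Solution section can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the row-wise softmax normalization, the temperature `tau`, the diagonal weight `alpha`, and the way the weights enter a DPO-style log-sigmoid loss are all assumptions made for illustration. The inputs `win_scores` and `lose_scores` stand for per-response log-probability ratios between the policy and a reference model, which is the quantity DPO contrasts.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def embedding_weights(win_emb, lose_emb, tau=0.5):
    """Embedding-distance strategy (assumed form): for each preferred prompt,
    a softmax over rejected prompts of -cosine_distance / tau, so closer
    prompts receive larger weights. Each row sums to 1."""
    weights = []
    for u in win_emb:
        scores = [math.exp(-cosine_distance(u, v) / tau) for v in lose_emb]
        total = sum(scores)
        weights.append([s / total for s in scores])
    return weights


def uniform_weights(m, n):
    """Uniform strategy: every contrastive pair in a row gets weight 1/n."""
    return [[1.0 / n] * n for _ in range(m)]


def diagonal_weights(m, alpha=0.8):
    """Diagonal-emphasis strategy for paired data (m wins, m losses):
    weight alpha on same-prompt pairs, the remainder spread off-diagonal.
    alpha is a hypothetical hyperparameter name."""
    if m == 1:
        return [[1.0]]
    off = (1.0 - alpha) / (m - 1)
    return [[alpha if i == j else off for j in range(m)] for i in range(m)]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def contrastive_loss(win_scores, lose_scores, weights, beta=0.1):
    """Average over the M x N contrastive matrix of
    -log sigmoid(beta * w_ij * (win_i - lose_j)).
    Placing w_ij inside the sigmoid is an assumption of this sketch."""
    m, n = len(win_scores), len(lose_scores)
    total = 0.0
    for i in range(m):
        for j in range(n):
            margin = win_scores[i] - lose_scores[j]
            total += log_sigmoid(beta * weights[i][j] * margin)
    return -total / (m * n)
```

With paired data the diagonal of the matrix holds same-prompt comparisons, which is where plain DPO operates; the off-diagonal entries are the extra cross-prompt comparisons that the summary attributes to RPO. With unpaired data (M preferred and N rejected responses from different prompts) only the embedding-distance and uniform strategies apply in this sketch.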

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples

MallowsPO: Fine-Tune Your LLM with Preference Dispersions

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization

Ordinal Preference Optimization: Aligning Human Preferences via NDCG

$α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Aligning CodeLLMs with Direct Preference Optimization

Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Token-level Direct Preference Optimization

Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment

WPO: Enhancing RLHF with Weighted Preference Optimization

RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models

Accelerated Preference Optimization for Large Language Model Alignment

New Desiderata for Direct Preference Optimization

Minor DPO reject penalty to increase training robustness

Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game