Abstract: Recent advancements in generative models, particularly large language models (LLMs) and diffusion models, have been driven by extensive pretraining on large datasets followed by post-training. However, current post-training methods such as reinforcement learning from human feedback (RLHF) and direct alignment from preference methods (DAP) primarily utilize single-sample comparisons. These approaches often fail to capture critical characteristics such as generative diversity and bias, which are more accurately assessed through multiple samples. To address these limitations, we introduce a novel approach that extends post-training to include multi-sample comparisons. To achieve this, we propose Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO). These methods improve traditional DAP methods by focusing on group-wise characteristics. Empirically, we demonstrate that multi-sample comparison is more effective in optimizing collective characteristics (e.g., diversity and bias) for generative models than single-sample comparison. Additionally, our findings suggest that multi-sample comparisons provide a more robust optimization framework, particularly for datasets with label noise.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses a limitation of current post-training methods for generative models: reinforcement learning from human feedback (RLHF) and direct alignment from preference (DAP) methods rely mainly on single-sample comparisons. Such comparisons often fail to capture important characteristics of generative models, such as generative diversity and bias, which are assessed more accurately by comparing multiple samples. Specifically, the paper identifies the following problems:

1. **Lack of generative diversity**: Existing post-training methods do little to preserve the diversity of a model's outputs. For example, a large language model (LLM) asked to generate different kinds of narratives may favor a few specific types.
2. **Bias problems**: Generative models may exhibit gender or racial bias in what they generate. For example, an image model may generate more images of one gender, and a model asked for random numbers may favor certain specific numbers.
3. **Limitations of single-sample comparison**: A single-sample comparison cannot fully evaluate distributional properties of the model such as diversity and bias. For example, evaluating a model's creativity or consistency requires analyzing the variability across multiple outputs, not just one output.

To overcome these limitations, the paper introduces multi-sample comparison methods: Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO). By evaluating the collective characteristics of groups of samples, these methods better align the model's outputs with the desired distributional properties, thereby improving the performance and reliability of generative models.
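The random-number example makes it easy to see why diversity and bias are properties of a group of samples rather than of any one sample. The sketch below is only an illustration, not the paper's method: the helper names (`empirical_entropy`, `prefer_more_diverse`), the use of Shannon entropy as the diversity score, and the sample data are all assumptions made for this example.

```python
import math
from collections import Counter


def empirical_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of `samples`."""
    counts = Counter(samples)
    total = len(samples)
    # Each term is -p * log2(p), where p is the observed frequency of a value.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def prefer_more_diverse(group_a, group_b):
    """Compare two groups of samples by entropy (hypothetical preference rule).

    Returns "a" or "b" for the more diverse group, or "tie" if they match.
    """
    h_a = empirical_entropy(group_a)
    h_b = empirical_entropy(group_b)
    if math.isclose(h_a, h_b):
        return "tie"
    return "a" if h_a > h_b else "b"


# Made-up outputs for a prompt such as "pick a random digit from 0 to 3".
biased_group = [3, 3, 3, 3, 3, 3, 3, 1]   # model favors the digit 3
uniform_group = [0, 1, 2, 3, 0, 1, 2, 3]  # model covers all digits evenly

# Group-level comparison separates the two models.
print(prefer_more_diverse(biased_group, uniform_group))

# A single-sample comparison cannot: one draw from each model may be identical.
print(prefer_more_diverse([3], [3]))
```

With these inputs the group-level comparison picks the uniform group, while the single-sample comparison reports a tie, since a group of size one always has zero entropy. Any other group-level statistic could stand in for entropy here; the point is only that the signal appears once several samples are compared together.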

Preference Optimization with Multi-Sample Comparisons

Preference as Reward, Maximum Preference Optimization with Importance Sampling

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

MallowsPO: Fine-Tune Your LLM with Preference Dispersions

Statistical Rejection Sampling Improves Preference Optimization

mDPO: Conditional Preference Optimization for Multimodal Large Language Models

Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment

The Crucial Role of Samplers in Online Direct Preference Optimization

Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization

New Desiderata for Direct Preference Optimization

Margin-aware Preference Optimization for Aligning Diffusion Models without Reference

Direct Preference Optimization With Unobserved Preference Heterogeneity

Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence

Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization

$α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

AIPO: Improving Training Objective for Iterative Preference Optimization

Preference Optimization as Probabilistic Inference

On the Generalization of Preference Learning with DPO