Preference Optimization with Multi-Sample Comparisons

Chaoqi Wang, Zhuokai Zhao, Chen Zhu, Karthik Abinav Sankararaman, Michal Valko, Xuefei Cao, Zhaorun Chen, Madian Khabsa, Yuxin Chen, Hao Ma, Sinong Wang
2024-10-16
Abstract: Recent advancements in generative models, particularly large language models (LLMs) and diffusion models, have been driven by extensive pretraining on large datasets followed by post-training. However, current post-training methods such as reinforcement learning from human feedback (RLHF) and direct alignment from preference methods (DAP) primarily utilize single-sample comparisons. These approaches often fail to capture critical characteristics such as generative diversity and bias, which are more accurately assessed through multiple samples. To address these limitations, we introduce a novel approach that extends post-training to include multi-sample comparisons. To achieve this, we propose Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO). These methods improve traditional DAP methods by focusing on group-wise characteristics. Empirically, we demonstrate that multi-sample comparison is more effective in optimizing collective characteristics (e.g., diversity and bias) for generative models than single-sample comparison. Additionally, our findings suggest that multi-sample comparisons provide a more robust optimization framework, particularly for datasets with label noise.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses a limitation of current post-training methods for generative models: reinforcement learning from human feedback (RLHF) and direct alignment from preference (DAP) methods rely mainly on single-sample comparisons. Such comparisons often fail to capture important characteristics of generative models, such as generative diversity and bias, which are assessed more accurately across multiple samples. Specifically, the paper identifies the following issues:

1. **Lack of generative diversity**: Existing post-training methods do little to preserve or improve the diversity of a model's outputs. For example, a large language model (LLM) asked for different types of narratives may favor a few specific types.
2. **Bias problems**: Generative models may exhibit gender or racial bias in what they generate. For example, an image model may produce more images of one gender, and a model asked for random numbers may favor certain numbers.
3. **Limitations of single-sample comparison**: Comparing one output against another cannot assess distribution-level properties of a model, such as diversity and bias. For example, evaluating a model's creativity or consistency requires analyzing the variability across multiple outputs, not a single output.

To overcome these limitations, the paper introduces multi-sample comparison methods: Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO). By evaluating the collective characteristics of groups of samples, these methods align the model's output distribution more closely with the desired distributional characteristics, thereby improving the performance and reliability of generative models.
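
The random-number example above can be made concrete with a small sketch. This is not the paper's method. It only illustrates why a group-level statistic can separate two models when a single-sample comparison cannot. The two sample groups, the helper `empirical_entropy`, and the choice of Shannon entropy as the diversity measure are all assumptions made for illustration.

```python
import math
from collections import Counter


def empirical_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of a sample group."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical outputs from two models asked to "pick a random digit from 0 to 9".
group_uniform = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # every digit appears once
group_biased = [7, 7, 7, 3, 7, 7, 7, 7, 3, 7]   # collapses onto 7 and 3

# Single-sample view: one draw from each model can be identical (both are 7 here),
# so comparing one output against one output reveals nothing about bias.
single_a, single_b = group_uniform[7], group_biased[0]
single_sample_distinguishes = single_a != single_b

# Multi-sample view: a statistic of the whole group exposes the difference.
entropy_uniform = empirical_entropy(group_uniform)
entropy_biased = empirical_entropy(group_biased)
preferred = "uniform" if entropy_uniform > entropy_biased else "biased"

print(f"single-sample comparison distinguishes the models: {single_sample_distinguishes}")
print(f"entropy (uniform group): {entropy_uniform:.3f} bits")
print(f"entropy (biased group):  {entropy_biased:.3f} bits")
print(f"group-level preference:  {preferred}")
```

In a multi-sample preference dataset, a label such as `preferred` would be attached to a pair of groups, not to a pair of individual responses. How mDPO and mIPO turn such group-level labels into a training objective is defined in the paper and is not reproduced here.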