Abstract: Reinforcement learning with human feedback (RLHF) is critical for aligning Large Language Models (LLMs) with human preference. Compared to the widely studied offline version of RLHF, e.g., direct preference optimization (DPO), recent works have shown that the online variants achieve even better alignment. However, online alignment requires on-the-fly generation of new training data, which is costly, hard to parallelize, and suffers from varying quality and utility. In this paper, we propose a more efficient data exploration strategy for online preference tuning (OPTune), which does not rely on human-curated or pre-collected teacher responses but dynamically samples informative responses for on-policy preference alignment. During data generation, OPTune only selects prompts whose (re)generated responses can potentially provide more informative and higher-quality training signals than the existing responses. In the training objective, OPTune reweights each generated response (pair) by its utility in improving the alignment so that learning can be focused on the most helpful samples. Throughout our evaluations, OPTune'd LLMs maintain the instruction-following benefits provided by standard preference tuning whilst enjoying 1.27-1.56x faster training speed due to the efficient data exploration strategy.

What problem does this paper attempt to address?

This paper addresses how to make data generation and training more efficient during online preference tuning (OPT) while maintaining or improving the alignment of large language models (LLMs) with human preferences. Specifically, the authors point out two main bottlenecks in current online preference tuning methods:

1. **High cost of data generation**: Online preference tuning must keep generating new responses during training, which is time-consuming and hard to parallelize, so overall training efficiency is low.
2. **Uneven data quality**: The generated responses vary in quality, which affects how well the model trains.

To overcome these challenges, the paper proposes a method named OPTune, which improves the efficiency of online preference tuning in the following two ways:

### 1. Reward-based Prompt Selection

OPTune reduces unnecessary data generation by regenerating responses only for prompts whose current responses have low reward. The specific steps are as follows:

- **Select prompts with low rewards**: In each iteration, OPTune selects the prompts whose current responses have the lowest reward values and regenerates responses for them.
- **Mix high-quality responses**: The regenerated responses are mixed with the existing high-quality responses to form a complete training set.

This reduces the cost of data generation while keeping the generated data useful for training.

### 2. Weighted DPO Loss Function

The traditional DPO loss function quantizes scalar rewards into binary labels, which loses information. OPTune introduces a weighted DPO loss function (wDPO) that uses the difference in reward values directly.
The specific formula is as follows:

\[ L_{\text{wDPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim D} \left[ R(x, y_w, y_l) \cdot \log I(x, y_w, y_l) \right] \]

where:

- \( I(x, y_w, y_l) = \sigma \left( \beta_1 \log \frac{\pi_{t+1}(y_w \mid x)}{\pi_t(y_w \mid x)} - \beta_1 \log \frac{\pi_{t+1}(y_l \mid x)}{\pi_t(y_l \mid x)} \right) \)
- \( R(x, y_w, y_l) = \sigma \left( \beta_2 \left( r(x, y_w) - r(x, y_l) \right) \right) \)

In this way, the wDPO loss function uses the reward signal more effectively, giving priority to response pairs with large reward differences and thereby improving training efficiency.

### Experimental Results

The paper verifies the effectiveness of OPTune through a series of experiments:

- **Generation efficiency**: OPTune significantly reduces data generation time while maintaining performance, achieving 1.27x to 1.56x faster training.
- **Training efficiency**: OPTune with the wDPO loss function reaches the same performance level faster than the traditional DPO method.
- **Benchmark tests**: Across multiple benchmarks, models trained with OPTune perform well on factuality, multi-task problem solving, grade-school mathematics, and common-sense reasoning.

In summary, OPTune improves the efficiency and effectiveness of online preference tuning by optimizing both data generation and training, offering a route to resource-efficient preference-aligned LLMs.
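The two components summarized above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the `ratio` parameter, the tuple layout of a preference pair, and the default values of `beta1` and `beta2` are all assumptions made for illustration. It selects the lowest-reward prompts for regeneration and evaluates the wDPO loss on a batch of already-scored pairs, using only the Python standard library.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def sigmoid(z):
    """Numerically stable logistic function."""
    return math.exp(log_sigmoid(z))


def select_low_reward_prompts(rewards, ratio):
    """Return indices of the `ratio` fraction of prompts whose current
    responses have the lowest reward (hypothetical helper)."""
    k = math.ceil(ratio * len(rewards))
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    return sorted(order[:k])


def wdpo_loss(pairs, beta1=0.1, beta2=1.0):
    """Mean wDPO loss over preference pairs.

    Each pair is an assumed tuple layout:
    (logp_new_w, logp_old_w, logp_new_l, logp_old_l, r_w, r_l),
    where logp_new/logp_old are log-probabilities under pi_{t+1} and pi_t
    and r_w, r_l are scalar rewards. beta1 and beta2 are illustrative values.
    """
    total = 0.0
    for lp_new_w, lp_old_w, lp_new_l, lp_old_l, r_w, r_l in pairs:
        # Argument of I: beta1 * (log-ratio of winner - log-ratio of loser)
        margin = beta1 * ((lp_new_w - lp_old_w) - (lp_new_l - lp_old_l))
        # R: sigmoid of the scaled reward gap, used as a per-pair weight
        weight = sigmoid(beta2 * (r_w - r_l))
        total += -weight * log_sigmoid(margin)
    return total / len(pairs)


if __name__ == "__main__":
    rewards = [0.9, 0.1, 0.5, 0.3]
    print(select_low_reward_prompts(rewards, ratio=0.5))
    batch = [(-1.0, -1.2, -2.0, -1.8, 0.8, 0.2)]
    print(wdpo_loss(batch))
```

Only the prompts returned by the selection step would need new responses; the remaining prompts keep their existing responses, which is where the reduction in generation cost comes from. In the loss, a larger reward gap produces a larger weight, so pairs with clearer preferences contribute more to the average.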

OPTune: Efficient Online Preference Tuning
