CDR: Customizable Density Ratios of Strong-over-weak LLMs for Preference Annotation

Guangxuan Xu, Kai Xu, Shivchander Sudalairaj, Hao Wang, Akash Srivastava
2024-11-12
Abstract: Preference tuning of large language models (LLMs) relies on high-quality human preference data, which is often expensive and time-consuming to gather. While existing methods can use trained reward models or proprietary models as judges for preference annotation, they have notable drawbacks: training reward models remains dependent on initial human data, and using proprietary models imposes license restrictions that inhibit commercial usage. In this paper, we introduce customized density ratio (CDR), a training-free and highly effective method that leverages off-the-shelf LLMs for preference data annotation. Our approach uses the log-density ratio between a better-aligned LLM and a less-aligned LLM as a reward signal. We explore 221 different LLM pairs and empirically demonstrate that increasing the performance gap between paired LLMs correlates with better reward generalization. Furthermore, we show that tailoring the density ratio reward function with specific criteria and preference exemplars enhances performance across domains and within target areas. In our experiment using the density ratio from a pair of Mistral-7B models, CDR achieves a RewardBench score of 82.6, outperforming the best trained reward functions from the same model class and demonstrating competitive performance against SoTA models in the Safety (91.0) and Reasoning (88.0) domains. We use CDR to annotate an on-policy preference dataset with which we preference tune Llama-3-8B-Instruct with SimPO. Using reward signals from two relatively weak models, our approach pushes Llama-3-8B to achieve a 37.4% (+15.1%) win rate on ArenaHard and a 40.7% (+17.8%) win rate on Length-Controlled AlpacaEval 2.0, along with a score of 8.0 on MT-Bench.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the dependence of LLM preference tuning on high-quality human preference data, which is expensive and time-consuming to collect. Existing alternatives have clear drawbacks: trained reward models still require initial human data, and proprietary models used as judges carry license restrictions that hinder commercial use. The paper proposes Customized Density Ratio (CDR), a training-free method that uses off-the-shelf LLMs to annotate preference data and thereby reduces the need for human labels. The reward signal is the log-density ratio between a better-aligned LLM and a less-aligned LLM. Across 221 LLM pairings, the authors show empirically that a larger performance gap between the two paired models correlates with better generalization of the reward signal. They also show that customizing the density-ratio reward function with specific criteria and preference exemplars improves performance both across domains and within a target area. With a pair of Mistral-7B models, CDR scores 82.6 on RewardBench, outperforming the best trained reward functions from the same model class and performing competitively with state-of-the-art models in the Safety and Reasoning domains.
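The log-density ratio reward described above can be illustrated with a minimal sketch. It assumes the per-token log-probabilities of each response have already been obtained from a stronger (better-aligned) and a weaker (less-aligned) model; the function names and the numbers are hypothetical, and the paper's customization step (criteria and preference exemplars) is not modeled here.

```python
from typing import Sequence, Tuple


def density_ratio_reward(
    strong_token_logprobs: Sequence[float],
    weak_token_logprobs: Sequence[float],
) -> float:
    """Log-density ratio: log p_strong(y | x) - log p_weak(y | x).

    Each argument holds the per-token log-probabilities that one model
    assigns to the same response y given the same prompt x.
    """
    return sum(strong_token_logprobs) - sum(weak_token_logprobs)


def annotate_pair(reward_a: float, reward_b: float) -> Tuple[str, str]:
    """Return (chosen, rejected) labels; the higher reward is 'chosen'."""
    return ("a", "b") if reward_a >= reward_b else ("b", "a")


# Hypothetical log-probabilities for two candidate responses to one prompt.
reward_a = density_ratio_reward([-0.5, -0.5], [-1.0, -1.0])  # strong model prefers A
reward_b = density_ratio_reward([-2.0, -1.0], [-1.0, -1.0])  # strong model disprefers B
chosen, rejected = annotate_pair(reward_a, reward_b)
print(reward_a, reward_b, chosen, rejected)
```

In practice the token log-probabilities would come from scoring the same response under both models; the sketch only shows how the two scores combine into a reward and a preference label.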