Abstract: In this report, we introduce a collection of methods to enhance reward modeling for LLMs, focusing specifically on data-centric techniques. We propose effective data selection and filtering strategies for curating high-quality open-source preference datasets, culminating in the Skywork-Reward data collection, which contains only 80K preference pairs -- significantly smaller than existing datasets. Using this curated dataset, we developed the Skywork-Reward model series -- Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B -- with the former currently holding the top position on the RewardBench leaderboard. Notably, our techniques and datasets have directly enhanced the performance of many top-ranked models on RewardBench, highlighting the practical impact of our contributions in real-world preference learning applications.

What problem does this paper attempt to address?

The paper aims to improve reward modeling for large language models (LLMs), focusing on dataset construction and training methods. Specifically, it examines how carefully selecting and filtering high-quality preference data can improve reward model performance and better align models with user preferences.

### Main Problems and Solutions

1. **Complex and Variable Human Preferences**
   - **Problem**: Human preferences are complex and diverse, and they are difficult to represent comprehensively, which makes reward models hard to train.
   - **Solution**: Apply a series of data selection and filtering strategies to build a high-quality open-source preference dataset (Skywork-Reward) whose preference pairs effectively improve model performance.
2. **Quality and Scale Problems of Existing Datasets**
   - **Problem**: Existing open-source preference datasets are often noisy: the differences between the two responses in a pair are either too subtle or inconsistently labeled, which hurts reward model performance.
   - **Solution**: The paper proposes Skywork-Reward, a compact preference dataset of only 80,000 pairs, far fewer than existing datasets, which nevertheless improves model performance significantly thanks to strict data screening and optimization.
3. **Improving the Training Objectives of the Reward Model**
   - **Problem**: Traditional loss functions may, in some cases, fail to maximize the gap between positive and negative samples, resulting in poor model performance.
   - **Solution**: The paper experiments with multiple loss function variants (such as Focal Loss and Hinge Loss) but ultimately finds that the classic Bradley-Terry loss is the most robust in practice.
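The pairwise objectives named in item 3 can be sketched in a few lines. The following is a minimal illustration in plain Python, assuming scalar reward scores for the chosen and rejected responses. The function names, the `margin` default, and the example scores are hypothetical, and the formulas are the generic textbook forms of the Bradley-Terry and pairwise hinge losses, not necessarily the exact variants evaluated in the paper.

```python
import math


def bradley_terry_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    diff = chosen_reward - rejected_reward
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch on the sign of d
    # so that exp() never receives a large positive argument.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def hinge_loss(chosen_reward: float, rejected_reward: float, margin: float = 1.0) -> float:
    """Pairwise hinge loss: zero once the chosen score leads by at least `margin`."""
    return max(0.0, margin - (chosen_reward - rejected_reward))


def mean_loss(pairs, loss_fn) -> float:
    """Average a pairwise loss over (chosen_reward, rejected_reward) tuples."""
    return sum(loss_fn(chosen, rejected) for chosen, rejected in pairs) / len(pairs)


# Hypothetical reward scores for three preference pairs (illustrative only).
example_pairs = [(2.0, 0.5), (0.1, 0.3), (1.2, 1.2)]

print("Bradley-Terry:", mean_loss(example_pairs, bradley_terry_loss))
print("Hinge:", mean_loss(example_pairs, hinge_loss))
```

Both losses shrink as the chosen response's score pulls ahead of the rejected one. The hinge loss stops rewarding separation beyond the margin, whereas the Bradley-Terry loss keeps decreasing smoothly.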
### Main Contributions of the Paper

- **High-Quality Dataset**: The paper proposes the Skywork-Reward dataset and ensures its quality through strict screening and filtering strategies.
- **Efficient Training Method**: The paper develops the Skywork-Reward model series (Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B), which achieves excellent results on the RewardBench benchmark.
- **Wide Applicability**: The proposed techniques and datasets have improved the performance of many top-ranked models on RewardBench, demonstrating their value in practical preference learning tasks.

### Summary

By focusing on dataset construction and the optimization of training methods, the paper addresses several challenges in reward model training, in particular the complexity of human preferences and the quality problems of existing datasets. These improvements raise reward model performance and provide useful resources and methodological support for future research.

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
