HelpSteer2: Open-source dataset for training top-performing reward models

Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, Oleksii Kuchaiev

2024-06-13

Abstract: High-quality preference datasets are essential for training reward models that can effectively guide large language models (LLMs) in generating high-quality responses aligned with human preferences. As LLMs become stronger and better aligned, permissively licensed preference datasets, such as Open Assistant, HH-RLHF, and HelpSteer, need to be updated to remain effective for reward modeling. Methods that distil preference data from proprietary LLMs such as GPT-4 have restrictions on commercial usage imposed by model providers. To improve upon both generated responses and attribute labeling quality, we release HelpSteer2, a permissively licensed preference dataset (CC-BY-4.0). Using a powerful internal base model trained on HelpSteer2, we are able to achieve the SOTA score (92.0%) on Reward-Bench's primary dataset, outperforming currently listed open and proprietary models, as of June 12th, 2024. Notably, HelpSteer2 consists of only ten thousand response pairs, an order of magnitude fewer than existing preference datasets (e.g., HH-RLHF), which makes it highly efficient for training reward models. Our extensive experiments demonstrate that reward models trained with HelpSteer2 are effective in aligning LLMs. In particular, we propose SteerLM 2.0, a model alignment approach that can effectively make use of the rich multi-attribute score predicted by our reward models. HelpSteer2 is available at https://huggingface.co/datasets/nvidia/HelpSteer2 and code is available at https://github.com/NVIDIA/NeMo-Aligner

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses the need for high-quality preference datasets for training reward models, which in turn guide large language models (LLMs) toward responses that align with human preferences. As LLMs become more powerful and better aligned, permissively licensed preference datasets such as Open Assistant, HH-RLHF, and HelpSteer need to be updated to remain effective. Preference data distilled from proprietary LLMs such as GPT-4 carries restrictions on commercial use, so the authors release HelpSteer2, a permissively licensed dataset (CC-BY-4.0) intended to improve both response quality and attribute annotation quality.

HelpSteer2 consists of approximately 10,000 response pairs, an order of magnitude fewer than existing preference datasets such as HH-RLHF, yet it is still sufficient to train effective reward models. A reward model trained on it achieves the state-of-the-art score (92.0%) on the primary dataset of RewardBench, surpassing the open and proprietary models listed as of June 12th, 2024. The authors also propose SteerLM 2.0, a model alignment method that uses the rich multi-attribute scores predicted by the reward model to train models to follow complex instructions with multiple requirements.

The data collection process includes obtaining prompts from sources such as ShareGPT, clustering them with BERTopic, and generating responses with internal LLMs, Mixtral-8x7B-Instruct-v0.1, and human annotators. The authors also improve the annotation process by increasing the number of annotators, and they document the data collection and processing steps in detail. The result is a compact and diverse dataset for training strong reward models that better align LLMs.
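To make the idea of multi-attribute scores over response pairs concrete, here is a minimal Python sketch of one way two rated responses to the same prompt could be turned into a chosen/rejected preference pair. It is an illustration only, not the authors' pipeline: the attribute names in `ATTRIBUTES`, the `RatedResponse` type, the weighting scheme, and the example weights are all assumptions introduced here and are not stated in the summary above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Assumed attribute names, for illustration only; the summary above says
# "multi-attribute" without listing the attributes.
ATTRIBUTES = ("helpfulness", "correctness", "coherence", "complexity", "verbosity")


@dataclass
class RatedResponse:
    """One response to a prompt, with one numeric rating per attribute."""
    prompt: str
    response: str
    scores: Dict[str, float]


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Collapse per-attribute ratings into one scalar via a weighted sum.

    Attributes missing from `weights` or `scores` contribute zero.
    """
    return sum(weights.get(name, 0.0) * scores.get(name, 0.0) for name in ATTRIBUTES)


def to_preference_pair(
    a: RatedResponse, b: RatedResponse, weights: Dict[str, float]
) -> Optional[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected), or None when the two responses tie."""
    if a.prompt != b.prompt:
        raise ValueError("both responses must answer the same prompt")
    score_a = weighted_score(a.scores, weights)
    score_b = weighted_score(b.scores, weights)
    if score_a == score_b:
        return None  # no preference signal in a tie
    chosen, rejected = (a, b) if score_a > score_b else (b, a)
    return (a.prompt, chosen.response, rejected.response)


# Hypothetical weights: reward helpfulness, mildly penalize verbosity.
example_weights = {"helpfulness": 1.0, "verbosity": -0.2}
first = RatedResponse("What is 2 + 2?", "4", {"helpfulness": 4, "verbosity": 0})
second = RatedResponse("What is 2 + 2?", "It might be 5.", {"helpfulness": 1, "verbosity": 1})
pair = to_preference_pair(first, second, example_weights)
```

Returning `None` on ties is a design choice in this sketch: a pair with equal scores carries no preference signal, so a pairwise training setup would likely skip it rather than pick a side arbitrarily.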

HelpSteer2-Preference: Complementing Ratings with Preferences

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

How to Evaluate Reward Models for RLHF

Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization

Towards Comprehensive Preference Data Collection for Reward Modeling

RewardBench: Evaluating Reward Models for Language Modeling

Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison

Everyone Deserves A Reward: Learning Customized Human Preferences

Hummer: Towards Limited Competitive Preference Dataset

UltraFeedback: Boosting Language Models with High-quality Feedback

LIRE: listwise reward enhancement for preference alignment

Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both

Listwise Reward Estimation for Offline Preference-based Reinforcement Learning

Elephant in the Room: Unveiling the Impact of Reward Model Quality in Alignment

Tool-Augmented Reward Modeling

Online Self-Preferring Language Models

General Preference Modeling with Preference Representations for Aligning Language Models