Abstract: In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we separate annotations of helpfulness and harmlessness for question-answer pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels for 19 harm categories and three severity levels ranging from minor to severe, with answers generated by Llama-family models. Based on this, we collected 166.8k preference data points, including dual-preference data (helpfulness and harmlessness annotated separately) and single-preference data (helpfulness and harmlessness traded off from scratch in a single label). Using the large-scale annotation data, we further train severity-sensitive moderation for the risk control of LLMs and safety-centric RLHF algorithms for the safety alignment of LLMs. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs.

What problem does this paper attempt to address?

This paper addresses the safety alignment of large language models (LLMs). Specifically, it aims to advance safety alignment research by providing a large-scale safety preference dataset, PKU-SafeRLHF. The dataset contains 44,600 refined prompts and 265,000 question-answer pairs, each tagged with safety meta-labels covering 19 harm categories and three severity levels ranging from minor to severe. Based on these question-answer pairs, 166,800 preference data points were also collected, including dual-preference data (helpfulness and harmlessness labeled separately) and single-preference data (helpfulness and harmlessness weighed against each other from scratch).

The main contributions of the paper are as follows:

1. **Dataset construction**: It provides a large-scale safety preference dataset with detailed harm classification and severity grading, which supports finer-grained evaluation and improvement of LLM safety.
2. **Safety alignment method**: Using the large-scale annotation data, the authors train a severity-sensitive moderation model for risk control of LLMs and apply safety-centric reinforcement learning from human feedback (RLHF) algorithms for the safety alignment of LLMs.
3. **Application verification**: Experiments are used to verify the dataset, indicating that training on PKU-SafeRLHF improves both the safety and the helpfulness of the model.

In conclusion, through the construction and application of the PKU-SafeRLHF dataset, this paper provides a resource and methodological support for safety alignment research on LLMs.
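To make the difference between dual-preference and single-preference annotation concrete, the following is a minimal sketch of how such records could be represented. All class and field names here (`Answer`, `DualPreferencePair`, `SinglePreferencePair`, `more_helpful`, `safer`, and so on) are hypothetical and chosen for illustration; they are not the dataset's actual schema, and the category string and the name of the middle severity level are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical encoding: 0 for a safe answer, 1-3 for the three severity levels."""
    SAFE = 0
    MINOR = 1
    INTERMEDIATE = 2  # placeholder name for the middle level
    SEVERE = 3


@dataclass
class Answer:
    text: str
    severity: Severity = Severity.SAFE
    # Subset of the 19 harm categories; empty when the answer is safe.
    harm_categories: list[str] = field(default_factory=list)


@dataclass
class DualPreferencePair:
    """Two answers to one prompt, with helpfulness and harmlessness ranked separately."""
    prompt: str
    answers: tuple[Answer, Answer]
    more_helpful: int  # index (0 or 1) of the answer preferred on helpfulness alone
    safer: int         # index (0 or 1) of the answer preferred on harmlessness alone

    def labels_conflict(self) -> bool:
        """True when the more helpful answer is not the safer one."""
        return self.more_helpful != self.safer


@dataclass
class SinglePreferencePair:
    """Two answers to one prompt, with one overall label that trades off both attributes."""
    prompt: str
    answers: tuple[Answer, Answer]
    preferred: int  # index (0 or 1) of the overall preferred answer


# Illustrative record with placeholder text, not taken from the dataset.
dual = DualPreferencePair(
    prompt="<prompt text>",
    answers=(
        Answer("<detailed but risky answer>", Severity.MINOR, ["<category name>"]),
        Answer("<cautious answer>"),
    ),
    more_helpful=0,
    safer=1,
)

# A single-preference annotator would resolve the same pair with one label.
single = SinglePreferencePair(dual.prompt, dual.answers, preferred=1)

print(dual.labels_conflict())
```

In this sketch, pairs where `labels_conflict()` is true are exactly the cases where the two attributes pull apart; a single-preference label hides that tension behind one index, whereas the dual-preference record keeps both judgments available.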

PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
