PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference

Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, Yaodong Yang
2024-10-16
Abstract: In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we separate annotations of helpfulness and harmlessness for question-answering pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels for 19 harm categories and three severity levels ranging from minor to severe, with answers generated by Llama-family models. Based on this, we collected 166.8k preference data, including dual-preference (helpfulness and harmlessness decoupled) and single-preference data (trade-off the helpfulness and harmlessness from scratch), respectively. Using the large-scale annotation data, we further train severity-sensitive moderation for the risk control of LLMs and safety-centric RLHF algorithms for the safety alignment of LLMs. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses safety alignment in large language models (LLMs). Specifically, it aims to advance safety alignment research by providing a large-scale safety preference dataset, PKU-SafeRLHF. The dataset contains 44.6k refined prompts and 265k question-answer pairs, annotated with safety meta-labels covering 19 harm categories and three severity levels (minor, medium, and severe), with answers generated by Llama-family models. Building on these question-answer pairs, the authors also collected 166.8k preference data, comprising dual-preference data (helpfulness and harmlessness labeled separately) and single-preference data (helpfulness and harmlessness traded off from scratch). The main contributions of the paper are as follows:

1. **Dataset construction**: The paper provides a large-scale safety preference dataset with detailed harm classification and severity grading, which supports finer-grained evaluation and improvement of LLM safety.
2. **Safety alignment method**: Using the large-scale annotation data, the authors train a severity-sensitive moderation model for risk control of LLMs, and apply safety-centric reinforcement learning from human feedback (RLHF) algorithms to the safety alignment of LLMs.
3. **Application verification**: Experiments verify the effectiveness of the dataset, showing that training on PKU-SafeRLHF improves both the safety and the helpfulness of the model.

In conclusion, by constructing and applying the PKU-SafeRLHF dataset, this paper provides a resource and methodological support for research on the safety alignment of LLMs.
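
To make the dual-preference idea concrete, the sketch below models one annotated comparison in Python. It is an illustration only: the class names, field names, severity encoding, and the placeholder category string are assumptions made for this example and are not taken from the released dataset's schema. It shows why decoupling matters: the answer judged more helpful need not be the answer judged safer.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical encoding: 0 means no harm, 1-3 are the three severity levels."""
    NONE = 0
    MINOR = 1
    MEDIUM = 2
    SEVERE = 3


@dataclass(frozen=True)
class Answer:
    """One answer with its safety meta-labels (field names are assumptions)."""
    text: str
    harm_categories: tuple[str, ...]  # subset of the 19 harm categories
    severity: Severity

    @property
    def is_safe(self) -> bool:
        return self.severity == Severity.NONE


@dataclass(frozen=True)
class DualPreference:
    """A comparison of two answers with helpfulness and harmlessness labeled separately."""
    prompt: str
    answers: tuple[Answer, Answer]
    more_helpful: int  # index (0 or 1) of the answer preferred for helpfulness
    safer: int         # index (0 or 1) of the answer preferred for harmlessness

    def labels_conflict(self) -> bool:
        """True when the more helpful answer is not the safer one."""
        return self.more_helpful != self.safer


# Placeholder record: the strings are stand-ins, not dataset content.
example = DualPreference(
    prompt="(some sensitive question)",
    answers=(
        Answer("(detailed but risky answer)", ("example-category",), Severity.MINOR),
        Answer("(cautious answer)", (), Severity.NONE),
    ),
    more_helpful=0,
    safer=1,
)

print(example.labels_conflict())
```

A single-preference record would, under the same assumptions, carry only one preferred index that already reflects the annotator's trade-off between the two attributes.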