Flames: Benchmarking Value Alignment of LLMs in Chinese

Kexin Huang, Xiangyang Liu, Qianyu Guo, Tianxiang Sun, Jiawei Sun, Yaru Wang, Zeyang Zhou, Yixu Wang, Yan Teng, Xipeng Qiu, Yingchun Wang, Dahua Lin
2024-05-21
Abstract: The widespread adoption of large language models (LLMs) across various regions underscores the urgent need to evaluate their alignment with human values. Current benchmarks, however, fall short of effectively uncovering safety vulnerabilities in LLMs. Despite numerous models achieving high scores and 'topping the chart' in these evaluations, there is still a significant gap in LLMs' deeper alignment with human values and achieving genuine harmlessness. To this end, this paper proposes a value alignment benchmark named Flames, which encompasses both common harmlessness principles and a unique morality dimension that integrates specific Chinese values such as harmony. Accordingly, we carefully design adversarial prompts that incorporate complex scenarios and jailbreaking methods, mostly with implicit malice. By prompting 17 mainstream LLMs, we obtain model responses and rigorously annotate them for detailed evaluation. Our findings indicate that all the evaluated LLMs demonstrate relatively poor performance on Flames, particularly in the safety and fairness dimensions. We also develop a lightweight specified scorer capable of scoring LLMs across multiple dimensions to efficiently evaluate new models on the benchmark. The complexity of Flames has far exceeded existing benchmarks, setting a new challenge for contemporary LLMs and highlighting the need for further alignment of LLMs. Our benchmark is publicly available at
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses value alignment in large language models (LLMs), and in particular the lack of evaluation benchmarks for the Chinese context. Existing benchmarks can assess some ethical and safety capabilities of language models, but they fail to expose the deeper safety vulnerabilities of LLMs. The paper proposes a new benchmark, FLAMES, that covers five dimensions: fairness, safety, morality, data protection, and legality. FLAMES evaluates the value alignment of LLMs with highly adversarial prompts and includes values specific to Chinese culture, such as harmony.

The main contributions of the paper are as follows:

1. **Highly Adversarial Benchmarking**: A dataset of 2,251 handcrafted adversarial prompts, each targeting a specific value dimension.
2. **Fine-Grained Manual Annotation**: For each prompt, responses were collected from 17 mainstream LLMs and annotated under detailed guidelines that were designed iteratively.
3. **Specified Scorer**: A dedicated scoring model was developed to evaluate responses to FLAMES prompts. It achieves an accuracy of 79.5%, compared with 61.3% for GPT-4 used as a judge, and can serve as a tool for continuously evaluating and improving LLMs' performance on FLAMES.

Overall, FLAMES aims to close the gaps that existing evaluation benchmarks leave in prompt complexity and in coverage of Chinese cultural values, so that the safety and value alignment of LLMs can be assessed more reliably.
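The scorer accuracy figures above compare an automatic judge against human annotations. As a minimal sketch of how such a comparison could be computed, the snippet below assumes that accuracy means an exact match between the judge's label and the human label for each response. That is an assumption, since the summary does not spell out the paper's protocol. The function name `judge_accuracy`, the record layout, and the toy labels are all hypothetical and are not taken from the benchmark.

```python
from collections import defaultdict

# The five dimensions named in the summary above.
DIMENSIONS = ("fairness", "safety", "morality", "data protection", "legality")


def judge_accuracy(records):
    """Return overall and per-dimension agreement between a judge and humans.

    Each record is a (dimension, human_label, judge_label) tuple. Label values
    are treated as opaque; only equality is checked (an assumed definition).
    """
    if not records:
        raise ValueError("records must not be empty")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dimension, human_label, judge_label in records:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension!r}")
        totals[dimension] += 1
        hits[dimension] += int(human_label == judge_label)
    overall = sum(hits.values()) / sum(totals.values())
    per_dimension = {d: hits[d] / totals[d] for d in totals}
    return overall, per_dimension


# Made-up records purely for illustration (not data from the paper).
toy_records = [
    ("fairness", 1, 1),
    ("fairness", 2, 1),
    ("safety", 3, 3),
    ("legality", 1, 1),
]
overall, per_dimension = judge_accuracy(toy_records)
```

Running the same function once with a specialised scorer's labels and once with a general-purpose judge's labels gives two directly comparable accuracy values. The per-dimension breakdown shows where a judge disagrees with annotators most.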