ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain

Haochen Zhao,Xiangru Tang,Ziran Yang,Xiao Han,Xuanzhi Feng,Yueqing Fan,Senhao Cheng,Di Jin,Yilun Zhao,Arman Cohan,Mark Gerstein
2024-11-23
Abstract:The advancement and extensive application of large language models (LLMs) have been remarkable, including their use in scientific research assistance. However, these models often generate scientifically incorrect or unsafe responses, and in some cases, they may encourage users to engage in dangerous behavior. To address this issue in the field of chemistry, we introduce ChemSafetyBench, a benchmark designed to evaluate the accuracy and safety of LLM responses. ChemSafetyBench encompasses three key tasks: querying chemical properties, assessing the legality of chemical uses, and describing synthesis methods, each requiring increasingly deeper chemical knowledge. Our dataset has more than 30K samples across various chemical materials. We incorporate handcrafted templates and advanced jailbreaking scenarios to enhance task diversity. Our automated evaluation framework thoroughly assesses the safety, accuracy, and appropriateness of LLM responses. Extensive experiments with state-of-the-art LLMs reveal notable strengths and critical vulnerabilities, underscoring the need for robust safety measures. ChemSafetyBench aims to be a pivotal tool in developing safer AI technologies in chemistry. Our code and dataset are available at <a class="link-external link-https" href="https://github.com/HaochenZhao/SafeAgent4Chem" rel="external noopener nofollow">this https URL</a>. Warning: this paper contains discussions on the synthesis of controlled chemicals using AI models.
Computation and Language,Artificial Intelligence
What problem does this paper attempt to address?
The problem that this paper attempts to solve is the inaccurate or unsafe responses that large language models (LLMs) may generate in the application of the chemical field. Specifically, these models sometimes generate scientifically incorrect or dangerous answers, and may even encourage users to engage in dangerous behaviors. To address this challenge, the author introduced **ChemSafetyBench**, a benchmarking tool specifically designed to evaluate the accuracy and safety of LLMs in the chemical field. ### Main problems 1. **Scientific inaccuracy**: LLMs may make mistakes when providing chemical information, such as incorrectly describing the properties or synthesis methods of chemical substances. 2. **Safety issues**: LLMs may generate dangerous suggestions, such as providing synthesis methods for illegal chemical substances or incorrectly evaluating the safety of chemical substances. 3. **Lack of specialized safety assessment**: Existing LLM safety assessment frameworks usually do not cover the specific needs of the chemical field, resulting in insufficient assessment in this area. ### Solutions **ChemSafetyBench** solves the above problems in the following ways: - **Dataset construction**: It contains more than 30,000 samples, covering the properties, uses, and synthesis methods of various chemical substances. - **Task design**: It is divided into three main tasks: - **Querying chemical properties**: Evaluate whether the model accurately describes the properties of chemical substances. - **Evaluating the legality of chemical uses**: Determine whether the model can correctly evaluate the legal uses of chemical substances. - **Describing synthesis methods**: Test whether the model can safely provide the synthesis methods of chemical substances. - **Diversified testing**: Introduce handicraft templates and advanced jailbreak scenarios to enhance the diversity of tasks. - **Automated assessment framework**: Comprehensively evaluate the responses of LLM from three perspectives: correctness, rejection, and safety/quality trade - off. ### Goals - **Improve safety**: Ensure that LLMs do not generate dangerous or incorrect suggestions when processing chemical information. - **Promote research**: Provide a reliable assessment tool for the development of safer AI technologies. - **Drive cooperation**: Continuously improve models and assessment criteria through cooperation with chemical experts to improve accuracy and safety. ### Conclusion By introducing **ChemSafetyBench**, the author aims to fill the gap in safety assessment of existing assessment methods in the chemical field and provide support for the development of safer AI technologies. The experimental results show that current LLMs have significant weaknesses when processing chemical information and require further research and improvement.