Abstract:The advancement and extensive application of large language models (LLMs) have been remarkable, including their use in scientific research assistance. However, these models often generate scientifically incorrect or unsafe responses, and in some cases, they may encourage users to engage in dangerous behavior. To address this issue in the field of chemistry, we introduce ChemSafetyBench, a benchmark designed to evaluate the accuracy and safety of LLM responses. ChemSafetyBench encompasses three key tasks: querying chemical properties, assessing the legality of chemical uses, and describing synthesis methods, each requiring increasingly deeper chemical knowledge. Our dataset has more than 30K samples across various chemical materials. We incorporate handcrafted templates and advanced jailbreaking scenarios to enhance task diversity. Our automated evaluation framework thoroughly assesses the safety, accuracy, and appropriateness of LLM responses. Extensive experiments with state-of-the-art LLMs reveal notable strengths and critical vulnerabilities, underscoring the need for robust safety measures. ChemSafetyBench aims to be a pivotal tool in developing safer AI technologies in chemistry. Our code and dataset are available at <a class="link-external link-https" href="https://github.com/HaochenZhao/SafeAgent4Chem" rel="external noopener nofollow">this https URL</a>. Warning: this paper contains discussions on the synthesis of controlled chemicals using AI models.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is the inaccurate or unsafe responses that large language models (LLMs) may generate in the application of the chemical field. Specifically, these models sometimes generate scientifically incorrect or dangerous answers, and may even encourage users to engage in dangerous behaviors. To address this challenge, the author introduced **ChemSafetyBench**, a benchmarking tool specifically designed to evaluate the accuracy and safety of LLMs in the chemical field. ### Main problems 1. **Scientific inaccuracy**: LLMs may make mistakes when providing chemical information, such as incorrectly describing the properties or synthesis methods of chemical substances. 2. **Safety issues**: LLMs may generate dangerous suggestions, such as providing synthesis methods for illegal chemical substances or incorrectly evaluating the safety of chemical substances. 3. **Lack of specialized safety assessment**: Existing LLM safety assessment frameworks usually do not cover the specific needs of the chemical field, resulting in insufficient assessment in this area. ### Solutions **ChemSafetyBench** solves the above problems in the following ways: - **Dataset construction**: It contains more than 30,000 samples, covering the properties, uses, and synthesis methods of various chemical substances. - **Task design**: It is divided into three main tasks: - **Querying chemical properties**: Evaluate whether the model accurately describes the properties of chemical substances. - **Evaluating the legality of chemical uses**: Determine whether the model can correctly evaluate the legal uses of chemical substances. - **Describing synthesis methods**: Test whether the model can safely provide the synthesis methods of chemical substances. - **Diversified testing**: Introduce handicraft templates and advanced jailbreak scenarios to enhance the diversity of tasks. - **Automated assessment framework**: Comprehensively evaluate the responses of LLM from three perspectives: correctness, rejection, and safety/quality trade - off. ### Goals - **Improve safety**: Ensure that LLMs do not generate dangerous or incorrect suggestions when processing chemical information. - **Promote research**: Provide a reliable assessment tool for the development of safer AI technologies. - **Drive cooperation**: Continuously improve models and assessment criteria through cooperation with chemical experts to improve accuracy and safety. ### Conclusion By introducing **ChemSafetyBench**, the author aims to fill the gap in safety assessment of existing assessment methods in the chemical field and provide support for the development of safer AI technologies. The experimental results show that current LLMs have significant weaknesses when processing chemical information and require further research and improvement.

ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain

SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks

LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs

Agent-SafetyBench: Evaluating the Safety of LLM Agents

CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models

SafetyBench: Evaluating the Safety of Large Language Models

SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions

MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models

Are large language models superhuman chemists?

ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models

SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models

Validation of the Scientific Literature via Chemputation Augmented by Large Language Models

SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese

SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents

ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models

SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types

Assessment of chemistry knowledge in large language models that generate code

What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering

An Autonomous Large Language Model Agent for Chemical Literature Data Mining

A Chinese Dataset for Evaluating the Safeguards in Large Language Models