Abstract: Large language models (LLMs) have had a transformative impact on a variety of scientific tasks across disciplines such as biology, chemistry, medicine, and physics. However, ensuring the safety alignment of these models in scientific research remains an underexplored area, with existing benchmarks primarily focusing on textual content and overlooking key scientific representations such as molecular, protein, and genomic languages. Moreover, the safety mechanisms of LLMs in scientific tasks are insufficiently studied. To address these limitations, we introduce SciSafeEval, a comprehensive benchmark designed to evaluate the safety alignment of LLMs across a range of scientific tasks. SciSafeEval spans multiple scientific languages (textual, molecular, protein, and genomic) and covers a wide range of scientific domains. We evaluate LLMs in zero-shot, few-shot, and chain-of-thought settings, and introduce a 'jailbreak' enhancement feature that challenges LLMs equipped with safety guardrails, rigorously testing their defenses against malicious intent. Our benchmark surpasses existing safety datasets in both scale and scope, providing a robust platform for assessing the safety and performance of LLMs in scientific contexts. This work aims to facilitate the responsible development and deployment of LLMs, promoting alignment with safety and ethical standards in scientific research.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper aims to address the safety alignment issues of large language models (LLMs) in scientific research tasks. Specifically, the paper focuses on the following aspects:

1. **Limitations of Existing Benchmarks**:
   - Existing benchmarks mainly focus on textual content, neglecting key scientific representations such as molecules, proteins, and genomes.
   - Safety mechanisms are insufficiently studied, especially in scientific tasks.
   - Benchmarks cover a narrow range of scientific fields and lack evaluations in medical and physical sciences.
   - Dataset sizes are small, so they cannot comprehensively assess the safety and performance of models.

2. **Potential Risks in Scientific Tasks**:
   - Malicious actors may use LLMs to design harmful gene sequences, enhancing the infectivity or treatment resistance of pathogens.
   - LLMs may provide information on synthesizing controlled substances, lowering the technical barriers for illegal drug production.
   - LLMs may generate chemical representations of toxic compounds (e.g., SMILES or SELFIES), increasing the risk of misuse.
   - Predictions of more infectious SARS-CoV-2 variants could be used to design highly transmissible or vaccine-resistant pathogens.

3. **Solutions**:
   - Introducing a comprehensive benchmark, **SciSafeEval**, covering multiple scientific languages (text, molecules, proteins, genomes) and a wide range of scientific fields (chemistry, biology, medicine, physics).
   - Evaluating LLMs in zero-shot, few-shot, and chain-of-thought settings, and introducing "jailbreak" enhancements that challenge LLMs equipped with safety measures, testing their ability to resist malicious intent.
   - Providing a large-scale, high-quality dataset of 31,840 samples, surpassing the scale and scope of existing benchmarks.

### Summary

By introducing the **SciSafeEval** benchmark, the paper aims to comprehensively assess the safety alignment of large language models in scientific tasks and to promote their responsible development and deployment, in line with safety and ethical standards in scientific research.

SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks
