Abstract: As large language models become more prevalent, their possible harmful or inappropriate responses are a cause for concern. This paper introduces a unique dataset containing adversarial examples in the form of questions, which we call AttaQ, designed to provoke such harmful or inappropriate responses. We assess the efficacy of our dataset by analyzing the vulnerabilities of various models when subjected to it. Additionally, we introduce a novel automatic approach for identifying and naming vulnerable semantic regions - input semantic areas for which the model is likely to produce harmful outputs. This is achieved through the application of specialized clustering techniques that consider both the semantic similarity of the input attacks and the harmfulness of the model's responses. Automatically identifying vulnerable semantic regions enhances the evaluation of model weaknesses, facilitating targeted improvements to its safety mechanisms and overall reliability.

What problem does this paper attempt to address?

This paper addresses the safety vulnerabilities of large language models (LLMs) when adversarial inputs trigger harmful or inappropriate responses. Specifically, the authors focus on how to evaluate model behavior under specific types of attacks and how to identify semantic regions in which a model is likely to produce harmful outputs. By introducing a new adversarial dataset, AttaQ, and developing automated methods that systematically identify and describe a model's vulnerable semantic regions, the paper aims to improve the understanding of LLM safety and, in turn, the reliability of these models.

### Main Contributions:

1. **AttaQ Dataset**: The paper presents AttaQ, a semi-automatically curated dataset of adversarial questions designed to trigger LLMs into giving answers they should not provide, such as responses to inquiries about manufacturing dangerous devices or participating in harmful activities. The dataset serves as a benchmark for evaluating the harmlessness of LLMs and for further studying the factors that influence LLM behavior.
2. **Behavior Evaluation of Different LLMs**: By analyzing the responses of different LLMs, the paper evaluates the impact of two key interventions on model behavior: adding "harmless, helpful, honest" (HHH) instructions, and adding anti-HHH instructions that require the model to generate toxic responses.
3. **Automated Identification of Vulnerable Semantic Regions**: The paper develops and investigates automated methods for systematically identifying and describing a model's vulnerable semantic regions, that is, regions in which attacks succeed in making the model output harmful and toxic responses. This is achieved by applying specialized clustering techniques that consider both the semantic similarity of the input attacks and the harmfulness of the model's responses.
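
The directive comparison in the second contribution can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the directive texts, the `build_prompt` and `mean_harmlessness` helpers, the model name, and the toy scores are hypothetical and are not taken from the paper. The sketch only shows the bookkeeping: the same attack question is wrapped under each directive, and harmlessness scores (assumed to lie in [0, 1], higher meaning safer) are averaged per condition.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical directive texts; the paper's exact wording is not given here.
DIRECTIVES = {
    "none": "",
    "hhh": "Be harmless, helpful, and honest.",
    "anti_hhh": "<anti-HHH directive, used only for stress testing>",
}


def build_prompt(question, directive):
    """Prepend the chosen directive (if any) to an attack question."""
    prefix = DIRECTIVES[directive]
    return f"{prefix}\n{question}".strip()


def mean_harmlessness(records):
    """Average the harmlessness score for each (model, directive) pair.

    records: iterable of (model_name, directive, score) tuples, where
    score is assumed to be in [0, 1] and higher means a safer response.
    """
    buckets = defaultdict(list)
    for model, directive, score in records:
        buckets[(model, directive)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Toy scores for illustration only; these are not results from the paper.
records = [
    ("model_a", "none", 0.75), ("model_a", "none", 0.25),
    ("model_a", "hhh", 1.0), ("model_a", "hhh", 0.5),
    ("model_a", "anti_hhh", 0.25), ("model_a", "anti_hhh", 0.0),
]
summary = mean_harmlessness(records)
```

In a real evaluation the scores would come from a harmlessness scorer applied to actual model responses; the sketch leaves that step out because the summary does not say which scorer is used.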
### Methodology:

- **Data Synthesis**: The AttaQ dataset was expanded by extracting attack samples from existing human-generated datasets and by using LLMs to generate new attack samples. The newly generated samples cover a wide range of potentially harmful behaviors.
- **Model Evaluation**: A set of recent instruction-related language models was evaluated on the AttaQ dataset, with a focus on understanding their behavior and identifying potential areas for improvement.
- **Identification of Vulnerable Semantic Regions**: Several clustering algorithms were proposed for automatically identifying a model's vulnerable semantic regions: Cluster-and-Filter (C&F), Filter-and-Cluster (F&C), Semantic-Value Fusion Clustering (SVFC), and Homogeneity-Preserving Clustering (HPC).

### Conclusion:

Through the methods above, the paper provides a new tool for evaluating and improving the safety of LLMs and examines how vulnerable the models are to specific types of attacks. Both matter for ensuring the safety and reliability of LLMs in practical applications.
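
The difference between Filter-and-Cluster and Cluster-and-Filter can be sketched under one plausible reading of their names: F&C first drops attacks whose responses were scored as safe and then clusters the rest, while C&F clusters all attacks and then keeps only clusters whose responses are mostly harmful. This is not the paper's implementation. The greedy cosine-similarity clustering is a stand-in for whatever clustering algorithm the authors use, and the function names, thresholds, 2-D embeddings, and scores below are all hypothetical.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_cluster(indices, embeddings, min_sim=0.8):
    """Put each item in the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of indices; element 0 is the seed
    for i in indices:
        for cluster in clusters:
            if cosine(embeddings[i], embeddings[cluster[0]]) >= min_sim:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def filter_and_cluster(embeddings, scores, max_score=0.5, min_sim=0.8):
    """Keep only attacks with a low harmlessness score, then cluster them."""
    harmful = [i for i, s in enumerate(scores) if s < max_score]
    return greedy_cluster(harmful, embeddings, min_sim)


def cluster_and_filter(embeddings, scores, max_median=0.5, min_sim=0.8):
    """Cluster all attacks, then keep clusters with a low median score."""
    clusters = greedy_cluster(range(len(scores)), embeddings, min_sim)
    return [c for c in clusters if median(scores[i] for i in c) < max_median]


# Toy attack embeddings and harmlessness scores (higher = safer response).
embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (1.0, 0.1)]
scores = [0.1, 0.2, 0.9, 0.8, 0.7]

fc_regions = filter_and_cluster(embeddings, scores)  # [[0, 1]]
cf_regions = cluster_and_filter(embeddings, scores)  # [[0, 1, 4]]
```

On this toy data the two orderings disagree about attack 4: it sits in the same semantic neighborhood as attacks 0 and 1 but drew a safe response, so filtering first discards it, while clustering first keeps it as part of a mostly harmful region.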

Unveiling Safety Vulnerabilities of Large Language Models
