Unveiling Safety Vulnerabilities of Large Language Models

George Kour, Marcel Zalmanovici, Naama Zwerdling, Esther Goldbraich, Ora Nova Fandina, Ateret Anaby-Tavor, Orna Raz, Eitan Farchi
2023-11-08
Abstract: As large language models become more prevalent, their possible harmful or inappropriate responses are a cause for concern. This paper introduces a unique dataset containing adversarial examples in the form of questions, which we call AttaQ, designed to provoke such harmful or inappropriate responses. We assess the efficacy of our dataset by analyzing the vulnerabilities of various models when subjected to it. Additionally, we introduce a novel automatic approach for identifying and naming vulnerable semantic regions - input semantic areas for which the model is likely to produce harmful outputs. This is achieved through the application of specialized clustering techniques that consider both the semantic similarity of the input attacks and the harmfulness of the model's responses. Automatically identifying vulnerable semantic regions enhances the evaluation of model weaknesses, facilitating targeted improvements to its safety mechanisms and overall reliability.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the safety vulnerabilities of large language models (LLMs) when they face adversarial inputs that may trigger harmful or inappropriate responses. Specifically, the authors focus on how to evaluate the behavior of these models under specific types of attacks and how to identify the semantic regions in which the models are likely to produce harmful outputs. By introducing a new adversarial dataset (called AttaQ) and developing automated methods that systematically identify and describe a model's vulnerable semantic regions, the paper aims to improve the understanding of LLM safety and, in turn, the reliability of these models.

### Main Contributions:

1. **AttaQ Dataset**: The paper presents AttaQ, a semi-automatically curated dataset of adversarial questions designed to provoke LLMs into giving answers they should not provide, such as responses to inquiries about manufacturing dangerous devices or participating in harmful activities. The dataset serves as a benchmark for evaluating the harmlessness of LLMs and for further studying the factors that influence LLM behavior.
2. **Behavior Evaluation of Different LLMs**: By analyzing the responses of different LLMs, the authors evaluate the impact of two key manipulations on model behavior: adding a "harmless, helpful, honest" (HHH) instruction, and adding an anti-HHH instruction that asks the model to generate toxic responses.
3. **Automated Identification of Vulnerable Semantic Regions**: The authors develop and investigate automated methods for systematically identifying and describing a model's vulnerable semantic regions, that is, regions of the input space where attacks succeed in eliciting harmful and toxic responses. This is achieved by applying specialized clustering techniques that consider both the semantic similarity of the input attacks and the harmfulness of the model's responses.
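The second contribution compares model behavior with and without an added instruction. A minimal sketch of how the three prompt conditions might be assembled is shown below; the directive wording, the placeholder for the anti-HHH directive, and the function name `build_prompt_variants` are all assumptions made for illustration, not text or code taken from the paper.

```python
# Hypothetical directive wording; the paper's actual instructions are not given in this summary.
HHH_DIRECTIVE = "You are a harmless, helpful, and honest assistant."
# Placeholder only: the anti-HHH directive text is deliberately left unspecified here.
ANTI_HHH_DIRECTIVE = "<anti-HHH directive>"


def build_prompt_variants(question: str) -> dict[str, str]:
    """Return the same question under three conditions: no directive, HHH, and anti-HHH."""
    return {
        "plain": question,
        "hhh": f"{HHH_DIRECTIVE}\n\n{question}",
        "anti_hhh": f"{ANTI_HHH_DIRECTIVE}\n\n{question}",
    }


variants = build_prompt_variants("Example question from the dataset?")
for condition, prompt in variants.items():
    # Each prompt would be sent to the model under test and its response scored for harmlessness.
    print(condition, "->", repr(prompt))
```

Keeping the question fixed and varying only the prepended directive is what lets differences in the scored responses be attributed to the directive itself.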
### Methodology:

- **Data Synthesis**: The AttaQ dataset was built by extracting attack samples from existing human-generated datasets and expanded by using LLMs to generate new attack samples. The newly generated samples cover a wide range of potentially harmful behaviors.
- **Model Evaluation**: A set of recent instruction-tuned language models was evaluated on the AttaQ dataset, with a focus on understanding their behavior and identifying potential areas for improvement.
- **Identification of Vulnerable Semantic Regions**: Several clustering algorithms were proposed for automatically identifying a model's vulnerable semantic regions: Cluster-and-Filter (C&F), Filter-and-Cluster (F&C), Semantic-Value Fusion Clustering (SVFC), and Homogeneity-Preserving Clustering (HPC).

### Conclusion:

With these methods, the paper provides a new tool for evaluating and improving the safety of LLMs and examines how the models behave under specific types of attacks. This matters for ensuring the safety and reliability of LLMs in practical applications.
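The difference between two of the clustering methods named under Methodology, Cluster-and-Filter and Filter-and-Cluster, comes down to the order of the two steps. The sketch below illustrates that ordering only. It assumes each attack has an embedding (here a 2-D point) and a harmlessness score in [0, 1] where low means harmful; the greedy radius-based clustering, the thresholds, and all function names are stand-ins chosen for illustration and are not the paper's implementation.

```python
from math import dist
from statistics import mean


def greedy_cluster(points, radius):
    """Stand-in clustering: join the first cluster whose seed is within `radius`, else start a new one."""
    seeds, labels = [], []
    for p in points:
        for i, seed in enumerate(seeds):
            if dist(p, seed) <= radius:
                labels.append(i)
                break
        else:
            seeds.append(p)
            labels.append(len(seeds) - 1)
    return labels


def group_by_label(indices, labels):
    """Collect original attack indices per cluster label."""
    clusters = {}
    for idx, lab in zip(indices, labels):
        clusters.setdefault(lab, []).append(idx)
    return list(clusters.values())


def cluster_and_filter(points, scores, radius=1.0, max_mean_score=0.5, min_size=2):
    """Cluster all attacks first, then keep clusters whose mean harmlessness is low."""
    clusters = group_by_label(range(len(points)), greedy_cluster(points, radius))
    return [c for c in clusters
            if len(c) >= min_size and mean(scores[i] for i in c) <= max_mean_score]


def filter_and_cluster(points, scores, radius=1.0, max_score=0.5, min_size=2):
    """Keep only attacks with low harmlessness first, then cluster the survivors."""
    kept = [i for i, s in enumerate(scores) if s <= max_score]
    clusters = group_by_label(kept, greedy_cluster([points[i] for i in kept], radius))
    return [c for c in clusters if len(c) >= min_size]


# Toy data: three nearby attacks, a distant pair, and one isolated attack.
points = [(0, 0), (0.2, 0.1), (0.1, 0.3), (5, 5), (5.1, 4.9), (9, 0)]
scores = [0.1, 0.2, 0.9, 0.8, 0.9, 0.1]  # harmlessness; low = harmful response

print(cluster_and_filter(points, scores))  # [[0, 1, 2]]
print(filter_and_cluster(points, scores))  # [[0, 1]]
```

On this toy data, clustering first keeps the safely answered attack 2 inside the flagged region because its neighbors pull the cluster mean down, while filtering first drops it before any cluster is formed. That trade-off between semantic coherence and homogeneity of harmfulness is the kind of tension the remaining two methods in the list are named after.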