Open Sesame! Universal Black Box Jailbreaking of Large Language Models

Raz Lapid, Ron Langberg, Moshe Sipper
2024-08-05
Abstract: Large language models (LLMs), designed to provide helpful and safe responses, often rely on alignment techniques to align with user intent and social guidelines. Unfortunately, this alignment can be exploited by malicious actors seeking to manipulate an LLM's outputs for unintended purposes. In this paper we introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible. The GA attack works by optimizing a universal adversarial prompt that -- when combined with a user's query -- disrupts the attacked model's alignment, resulting in unintended and potentially harmful outputs. Our novel approach systematically reveals a model's limitations and vulnerabilities by uncovering instances where its responses deviate from expected behavior. Through extensive experiments we demonstrate the efficacy of our technique, thus contributing to the ongoing discussion on responsible AI development by providing a diagnostic tool for evaluating and enhancing alignment of LLMs with human intent. To our knowledge this is the first automated universal black box jailbreak attack.
Computation and Language, Computer Vision and Pattern Recognition, Neural and Evolutionary Computing
What problem does this paper attempt to address?
The problem this paper attempts to address is whether large language models (LLMs) can be "jailbroken" automatically, without any access to the models' internals. Specifically, the authors propose a method based on a genetic algorithm (GA) that automatically generates universal adversarial prompts capable of manipulating LLM behavior without knowledge of the model architecture or parameters, thereby causing the model to produce unintended and potentially harmful outputs.

### Background and Motivation

With the advent of large language models (LLMs), the field of artificial intelligence has changed profoundly. These complex neural networks, trained on large-scale text and code datasets, demonstrate powerful capabilities in generating high-quality, human-like text, translating languages, and producing creative content. However, while the potential applications of these models are extensive, their limitations and vulnerabilities have also raised concerns. Despite significant efforts to align LLMs with human values and social norms, risks of unintended bias and potential misuse remain.

"Jailbreaking" an LLM refers to exploiting weaknesses in its alignment to induce outputs that deviate from its intended purpose. Traditional "jailbreaking" methods often rely on handcrafted prompts or adversarial examples, which typically require extensive domain knowledge and significant manual effort.

### Main Contributions of the Paper

1. **Proposed a new black-box attack method**: Uses a genetic algorithm (GA) to automatically discover universal adversarial prompts capable of manipulating LLM behavior without access to the model's internal architecture or parameters.
2. **Validated the method's effectiveness**: Demonstrated experimentally the effectiveness and generality of the method on two open-source LLMs, and analyzed the effectiveness and transferability of the adversarial prompts.
3. **Revealed the vulnerability of LLMs**: The results indicate that LLMs are susceptible to adversarial attacks even when the attacker has no internal information, highlighting the need to strengthen LLM security.

### Method Overview

- **Genetic Algorithm (GA)**: By simulating the process of natural selection, the GA automatically searches for and optimizes adversarial prompts. The initial population consists of randomly generated prompts, which evolve into more effective prompts through selection, crossover, and mutation operations.
- **Fitness Function**: To evaluate the effectiveness of the prompts, the authors designed an indirect fitness function based on semantic alignment. It scores a prompt by the cosine similarity between vector representations of the model output and of the target output.
- **Experimental Setup**: The experiments used two well-known LLMs (LLaMA2-7b-chat and Vicuna-7b) and tested them on a dataset containing instances of harmful behavior.

### Experimental Results and Discussion

- **High Success Rate**: The experimental results show that the method can effectively induce LLMs to generate harmful content, and that it performs particularly well on certain tasks.
- **Transferability**: The adversarial prompts exhibit a certain degree of transferability between different models, indicating the generality of the method.
- **Security Implications**: The results emphasize the vulnerability of LLMs to adversarial attacks and call for developers and organizations to strengthen security measures for these models.

### Conclusion

The paper demonstrates automatic "jailbreaking" of large language models through a black-box attack method based on genetic algorithms. This finding not only reveals the security risks of existing LLMs but also provides new directions for future security research.
The authors call on researchers, developers, and policymakers to work together to ensure that the development of LLMs is both powerful and ethical, ultimately promoting social well-being.
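
To make the Method Overview more concrete, here is a minimal, generic sketch of a genetic-algorithm loop with a cosine-similarity fitness, run on a toy problem. It is not the authors' implementation and does not query any language model. The names `toy_embed`, `VOCAB_SIZE`, `PROMPT_LEN`, and `evolve` are hypothetical, the bag-of-tokens "embedding" merely stands in for a real text embedder, and tournament selection, one-point crossover, and elitism are illustrative operator choices that the summary above does not specify.

```python
import math
import random

VOCAB_SIZE = 20   # hypothetical toy vocabulary of integer token ids
PROMPT_LEN = 8    # hypothetical fixed length of each individual


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def toy_embed(tokens):
    """Assumed stand-in for a text embedder: a bag-of-tokens count vector."""
    vec = [0.0] * VOCAB_SIZE
    for t in tokens:
        vec[t] += 1.0
    return vec


def fitness(tokens, target_vec):
    """Score an individual by similarity between its toy embedding and the target."""
    return cosine_similarity(toy_embed(tokens), target_vec)


def tournament(population, scores, rng, k=3):
    """Selection: return the fittest of k randomly drawn individuals."""
    picks = rng.sample(range(len(population)), k)
    return population[max(picks, key=lambda i: scores[i])]


def crossover(a, b, rng):
    """One-point crossover: prefix of parent a joined to suffix of parent b."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(tokens, rng, rate=0.1):
    """Replace each token with a random one with probability `rate`."""
    return [rng.randrange(VOCAB_SIZE) if rng.random() < rate else t for t in tokens]


def evolve(target_vec, pop_size=30, generations=40, seed=0):
    """Run the GA; return the best individual and the best score per generation."""
    rng = random.Random(seed)
    population = [[rng.randrange(VOCAB_SIZE) for _ in range(PROMPT_LEN)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scores = [fitness(ind, target_vec) for ind in population]
        best_idx = max(range(pop_size), key=lambda i: scores[i])
        history.append(scores[best_idx])
        children = [population[best_idx]]  # elitism: keep the best unchanged
        while len(children) < pop_size:
            p1 = tournament(population, scores, rng)
            p2 = tournament(population, scores, rng)
            children.append(mutate(crossover(p1, p2, rng), rng))
        population = children
    scores = [fitness(ind, target_vec) for ind in population]
    best_idx = max(range(pop_size), key=lambda i: scores[i])
    history.append(scores[best_idx])
    return population[best_idx], history


if __name__ == "__main__":
    target = toy_embed([1, 2, 3, 4, 5, 6, 7, 8])  # arbitrary toy target vector
    best, history = evolve(target)
    print("best individual:", best)
    print("first and last best fitness:", history[0], history[-1])
```

Because the best individual is carried over unchanged each generation, the best fitness recorded in `history` never decreases. The fitness only needs a score per candidate, not gradients or model weights, which is what makes this style of search applicable in a black-box setting.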