garak: A Framework for Security Probing Large Language Models

Leon Derczynski,Erick Galinkin,Jeffrey Martin,Subho Majumdar,Nanna Inie
2024-06-17
Abstract:As Large Language Models (LLMs) are deployed and integrated into thousands of applications, the need for scalable evaluation of how models respond to adversarial attacks grows rapidly. However, LLM security is a moving target: models produce unpredictable output, are constantly updated, and the potential adversary is highly diverse: anyone with access to the internet and a decent command of natural language. Further, what constitutes a security weak in one context may not be an issue in a different context; one-fits-all guardrails remain theoretical. In this paper, we argue that it is time to rethink what constitutes ``LLM security'', and pursue a holistic approach to LLM security evaluation, where exploration and discovery of issues are central. To this end, this paper introduces garak (Generative AI Red-teaming and Assessment Kit), a framework which can be used to discover and identify vulnerabilities in a target LLM or dialog system. garak probes an LLM in a structured fashion to discover potential vulnerabilities. The outputs of the framework describe a target model's weaknesses, contribute to an informed discussion of what composes vulnerabilities in unique contexts, and can inform alignment and policy discussions for LLM deployment.
Computation and Language,Cryptography and Security
What problem does this paper attempt to address?
The paper primarily focuses on the issue of security assessment for large language models (LLMs), especially as the demand for their security evaluation grows with their widespread deployment in various applications. The paper points out that traditional security assessment methods struggle to cope with the evolving characteristics of LLMs and the diverse potential threats they face. Therefore, the authors propose a new framework called garak (Generative AI Red-teaming and Assessment Kit). The garak framework aims to conduct security audits of LLMs in a structured manner, promoting the exploration and discovery of security issues. Specifically, the framework includes the following key components: 1. **Generators**: Any object or system responsible for generating text. 2. **Probes**: A series of tests designed to elicit specific types of vulnerabilities from the target LLM. 3. **Detectors**: Tools used to automatically identify failure patterns in the model's responses. 4. **Buffs**: Modifications to the interaction between probes and generators to reveal more potential issues. Through the collaborative work of these components, garak can test for different security issues and provide detailed reports on the weaknesses of the target model. Additionally, garak supports attack generation capabilities, allowing it to adaptively generate new test cases based on previous successful attempts. Overall, the goal of the paper is to promote the security assessment of LLMs by proposing a new, flexible, and scalable framework, thereby helping researchers and developers better understand and address the security challenges of these complex systems.