PAL: Proxy-Guided Black-Box Attack on Large Language Models

Chawin Sitawarin, Norman Mu, David Wagner, Alexandre Araujo
2024-02-15
Abstract: Large Language Models (LLMs) have surged in popularity in recent months, but they have demonstrated concerning capabilities to generate harmful content when manipulated. While techniques like safety fine-tuning aim to minimize harmful use, recent works have shown that LLMs remain vulnerable to attacks that elicit toxic responses. In this work, we introduce the Proxy-Guided Attack on LLMs (PAL), the first optimization-based attack on LLMs in a black-box query-only setting. In particular, it relies on a surrogate model to guide the optimization and a sophisticated loss designed for real-world LLM APIs. Our attack achieves 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, compared to 4% for the current state of the art. We also propose GCG++, an improvement to the GCG attack that reaches 94% ASR on white-box Llama-2-7B, and the Random-Search Attack on LLMs (RAL), a strong but simple baseline for query-based attacks. We believe the techniques proposed in this work will enable more comprehensive safety testing of LLMs and, in the long term, the development of better security guardrails. The code can be found at
Computation and Language, Artificial Intelligence, Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
This paper addresses the security of large language models (LLMs), specifically their tendency to generate harmful content when given malicious inputs. Safety fine-tuning reduces such output, but existing research shows that LLMs remain vulnerable to attacks that elicit toxic or inappropriate responses. The paper introduces the Proxy-Guided Attack on LLMs (PAL), which the authors describe as the first optimization-based attack on LLMs in a black-box, query-only setting. PAL uses an open-source proxy (surrogate) model to guide the optimization and a loss function designed for real-world LLM APIs, which reduces the number of queries sent to the target LLM. The paper also proposes GCG++, an improved version of the white-box GCG attack, and RAL, a black-box attack based on random search that serves as a simple but strong baseline for query-based attacks. These methods aim to raise the attack success rate while lowering cost, providing tools for evaluating and improving the safety of LLMs.
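The general idea of using a cheap proxy to cut down queries to an expensive black-box target can be illustrated with a toy search problem. The sketch below is not the paper's algorithm and involves no language model: the function `proxy_guided_search`, the integer "vocabulary", and both loss functions are hypothetical stand-ins chosen only to show how ranking candidates with a proxy limits the number of target queries per step.

```python
import random

def proxy_guided_search(target_loss, proxy_loss, init, vocab,
                        steps=50, n_candidates=32, top_k=4, seed=0):
    """Toy proxy-guided search (illustrative assumption, not PAL itself).

    Each step proposes n_candidates single-position mutations, ranks them
    with the cheap proxy_loss, and sends only the top_k to target_loss.
    """
    rng = random.Random(seed)
    best = list(init)
    best_loss = target_loss(best)
    queries = 1  # count every call to the expensive target
    for _ in range(steps):
        # Propose candidates by changing one random position.
        candidates = []
        for _ in range(n_candidates):
            cand = list(best)
            cand[rng.randrange(len(cand))] = rng.choice(vocab)
            candidates.append(cand)
        # Rank with the proxy; lower proxy loss is assumed more promising.
        candidates.sort(key=proxy_loss)
        # Query the target only on the proxy's top_k picks.
        for cand in candidates[:top_k]:
            loss = target_loss(cand)
            queries += 1
            if loss < best_loss:
                best, best_loss = cand, loss
    return best, best_loss, queries

# Hypothetical toy setup: the target scores mismatches against a hidden
# sequence; the proxy only knows an imperfect approximation of it.
hidden = [3, 1, 4, 1, 5, 9, 2, 6]
rough = [3, 1, 4, 1, 5, 0, 0, 6]

def target_loss(x):
    return sum(a != b for a, b in zip(x, hidden))

def proxy_loss(x):
    return sum(a != b for a, b in zip(x, rough))

init = [0] * len(hidden)
best, best_loss, queries = proxy_guided_search(
    target_loss, proxy_loss, init, vocab=list(range(10)))
print(best_loss, queries)
```

With these defaults the target is queried `top_k` times per step rather than `n_candidates` times, which is the query saving the proxy is meant to provide in this toy setting. The positions where `rough` disagrees with `hidden` show the limitation: a proxy that is wrong about part of the problem can steer the search away from improvements the target would have accepted.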