Target-driven Attack for Large Language Models

Chong Zhang, Mingyu Jin, Dong Shu, Taowen Wang, Dongfang Liu, Xiaobo Jin
DOI: https://doi.org/10.3233/FAIA240685
2024-11-09
Abstract: Current large language models (LLMs) provide a strong foundation for large-scale user-oriented natural language tasks. Many users can easily inject adversarial text or instructions through the user interface, causing security challenges such as the model failing to give the correct answer. Although there is a large amount of research on black-box attacks, most of these attacks use random and heuristic strategies. It is unclear how these strategies relate to the success rate of attacks, and thus how they can effectively improve model robustness. To solve this problem, we propose a target-driven black-box attack method that redefines the attack's goal as maximizing the KL divergence between the conditional probabilities of the clean text and the attack text. Based on this goal, we transform the distance maximization problem into two convex optimization problems, one to solve for the attack text and one to estimate the covariance. The projected gradient descent algorithm then solves for the vector corresponding to the attack text. Our target-driven black-box attack approach includes two attack strategies: token manipulation and misinformation attack. Experimental results on multiple large language models and datasets demonstrate the effectiveness of our attack method.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the security of large language models (LLMs), specifically how to conduct black-box attacks effectively so that the findings can be used to improve model robustness. Although black-box attacks have been studied extensively, most of them rely on random and heuristic strategies. The relationship between these strategies and the attack success rate is unclear, so they offer little guidance for improving robustness.

To solve this problem, the authors propose a target-driven black-box attack method that redefines the attack goal as maximizing the KL divergence between the conditional probabilities of the clean text and the attack text. Specifically, they transform the distance maximization problem into two convex optimization problems: one solves for the attack text, and the other estimates the covariance. The projected gradient descent algorithm is then used to solve for the vector corresponding to the attack text. The method includes two attack strategies: token manipulation and misinformation attack.

### Main contributions:

1. **Propose a new objective function**: Maximize the KL divergence between the two conditional probabilities to guide the attack algorithm toward the strongest attack.
2. **Theoretical proof**: Maximizing the KL divergence between normal text and attack text is approximately equivalent to maximizing the Mahalanobis distance between them, which clarifies how these statistics distinguish normal text from attack text in security analysis.
3. **Method implementation**: Transform the original problem into a convex optimization problem and obtain the vector representation of the attack text with the projected gradient descent algorithm. Based on this, two new black-box attack strategies are designed: token manipulation and misinformation attack. The experimental results verify the effectiveness of the proposed method.
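The link between KL divergence and Mahalanobis distance in contribution 2 can be checked numerically in one special case. The sketch below assumes that the clean and attack representations are Gaussian with a shared covariance matrix; that assumption, the 4-dimensional toy vectors, and the names `gaussian_kl` and `mahalanobis_sq` are illustrative and are not taken from the paper. In this case the KL divergence equals exactly half the squared Mahalanobis distance between the two means.

```python
import numpy as np

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """Closed-form KL( N(mu_p, cov_p) || N(mu_q, cov_q) )."""
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (
        np.trace(cov_q_inv @ cov_p)   # trace term
        + diff @ cov_q_inv @ diff     # squared Mahalanobis term
        - d
        + logdet_q - logdet_p         # log-determinant ratio
    )

def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance between x and y under covariance cov."""
    diff = x - y
    return float(diff @ np.linalg.solve(cov, diff))

# Toy data (hypothetical): random means and one shared, well-conditioned covariance.
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 4))
cov = a @ a.T + 4.0 * np.eye(4)
mu_clean = rng.normal(size=4)
mu_attack = rng.normal(size=4)

kl = gaussian_kl(mu_clean, cov, mu_attack, cov)
half_maha = 0.5 * mahalanobis_sq(mu_clean, mu_attack, cov)
print(kl, half_maha)  # the two values agree when the covariances are equal
```

With a shared covariance, the trace and log-determinant terms cancel, which is why only the Mahalanobis term remains. When the covariances differ, the equality becomes an approximation at best.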
### Method overview:

- **Threat model**: Given a text \(t\) containing multiple sentences, generate a new text \(t'\) to attack large language models such as ChatGPT while keeping the semantics of the original text \(t\) unchanged. If the output of the model \(M\) on \(t\) differs from its output on \(t'\), then \(t'\) is regarded as an adversarial sample or attack input.
- **Objective function**: Maximize the difference between the model's outputs on the clean text and the attack text while maintaining semantic similarity.
- **Optimization problem**: Transform the problem into a convex optimization problem and solve for the optimal attack vector with the projected gradient descent algorithm.

### Experimental results:

- **Experimental details**: ChatGPT and Llama-2 serve as victim models, and 300 questions are randomly selected from each dataset for testing. The evaluation metrics are clean accuracy, attack accuracy, and attack success rate (ASR).
- **Main attack results**: On the SQuAD2.0 and Math datasets, the proposed method performs well against multiple versions of ChatGPT and Llama, especially on math problems, where the attack success rate reaches 81.48%.
- **Comparison with other methods**: Compared with existing mainstream black-box attack methods, the proposed method performs better in the zero-shot setting, especially on the math dataset.

### Summary:

By proposing a target-driven black-box attack method, this paper raises the attack success rate against large language models, which in turn can help improve model robustness and security. The method is supported by theoretical analysis and shows good performance and transferability in experiments.
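To make the optimization step from the method overview more concrete, here is a minimal sketch of projected gradient ascent that pushes an attack vector away from a clean vector in Mahalanobis distance while staying inside an L2 ball around it. This is a generic illustration and not the authors' exact formulation: the L2-ball constraint, the step size, the diagonal covariance, and the names `project_l2_ball` and `pgd_maximize_mahalanobis` are all assumptions, and mapping the resulting vector back to attack text is not covered.

```python
import numpy as np

def project_l2_ball(z, center, radius):
    """Project z onto the L2 ball of the given radius around center."""
    delta = z - center
    norm = np.linalg.norm(delta)
    if norm > radius:
        delta = delta * (radius / norm)
    return center + delta

def pgd_maximize_mahalanobis(x, cov, radius=1.0, step=0.1, iters=500, seed=0):
    """Projected gradient ascent on (z - x)^T cov^{-1} (z - x), s.t. ||z - x||_2 <= radius."""
    rng = np.random.default_rng(seed)
    cov_inv = np.linalg.inv(cov)
    # Start from a small random offset; the gradient is zero exactly at z = x.
    z = project_l2_ball(x + 0.01 * rng.normal(size=x.shape), x, radius)
    for _ in range(iters):
        grad = 2.0 * cov_inv @ (z - x)                    # gradient of the objective
        z = project_l2_ball(z + step * grad, x, radius)   # ascent step, then projection
    return z

# Toy setup (hypothetical): a clean vector and an assumed diagonal covariance.
x_clean = np.zeros(4)
cov = np.diag([0.5, 1.0, 2.0, 4.0])
z_attack = pgd_maximize_mahalanobis(x_clean, cov, radius=1.0)

diff = z_attack - x_clean
objective = float(diff @ np.linalg.solve(cov, diff))
print(objective, np.linalg.norm(diff))
```

In this toy setting the iterate moves along the direction with the smallest variance, since a fixed-length offset in that direction yields the largest Mahalanobis distance. The radius plays the role of a budget on how far the attack vector may drift from the clean one.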