Abstract: Current large language models (LLMs) provide a strong foundation for large-scale, user-oriented natural language tasks. However, users can easily inject adversarial text or instructions through the user interface, creating security challenges such as the model failing to give the correct answer. Although there is a large body of research on black-box attacks, most of these attacks rely on random and heuristic strategies. It is unclear how these strategies relate to the attack success rate, and therefore how they can effectively improve model robustness. To address this, we propose a target-driven black-box attack method that redefines the attack's goal as maximizing the KL divergence between the conditional probabilities of the clean text and the attack text. Based on this goal, we transform the distance maximization problem into two convex optimization problems, which are used to solve for the attack text and to estimate the covariance. The projected gradient descent algorithm then solves for the vector corresponding to the attack text. Our target-driven black-box attack approach includes two attack strategies: token manipulation and misinformation attack. Experimental results on multiple large language models and datasets demonstrate the effectiveness of our attack method.

What problem does this paper attempt to address?

This paper addresses the security challenges of large language models (LLMs), in particular how to conduct black-box attacks effectively so that the results can be used to improve model robustness. Although there are many black-box attack studies, most of them use random and heuristic strategies, and the relationship between these strategies and the attack success rate is unclear, so they cannot effectively improve the model's robustness. To solve this problem, the authors propose a target-driven black-box attack method, which redefines the attack goal as maximizing the KL divergence between the conditional probabilities of the clean text and the attack text. Specifically, they transform the distance maximization problem into two convex optimization problems, used to solve for the attack text and to estimate the covariance. The projected gradient descent algorithm is then used to solve for the vector corresponding to the attack text. The method includes two attack strategies: token manipulation and misinformation attack.

### Main contributions:
1. **Propose a new objective function**: Maximize the KL divergence between the two conditional probabilities to guide the attack algorithm toward the best attack effect.
2. **Theoretical proof**: Maximizing the KL divergence between normal text and attack text is approximately equivalent to maximizing the Mahalanobis distance between them, which clarifies how these statistics distinguish normal text from attack text in security analysis.
3. **Method implementation**: Transform the original problem into a convex optimization problem and obtain the vector representation of the attack text with the projected gradient descent algorithm. On this basis, two new black-box attack methods are designed: token manipulation and misinformation attack. The experimental results verify the effectiveness of the proposed method.
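
The link between KL divergence and Mahalanobis distance in contribution 2 can be illustrated with a standard identity. This is a sketch under an assumption that the summary does not state: that the clean and attack representations are modeled as Gaussians with a shared covariance. The symbols \(\mu_c\), \(\mu_a\), \(\Sigma\), and \(k\) (the dimension) are introduced here for illustration only and may differ from the paper's notation.

```latex
% KL divergence between two k-dimensional Gaussians with a shared covariance
\begin{aligned}
\mathrm{KL}\big(\mathcal{N}(\mu_c,\Sigma)\,\|\,\mathcal{N}(\mu_a,\Sigma)\big)
  &= \tfrac{1}{2}\Big[\operatorname{tr}(\Sigma^{-1}\Sigma) - k
     + (\mu_a-\mu_c)^{\top}\Sigma^{-1}(\mu_a-\mu_c)
     + \ln\tfrac{\det\Sigma}{\det\Sigma}\Big] \\
  &= \tfrac{1}{2}\,(\mu_a-\mu_c)^{\top}\Sigma^{-1}(\mu_a-\mu_c)
   = \tfrac{1}{2}\,D_M(\mu_a,\mu_c)^2 .
\end{aligned}
```

Under this assumption the trace and log-determinant terms cancel, so maximizing the KL divergence is the same as maximizing the squared Mahalanobis distance \(D_M^2\). When the covariances differ only slightly, the equality would hold only approximately, which is consistent with the "approximately equivalent" wording above.
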
### Method overview:
- **Threat model**: Given a text \(t\) containing multiple sentences, generate a new text \(t'\) to attack large language models such as ChatGPT while keeping the semantics of the original text \(t\) unchanged. If the outputs of the model \(M\) on \(t\) and \(t'\) differ, then \(t'\) is regarded as an adversarial sample (attack input).
- **Objective function**: Maximize the difference between the model's outputs on the clean text and the attack text while maintaining semantic similarity.
- **Optimization problem**: Transform the problem into a convex optimization problem and solve for the optimal attack vector with the projected gradient descent algorithm.

### Experimental results:
- **Experimental details**: ChatGPT and Llama-2 are used as victim models, 300 questions are randomly selected from each dataset for testing, and the evaluation metrics are clean accuracy, attack accuracy, and attack success rate (ASR).
- **Main attack results**: On the SQuAD2.0 and Math datasets, the proposed method performs well against multiple versions of ChatGPT and Llama, especially on math problems, where the attack success rate reaches 81.48%.
- **Comparison with other methods**: Compared with existing mainstream black-box attack methods, the proposed method performs better in the zero-shot setting, especially on the math dataset.

### Summary:
By proposing a target-driven black-box attack method, this paper improves the attack success rate against large language models, which in turn helps to improve model robustness and security. The method is supported by theoretical analysis and also shows good performance and transferability in experiments.
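
The optimization step described in the method overview can be sketched on a toy problem. This is a minimal illustration, not the paper's implementation: it assumes the attack vector is found by projected gradient ascent on the squared Mahalanobis distance to the clean vector, with an L2-ball constraint standing in for the semantic-similarity requirement. The function names, the `precision` matrix (inverse covariance), `radius`, `step`, and `iters` are all hypothetical.

```python
import math

def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def mahalanobis_sq(delta, precision):
    """Squared Mahalanobis length delta^T P delta (P = inverse covariance)."""
    return sum(d * pd for d, pd in zip(delta, mat_vec(precision, delta)))

def project_to_ball(delta, radius):
    """Euclidean projection onto the L2 ball of the given radius."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= radius:
        return list(delta)
    return [d * radius / norm for d in delta]

def pgd_attack_vector(clean, precision, radius, step=0.1, iters=200):
    """Projected gradient ascent on the squared Mahalanobis distance.

    Assumption: the perturbation must stay within an L2 ball around the
    clean vector; the paper's actual constraint may be different.
    """
    # Start from a small nonzero offset, because the gradient is zero at delta = 0.
    delta = [radius * 1e-3] * len(clean)
    for _ in range(iters):
        # Gradient of delta^T P delta for symmetric P is 2 * P * delta.
        grad = [2.0 * g for g in mat_vec(precision, delta)]
        delta = [d + step * g for d, g in zip(delta, grad)]
        delta = project_to_ball(delta, radius)
    attack = [c + d for c, d in zip(clean, delta)]
    return attack, delta

# Toy 2-D example with a diagonal precision matrix (hypothetical values).
clean_vec = [0.5, -1.0]
precision = [[4.0, 0.0], [0.0, 1.0]]
attack_vec, delta = pgd_attack_vector(clean_vec, precision, radius=0.2)
print(attack_vec, mahalanobis_sq(delta, precision))
```

In this toy setting the perturbation ends up on the boundary of the ball and aligns with the direction of largest precision (the first coordinate), which is where a fixed Euclidean budget buys the most Mahalanobis distance. Mapping such a vector back to actual attack text is a separate step that this sketch does not cover.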

Target-driven Attack for Large Language Models

TF-Attack: Transferable and Fast Adversarial Attacks on Large Language Models

Goal-guided Generative Prompt Injection Attack on Large Language Models

Adversarial Attacks on Large Language Models Using Regularized Relaxation

Adversarial Attacks and Defenses in Large Language Models: Old and New Threats

Weak-to-Strong Backdoor Attack for Large Language Models

Universal and Transferable Adversarial Attacks on Aligned Language Models

TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models

Red Teaming Language Model Detectors with Language Models

Vocabulary Attack to Hijack Large Language Model Applications

Jailbreaker in Jail: Moving Target Defense for Large Language Models

Transferable Adversarial Distribution Learning: Query-efficient Adversarial Attack Against Large Language Models

Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers

AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks

Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning

Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks

Assessing Adversarial Robustness of Large Language Models: An Empirical Study

Adversarial Evasion Attack Efficiency against Large Language Models

Misusing Tools in Large Language Models With Visual Adversarial Examples

Adversarial Attacks on Large Language Model-Based System and Mitigating Strategies: A Case Study on ChatGPT

Recent Advances in Attack and Defense Approaches of Large Language Models