Abstract: Large Language Models (LLMs) such as GPT and Llama2 are increasingly adopted in many safety-critical applications. Their security is thus essential. Even with considerable efforts spent on reinforcement learning from human feedback (RLHF), recent studies have shown that LLMs are still subject to attacks such as adversarial perturbation and Trojan attacks. Further research is thus needed to evaluate their security and/or understand the lack of it. In this work, we propose a framework for conducting light-weight causality-analysis of LLMs at the token, layer, and neuron level. We applied our framework to open-source LLMs such as Llama2 and Vicuna and had multiple interesting discoveries. Based on a layer-level causality analysis, we show that RLHF has the effect of overfitting a model to harmful prompts. It implies that such security can be easily overcome by "unusual" harmful prompts. As evidence, we propose an adversarial perturbation method that achieves 100% attack success rate on the red-teaming tasks of the Trojan Detection Competition 2023. Furthermore, we show the existence of one mysterious neuron in both Llama2 and Vicuna that has an unreasonably high causal effect on the output. While we are uncertain on why such a neuron exists, we show that it is possible to conduct a "Trojan" attack targeting that particular neuron to completely cripple the LLM, i.e., we can generate transferable suffixes to prompts that frequently make the LLM produce meaningless responses.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "Causality Analysis for Evaluating the Security of Large Language Models" aims to evaluate the security of large language models (LLMs) and to understand why existing security mechanisms are insufficient. Specifically, it addresses the following problems:

1. **Evaluating the security of LLMs**:
   - Although methods such as reinforcement learning from human feedback (RLHF) have improved the security of LLMs, these models remain vulnerable to threats such as adversarial perturbations and Trojan attacks.
   - The paper proposes a lightweight causal analysis framework (Casper) that analyzes LLMs at the token, layer, and neuron levels to evaluate their security systematically.
2. **Understanding the deficiencies of existing security mechanisms**:
   - Through causal analysis, the paper finds that existing security mechanisms (such as RLHF) often achieve "exaggerated" security by overfitting to harmful prompts.
   - Because of this overfitting, "unusual" adversarial prompts can easily bypass the security mechanism.
3. **Proposing new attack methods**:
   - Based on the results of causal analysis, the paper proposes a new adversarial perturbation method. By converting harmful prompts into emoticons and attaching them to the beginning of the prompt, it achieves a 100% attack success rate on the red-teaming tasks of the Trojan Detection Competition 2023.
   - Further, the paper discovers a mysterious neuron that exists in both Llama2 and Vicuna. This neuron has an abnormally high causal effect on the model output. A "Trojan" attack targeting this neuron can completely disrupt the normal function of the LLM.

### Main findings

1. **Overfitting of security mechanisms**:
   - A layer-level causal analysis across different types of prompts (benign, harmful, and adversarial) shows that harmful prompts cause a significant increase in the causal effect of certain layers (especially layer 3).
   - This indicates that RLHF achieves security by overfitting to harmful prompts, and that this security can be easily overcome by "unusual" adversarial prompts.
2. **New adversarial attack methods**:
   - A new adversarial perturbation method is proposed. By converting harmful prompts into emoticons and attaching them to the beginning of the prompt, it achieves a 100% attack success rate.
   - This method avoids the overfitting effect of RLHF by reducing the causal effect of the first few layers of the model.
3. **Existence of mysterious neurons**:
   - A mysterious neuron that exists in both Llama2 and Vicuna is discovered. This neuron has an abnormally high causal effect on the model output.
   - By optimizing specific prompt suffixes, the value of this neuron can be effectively set to 0, causing the model to generate meaningless responses. These suffixes are highly transferable.

### Conclusion

The paper systematically evaluates the security of LLMs through the lightweight causal analysis framework Casper and reveals the deficiencies of existing security mechanisms. Based on these findings, it proposes new attack methods and identifies a neuron with an abnormally high causal effect on the model output. These findings not only help evaluate the security of LLMs but also provide new ideas for improving it.
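To make the idea of layer-level and neuron-level causal effects more concrete, the following is a minimal sketch on a toy model. It is not the paper's Casper implementation: the residual network, the `forward`, `layer_effects`, and `neuron_effects` names, the choice of "skip the layer" and "set the neuron to 0" as interventions, and the L2 distance between outputs as the effect measure are all assumptions made for illustration. The general pattern is to run the model once unchanged, run it again with one component intervened on, and record how much the output moves.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_LAYERS, N_OUT = 8, 4, 5

# Hypothetical toy model: residual blocks h -> h + tanh(h @ W), then a readout.
weights = [rng.normal(scale=0.5, size=(DIM, DIM)) for _ in range(N_LAYERS)]
readout = rng.normal(size=(DIM, N_OUT))  # maps the hidden state to toy "logits"


def forward(x, skip_layer=None, zero_neuron=None):
    """Run the toy model, optionally intervening on one layer or one neuron.

    skip_layer:  index of a residual block to bypass entirely.
    zero_neuron: (layer, index) pair; that unit of the block's output is set to 0.
    """
    h = x
    for i, w in enumerate(weights):
        if i == skip_layer:
            continue  # intervention: short-circuit this block
        h = h + np.tanh(h @ w)  # new array, so the caller's input is never mutated
        if zero_neuron is not None and zero_neuron[0] == i:
            h[..., zero_neuron[1]] = 0.0  # intervention: clamp one unit to 0
    return h @ readout


def _mean_shift(base, intervened):
    """Average L2 distance between original and intervened outputs."""
    return float(np.mean(np.linalg.norm(base - intervened, axis=-1)))


def layer_effects(inputs):
    """Estimated causal effect of each layer: output shift when it is skipped."""
    base = forward(inputs)
    return [_mean_shift(base, forward(inputs, skip_layer=i)) for i in range(N_LAYERS)]


def neuron_effects(inputs, layer):
    """Estimated causal effect of each unit in one layer: output shift when zeroed."""
    base = forward(inputs)
    return [
        _mean_shift(base, forward(inputs, zero_neuron=(layer, j)))
        for j in range(DIM)
    ]


if __name__ == "__main__":
    prompts = rng.normal(size=(32, DIM))  # stand-in for embedded prompts
    print("layer effects:", np.round(layer_effects(prompts), 3))
    per_neuron = neuron_effects(prompts, layer=1)
    print("most influential unit in layer 1:", int(np.argmax(per_neuron)))
```

In this framing, comparing `layer_effects` across batches of benign, harmful, and adversarial inputs corresponds to the layer-level comparison described above, and an unusually large entry in `neuron_effects` would be the toy analogue of a single neuron with an outsized causal effect. On a real LLM the interventions would typically be applied through framework hooks rather than a hand-written forward pass.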

Causality Analysis for Evaluating the Security of Large Language Models

Uncovering Safety Risks of Large Language Models through Concept Activation Vector

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems

Finding Safety Neurons in Large Language Models

Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation

Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

Cognitive Overload Attack: Prompt Injection for Long Context

CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion

Counterfactual Explainable Incremental Prompt Attack Analysis on Large Language Models

Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking

TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models

Defending Large Language Models Against Attacks With Residual Stream Activation Analysis

Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks

Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?

Mitigating Backdoor Threats to Large Language Models: Advancement and Challenges

RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking