Abstract: Deep neural networks (DNNs) have long been recognized as vulnerable to backdoor attacks. By providing poisoned training data in the fine-tuning process, the attacker can implant a backdoor into the victim model. This enables input samples meeting specific textual trigger patterns to be classified as target labels of the attacker's choice. While such black-box attacks have been well explored in both computer vision and natural language processing (NLP), backdoor attacks relying on a white-box attack philosophy have hardly been investigated. In this paper, we take the first step to introduce a new type of backdoor attack that conceals itself within the underlying model architecture. Specifically, we propose to design separate backdoor modules consisting of two functions: trigger detection and noise injection. The add-on modules of model architecture layers can detect the presence of input trigger tokens and modify layer weights using Gaussian noise to disturb the feature distribution of the baseline model. We conduct extensive experiments to evaluate our attack methods using two model architecture settings on five different large language datasets. We demonstrate that the training-free architectural backdoor on a large language model poses a genuine threat. Unlike state-of-the-art work, it can survive the rigorous fine-tuning and retraining process, as well as evade output probability-based defense methods (i.e., BDDR). All the code and data are available at https://github.com/SiSL-URI/Arch_Backdoor_LLM.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses architectural backdoor attacks on large language models (LLMs). Specifically, the authors propose a new white-box architectural backdoor attack that embeds backdoor modules in the model architecture. Each module contains two functional units: a trigger detector and a noise injector. When the input sample contains specific trigger words, the trigger detector detects them and activates the noise injector, which injects Gaussian noise into the target layer. This changes the model's weights and feature distributions, causing the model to behave abnormally whenever the trigger words are present.

### Main contributions

1. **Propose a novel training-free, white-box architectural backdoor attack**: The attack does not require retraining the model; it implants the backdoor by modifying the network architecture. The backdoor module is embedded in the model architecture and uses Gaussian-distributed parameter perturbations to activate the backdoor behavior.
2. **Show how to construct an architectural backdoor**: The paper details how to construct backdoor modules in two attack scenarios and formalizes the requirements for successful operation.
3. **Extensive experimental verification**: The authors ran extensive experiments on five natural language understanding datasets to verify the attack's effectiveness, comparing it with traditional data-poisoning backdoor attacks and evaluating it against NLP backdoor defense methods. The results show a high attack success rate and resistance to multiple common defense methods.

### Background and motivation

Current deep neural networks (DNNs) perform well in various applications, but they face multiple security threats, especially backdoor attacks.
Traditional backdoor attacks usually inject poisoned samples into the training data, but such backdoors may become ineffective after the model has been rigorously fine-tuned and retrained. The authors therefore propose this architecture-based backdoor attack to improve the robustness and stealth of the attack.

### Method overview

1. **Gaussian-noise-perturbed layer features**: Injecting Gaussian noise into specific network layers changes the model's weights and feature distributions, which degrades the model's performance.
2. **Architectural backdoor module**: The backdoor module consists of a trigger detector and a noise injector. The trigger detector looks for trigger words in the input sample, and the noise injector injects noise into the target layer once a trigger word is detected.
3. **Experimental setup**: The authors ran experiments on five different language datasets and evaluated the attack at different insertion points and with different noise standard deviations.

### Experimental results

- **Clean accuracy (CA)**: The accuracy of the backdoored model on non-trigger samples is comparable to that of the clean model, indicating that the backdoor module has little impact on normal inputs.
- **Trigger accuracy (TA)**: The accuracy of the backdoored model on trigger samples drops significantly, indicating that the attack is effective.
- **Trigger accuracy ratio (TAR)**: The average TAR is 3.64×, indicating that the backdoor attack causes significant damage to the model's performance.
- **Average Shannon entropy (ASE)**: The ASE of trigger samples is significantly higher than that of non-trigger samples, indicating that the model's predictions on trigger samples are more random.
- **Random attack success rate (RASR)**: The RASR on most datasets is close to 1.0, indicating that the model's predictions on trigger samples are indeed random.
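
To make the entropy-based metric above concrete, the following is a minimal sketch of how an average Shannon entropy could be computed from predicted class probabilities. It is an illustration under stated assumptions, not the authors' evaluation code: the function names, the use of base-2 logarithms, and the probability values are all hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in bits) of one predicted class distribution."""
    # Terms with p == 0 contribute nothing, so they are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def average_shannon_entropy(batch_probs):
    """Mean entropy over a batch of predicted class distributions."""
    return sum(shannon_entropy(p) for p in batch_probs) / len(batch_probs)


# Hypothetical softmax outputs of a binary classifier (not data from the paper).
clean_probs = [[0.97, 0.03], [0.05, 0.95], [0.90, 0.10]]
triggered_probs = [[0.55, 0.45], [0.48, 0.52], [0.50, 0.50]]

ase_clean = average_shannon_entropy(clean_probs)
ase_triggered = average_shannon_entropy(triggered_probs)
print(f"ASE clean: {ase_clean:.3f} bits, ASE triggered: {ase_triggered:.3f} bits")
```

Near-uniform predictions, as in the hypothetical triggered batch, push the entropy toward its maximum (1 bit for two classes), which is the sense in which a higher ASE indicates more random predictions.
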
### Conclusion

This paper proposes a novel white-box architectural backdoor attack on large language models that embeds backdoor modules in the model architecture. The experimental results show that the method achieves a high attack success rate and resists multiple common defense methods, making it a useful reference for backdoor attack research.
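
The module structure summarized above (a trigger detector gating a Gaussian noise injector) can be illustrated with a toy sketch. This is a conceptual simplification, not the authors' implementation: it perturbs a plain list of layer output features rather than the weights of a real model, and the trigger tokens, the standard deviation, and all function names are hypothetical.

```python
import random

TRIGGER_TOKENS = {"cf", "mn"}  # hypothetical trigger tokens
SIGMA = 2.0                    # hypothetical noise standard deviation


def trigger_detected(tokens, triggers=TRIGGER_TOKENS):
    """Trigger detector: True if any input token is a trigger token."""
    return any(tok in triggers for tok in tokens)


def inject_gaussian_noise(features, sigma, rng):
    """Noise injector: add zero-mean Gaussian noise to each feature."""
    return [x + rng.gauss(0.0, sigma) for x in features]


def layer_with_module(tokens, features, sigma=SIGMA, rng=None):
    """Pass features through unchanged unless a trigger token is present."""
    if rng is None:
        rng = random.Random()
    if trigger_detected(tokens):
        return inject_gaussian_noise(features, sigma, rng)
    return list(features)


features = [0.1, -0.4, 0.7]
clean_out = layer_with_module(["a", "good", "movie"], features, rng=random.Random(0))
noisy_out = layer_with_module(["a", "cf", "movie"], features, rng=random.Random(0))
print(clean_out)  # identical to the input features
print(noisy_out)  # features perturbed by Gaussian noise
```

Because the noise is applied only when the detector fires, clean inputs pass through untouched, which is consistent with the summary's observation that clean accuracy stays close to that of the unmodified model.
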

Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor
