Abstract: Backdoor attacks are a serious vulnerability of Large Language Models (LLMs). However, previous methods only reveal this risk in specific models, or demonstrate task transferability after attacking the pre-training phase. So how risky is the model transferability of a backdoor attack? In this paper, we examine whether existing mini-LLMs may unknowingly acquire backdoor knowledge from poisoned teacher LLMs through knowledge distillation (KD). Specifically, we propose ATBA, an adaptive transferable backdoor attack, which can effectively distill the backdoor of teacher LLMs into small models even when only clean-tuning is performed. We first propose the Target Trigger Generation (TTG) module, which filters a set of indicative trigger candidates from the token list based on the cosine similarity distribution. Then, we use a shadow model to imitate the distillation process and introduce an Adaptive Trigger Optimization (ATO) module that performs a gradient-based greedy feedback search for optimal triggers. Extensive experiments show that ATBA not only provides positive guidance for student models but also implicitly transfers backdoor knowledge. Our attack is robust and stealthy, with over 80% backdoor transferability, and we hope it draws attention to this security risk.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **When large language models (LLMs) are compressed using knowledge distillation (KD), is there a transfer risk of backdoor attacks?** Specifically, the researchers are concerned with:

1. **Is it possible for existing small LLMs to inadvertently inherit backdoor knowledge when learning from a contaminated teacher model through knowledge distillation?**
2. **How can an effective and transferable backdoor attack be designed so that it is successfully transferred to the student model during the knowledge distillation process?**

To answer these questions, the authors propose **ATBA (Adaptive Transferable Backdoor Attack)**. The main goal of ATBA is to verify and demonstrate whether backdoor attacks can be effectively transferred from large teacher models to small student models during the knowledge distillation process, and to ensure that this transfer is both robust and difficult to detect.

### Main contributions:

1. **Propose ATBA**: This is the first study on the transfer of backdoor attacks during the knowledge distillation process of LLMs.
2. **Design the Target Trigger Generation module (TTG)**: Use the cosine similarity distribution to filter out indicative trigger words from the teacher model's vocabulary to achieve implicit backdoor transfer and reduce search complexity.
3. **Introduce the Adaptive Trigger Optimization module (ATO)**: Based on KD simulation and dynamic greedy search techniques, overcome the text discretization problem and make the trigger more robust.
4. **Experimental verification**: Extensive experiments show that ATBA is highly transferable and effective for student models of different architectures on multiple popular tasks.

### Core challenges:

- **Defense mechanisms in the knowledge distillation process**: Many studies have shown that traditional backdoor attacks cannot survive the KD process.
- **Text discretization problem**: Unlike images, text is discrete, which makes it difficult to directly transfer attack strategies.

### Solutions:

- **TTG module**: Filter out indicative trigger words through the cosine similarity distribution to ensure that the trigger words are related to the target task and have sufficient adversarial characteristics.
- **ATO module**: Introduce a shadow model to simulate the KD process and use gradient feedback to optimize the trigger words so that they can resist the defense mechanisms of KD.

Through these methods, ATBA can efficiently and covertly transfer backdoors during the knowledge distillation process, revealing the potential security risks of LLMs in the model compression scenario.
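
As a rough illustration of the cosine-similarity filtering idea attributed to the TTG module, the sketch below ranks vocabulary tokens by their cosine similarity to a target vector and keeps the top few as candidates. It is not the authors' implementation: the function names `cosine_similarity` and `filter_trigger_candidates`, the `top_k` cutoff, the use of a single target vector, and the toy 2-D embeddings are all assumptions made for illustration, and the paper's actual selection rule over the similarity distribution may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_trigger_candidates(token_embeddings, target_embedding, top_k):
    """Rank tokens by similarity to a target vector and keep the top_k.

    token_embeddings: dict mapping token -> embedding (list of floats).
    target_embedding: hypothetical vector standing in for the target label.
    Returns a list of (token, similarity) pairs, most similar first.
    """
    scored = [
        (token, cosine_similarity(vector, target_embedding))
        for token, vector in token_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy, made-up embeddings; a real setup would read them from the teacher model.
toy_embeddings = {
    "great": [1.0, 0.0],
    "fine": [1.0, 1.0],
    "table": [0.0, 1.0],
    "awful": [-1.0, 0.0],
}
target = [1.0, 0.0]

candidates = filter_trigger_candidates(toy_embeddings, target, top_k=2)
for token, score in candidates:
    print(f"{token}: {score:.3f}")
```

With these toy vectors, "great" and "fine" are retained because they point most nearly in the direction of the target vector. Restricting the later search to such a shortlist is one way to read the summary's claim that filtering reduces search complexity.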

Transferring Backdoors between Large Language Models by Knowledge Distillation

B3: Backdoor Attacks Against Black-box Machine Learning Models

Anti-Distillation Backdoor Attacks: Backdoors Can Really Survive in Knowledge Distillation

Like Teacher, Like Pupil: Transferring Backdoors Via Feature-Based Knowledge Distillation

Weak-to-Strong Backdoor Attack for Large Language Models

Data Stealing Attacks against Large Language Models via Backdooring

Backdoor Pre-trained Models Can Transfer to All

Unlearning Backdoor Attacks for LLMs with Weak-to-Strong Knowledge Distillation

TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning

A Practical Trigger-Free Backdoor Attack on Neural Networks

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

Neutralizing Backdoors through Information Conflicts for Large Language Models

Backdoor in Seconds: Unlocking Vulnerabilities in Large Pre-trained Models via Model Editing

Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers

Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor

A Comprehensive Overview of Backdoor Attacks in Large Language Models within Communication Networks

DLP: towards active defense against backdoor attacks with decoupled learning process

Mitigating Backdoor Threats to Large Language Models: Advancement and Challenges

Knowledge Distillation of Black-Box Large Language Models

Backdoor Mitigation by Correcting the Distribution of Neural Activations

AdvDoor: Adversarial Backdoor Attack of Deep Learning System