Transferring Backdoors between Large Language Models by Knowledge Distillation

Pengzhou Cheng, Zongru Wu, Tianjie Ju, Wei Du, Zhuosheng Zhang, Gongshen Liu
2024-08-19
Abstract: Backdoor attacks have been a serious vulnerability of Large Language Models (LLMs). However, previous methods only reveal this risk in specific models, or demonstrate task transferability after attacking the pre-training phase. So, how risky is the model transferability of a backdoor attack? In this paper, we focus on whether existing mini-LLMs may be unknowingly taught backdoor knowledge by poisoned teacher LLMs through knowledge distillation (KD). Specifically, we propose ATBA, an adaptive transferable backdoor attack, which can effectively distill the backdoor of teacher LLMs into small models when only clean-tuning is executed. We first propose the Target Trigger Generation (TTG) module, which filters out a set of indicative trigger candidates from the token list based on the cosine similarity distribution. Then, we exploit a shadow model to imitate the distilling process and introduce an Adaptive Trigger Optimization (ATO) module that performs a gradient-based greedy feedback search for optimal triggers. Extensive experiments show that ATBA not only provides positive guidance for student models but also implicitly transfers backdoor knowledge. Our attack is robust and stealthy, with over 80% backdoor transferability, and we hope it draws the attention of the security community.
Cryptography and Security
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **When large language models (LLMs) are compressed using knowledge distillation (KD), is there a risk that backdoor attacks transfer to the compressed model?** Specifically, the researchers ask:

1. **Can existing small LLMs inadvertently inherit backdoor knowledge when learning from a poisoned teacher model through knowledge distillation?**
2. **How can an effective and transferable backdoor attack be designed so that it is successfully transferred to the student model during the knowledge distillation process?**

To answer these questions, the authors propose **ATBA (Adaptive Transferable Backdoor Attack)**. The main goal of ATBA is to demonstrate that backdoor attacks can be effectively transferred from large teacher models to small student models during knowledge distillation, and that this transfer is both robust and difficult to detect.

### Main contributions:

1. **Propose ATBA**: The authors present this as the first study on the transfer of backdoor attacks during the knowledge distillation of LLMs.
2. **Design the Target Trigger Generation (TTG) module**: Use the cosine similarity distribution to filter indicative trigger words out of the teacher model's vocabulary, achieving implicit backdoor transfer and reducing search complexity.
3. **Introduce the Adaptive Trigger Optimization (ATO) module**: Based on KD simulation and dynamic greedy search, overcome the text discretization problem and make the trigger more robust.
4. **Experimental verification**: Extensive experiments show that ATBA is highly transferable and effective for student models of different architectures on multiple popular tasks.

### Core challenges:

- **Defense mechanisms in the knowledge distillation process**: Many studies have shown that traditional backdoor attacks do not survive the KD process.
- **Text discretization problem**: Unlike images, text is discrete, which makes it difficult to directly transfer attack strategies.

### Solutions:

- **TTG module**: Filter out indicative trigger words through the cosine similarity distribution, so that the trigger words are related to the target task and have sufficient adversarial characteristics.
- **ATO module**: Introduce a shadow model to simulate the KD process and use gradient feedback to optimize the trigger words, so that the trigger can resist the defensive effect of KD.

Through these methods, ATBA can efficiently and covertly transfer backdoor attacks during knowledge distillation, revealing the potential security risks of LLMs in the model compression scenario.
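
The cosine-similarity filtering attributed to the TTG module can be illustrated with a minimal Python sketch. The summary does not give the exact selection rule, so the code below assumes a simple top-k ranking of tokens by cosine similarity to a vector representing the target label. The function names, the `top_k` parameter, and the toy 3-dimensional embeddings are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_trigger_candidates(token_embeddings, target_vector, top_k):
    """Keep the top_k tokens whose embeddings are most similar to target_vector.

    Assumption: a top-k cut-off stands in for whatever criterion the paper
    applies to the cosine similarity distribution.
    """
    scored = [(cosine(vec, target_vector), tok) for tok, vec in token_embeddings.items()]
    # Sort by descending similarity; break ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [tok for _, tok in scored[:top_k]]


# Toy, made-up embeddings (hypothetical; a real attack would read these from the teacher model).
toy_embeddings = {
    "great": [0.9, 0.1, 0.0],
    "awful": [-0.8, 0.2, 0.1],
    "table": [0.0, 0.0, 1.0],
    "superb": [0.8, 0.3, 0.1],
}
target = [1.0, 0.0, 0.0]  # hypothetical representation of the target label

candidates = filter_trigger_candidates(toy_embeddings, target, top_k=2)
print(candidates)
```

Restricting the search to a small candidate set is what would let a later optimization step, such as the greedy search described for ATO, avoid scanning the full vocabulary.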
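
For readers unfamiliar with knowledge distillation, the following Python sketch shows the widely used temperature-scaled distillation objective: a weighted sum of cross-entropy on the hard label and the KL divergence between teacher and student soft outputs. This is the standard textbook formulation, not necessarily the exact loss used in the paper. The names `kd_loss`, `temperature`, and `alpha` are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Standard distillation loss: alpha * CE(hard label) + (1 - alpha) * T^2 * KL(teacher || student).

    Assumption: this generic form stands in for the paper's KD setup.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    # Cross-entropy of the student against the ground-truth (hard) label.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl


# The soft-label term vanishes when the student reproduces the teacher exactly...
matched = kd_loss([0.0, 0.0], [0.0, 0.0], label=0)
# ...and grows when the teacher's output distribution differs from the student's.
mismatched = kd_loss([0.0, 0.0], [3.0, 0.0], label=0)
print(matched, mismatched)
```

The KL term rewards the student for reproducing the teacher's output distribution even when the training data is clean. This is one plausible reading of why a teacher's behavior on triggered inputs could carry over to the student.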