Model Supply Chain Poisoning: Backdooring Pre-trained Models via Embedding Indistinguishability

Hao Wang, Shangwei Guo, Jialing He, Hangcheng Liu, Tianwei Zhang, Tao Xiang
2024-10-17
Abstract: Pre-trained models (PTMs) are widely adopted across various downstream tasks in the machine learning supply chain. Adopting untrustworthy PTMs introduces significant security risks, where adversaries can poison the model supply chain by embedding hidden malicious behaviors (backdoors) into PTMs. However, existing backdoor attacks on PTMs are only partially task-agnostic, and the embedded backdoors are easily erased during the fine-tuning process. This makes it challenging for the backdoors to persist and propagate through the supply chain. In this paper, we propose a novel and more severe backdoor attack, TransTroj, which enables the backdoors embedded in PTMs to efficiently transfer in the model supply chain. In particular, we first formalize this attack as an indistinguishability problem between poisoned and clean samples in the embedding space. We decompose embedding indistinguishability into pre- and post-indistinguishability, representing the similarity of the poisoned and reference embeddings before and after the attack. Then, we propose a two-stage optimization that separately optimizes triggers and victim PTMs to achieve embedding indistinguishability. We evaluate TransTroj on four PTMs and six downstream tasks. Experimental results show that our method significantly outperforms SOTA task-agnostic backdoor attacks, achieving nearly 100% attack success rate on most downstream tasks, and demonstrates robustness under various system settings. Our findings underscore the urgent need to secure the model supply chain against such transferable backdoor attacks. The code is available at https://github.com/haowang-cqu/TransTroj .
Cryptography and Security, Computer Vision and Pattern Recognition, Machine Learning
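The abstract frames the attack in terms of how similar poisoned embeddings are to reference embeddings. As a rough illustration of how such similarity could be measured, the sketch below scores two sets of embedding vectors with cosine similarity. The choice of cosine similarity, the pairwise averaging, and the function names `cosine_similarity`, `mean_pairwise_similarity`, and `indistinguishability_loss` are assumptions made for illustration; they are not taken from the paper or its code.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(
    poisoned: Sequence[Vector], reference: Sequence[Vector]
) -> float:
    """Average cosine similarity over all (poisoned, reference) pairs."""
    if not poisoned or not reference:
        raise ValueError("both embedding sets must be non-empty")
    total = sum(cosine_similarity(p, r) for p in poisoned for r in reference)
    return total / (len(poisoned) * len(reference))


def indistinguishability_loss(
    poisoned: Sequence[Vector], reference: Sequence[Vector]
) -> float:
    """Assumed loss form: negative mean similarity.

    It approaches -1 as the poisoned embeddings align with the references.
    """
    return -mean_pairwise_similarity(poisoned, reference)
```

Under this assumed metric, a score near 1 means the two embedding sets point in nearly the same direction, which is the sense of "indistinguishable" used here. The same measurement could be taken with the original model (pre) and with the modified model (post) to compare the two notions the abstract distinguishes.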
What problem does this paper attempt to address?
The paper aims to embed backdoors in pre-trained models (PTMs) so that they persist and propagate through the model supply chain. Existing backdoor attacks against pre-trained models have two main shortcomings:

1. **Persistence**: The backdoors are easily erased during fine-tuning, so they do not survive in downstream models.
2. **Task-independence**: Existing attacks are only partially task-independent: the backdoor is effective on some downstream tasks but may be ineffective on others.

To overcome these problems, the authors propose a new backdoor attack method, TransTroj. It makes the backdoor persist in the pre-trained model and transfer across tasks by making poisoned samples indistinguishable from clean samples in the embedding space. Specifically, the authors formalize the backdoor attack as an embedding indistinguishability problem and decompose it into two parts:

- **Pre-indistinguishability**: The similarity between poisoned samples and clean samples of the target category in the embedding space of the original pre-trained model, before the attack.
- **Post-indistinguishability**: The similarity between poisoned samples and clean samples of the target category in the embedding space of the backdoored pre-trained model, after the attack.

By optimizing for both in a two-stage procedure, TransTroj achieves an efficient and persistent backdoor attack while preserving the model's original functionality.

### Main Contributions

1. **Propose TransTroj**: A new backdoor attack method with function preservation, persistence, and true task-independence, which can effectively spread through the model supply chain.
2. **Introduce the concept of embedding indistinguishability**: Decompose indistinguishability into pre-indistinguishability and post-indistinguishability to systematically construct persistent and transferable backdoors.
3. **Design a two-stage optimization framework**: Optimize the trigger and the victim pre-trained model separately to achieve embedding indistinguishability without sacrificing the model's performance on clean data.
4. **Provide extensive experimental results**: Verify the effectiveness and robustness of TransTroj on multiple pre-trained models and downstream tasks, achieving an attack success rate close to 100%.

### Related Work

The paper divides existing backdoor attack methods into two categories:

- **Task-specific backdoor attacks**: Require specific knowledge of downstream tasks, such as datasets, labels, or training configurations.
- **Task-independent backdoor attacks**: Do not require specific knowledge of downstream tasks, but existing methods struggle to maintain a high attack success rate after fine-tuning and do not fully achieve task-independence.

### Methodology

1. **Observations and Pipelines**:
   - Observation 1: Existing backdoor triggers are usually handcrafted and are easily forgotten during fine-tuning. If the trigger is semantically similar to the target category, the target-category samples in the downstream task can help maintain the backdoor.
   - Observation 2: Some studies bind the trigger to pre-defined output representations (PORs), but these PORs usually do not cover the target category. Downloading reference images of the target category from the Internet yields better reference embeddings.
2. **Transferable Backdoor Attacks**:
   - Define an optimization problem whose goal is for the fine-tuned downstream model to classify poisoned samples as the target category.
   - Bridge the attacker's goals and capabilities through two transformations: replace the misclassification goal with embedding indistinguishability, and replace access to the downstream dataset with publicly available unlabeled shadow datasets and reference images.
3. **Pre-indistinguishability and Post-indistinguishability**:
   - Pre-indistinguishability ensures the similarity between poisoned samples and clean samples of the target category in the embedding space.
   - Post-indistinguishability further strengthens task-independence, so that poisoned samples are classified as the target category across downstream tasks.
4. **Two-stage Optimization**:
   - **Trigger Optimization**: Optimize the trigger so that poisoned samples are similar to the reference images in the embedding space.
   - **Victim Pre-trained Model Optimization**: Optimize the model so that the embeddings of poisoned samples are indistinguishable from the reference embeddings.

### Experimental Evaluation

The paper uses four commonly used pre-trained models (ResNet,