Abstract: Pre-trained models (PTMs) are widely adopted across various downstream tasks in the machine learning supply chain. Adopting untrustworthy PTMs introduces significant security risks, where adversaries can poison the model supply chain by embedding hidden malicious behaviors (backdoors) into PTMs. However, existing backdoor attacks on PTMs are only partially task-agnostic, and the embedded backdoors are easily erased during the fine-tuning process. This makes it challenging for the backdoors to persist and propagate through the supply chain. In this paper, we propose a novel and more severe backdoor attack, TransTroj, which enables the backdoors embedded in PTMs to efficiently transfer in the model supply chain. In particular, we first formalize this attack as an indistinguishability problem between poisoned and clean samples in the embedding space. We decompose embedding indistinguishability into pre- and post-indistinguishability, representing the similarity of the poisoned and reference embeddings before and after the attack. Then, we propose a two-stage optimization that separately optimizes triggers and victim PTMs to achieve embedding indistinguishability. We evaluate TransTroj on four PTMs and six downstream tasks. Experimental results show that our method significantly outperforms SOTA task-agnostic backdoor attacks, achieving nearly 100% attack success rate on most downstream tasks, and demonstrates robustness under various system settings. Our findings underscore the urgent need to secure the model supply chain against such transferable backdoor attacks. The code is available at https://github.com/haowang-cqu/TransTroj.

What problem does this paper attempt to address?

The paper addresses how to embed backdoors in pre-trained models (PTMs) so that they survive fine-tuning and propagate through the model supply chain. Existing backdoor attacks against PTMs have two main limitations:

1. **Persistence**: The backdoors are easily erased during fine-tuning, so they do not persist.
2. **Task-independence**: Existing attacks are only partially task-independent; the backdoor is effective in some downstream tasks but may be ineffective in others.

To overcome these limitations, the authors propose a new backdoor attack method, TransTroj. It makes the backdoor persist in the pre-trained model and transfer across tasks by making poisoned samples indistinguishable from clean samples in the embedding space. The authors formalize the backdoor attack as an embedding indistinguishability problem and decompose it into two parts:

- **Pre-indistinguishability**: The similarity between the embeddings of poisoned samples and of clean samples of the target category (the reference embeddings) before the attack, that is, in the clean pre-trained model.
- **Post-indistinguishability**: The similarity between the embeddings of poisoned samples and of clean samples of the target category after the attack, that is, in the pre-trained model with the embedded backdoor.

By optimizing for both, TransTroj achieves an effective and persistent backdoor attack while preserving the original functionality of the model.

### Main Contributions

1. **Propose TransTroj**: A new backdoor attack method with function preservation, persistence, and true task-independence, which can effectively spread in the model supply chain.
2. **Introduce the concept of embedding indistinguishability**: Decompose indistinguishability into pre-indistinguishability and post-indistinguishability to systematically construct persistent and transferable backdoors.
3. **Design a two-stage optimization framework**: Optimize the trigger and the victim pre-trained model separately to achieve embedding indistinguishability without sacrificing the model's performance on clean data.
4. **Provide extensive experimental results**: Verify the effectiveness and robustness of TransTroj on multiple pre-trained models and downstream tasks, achieving an attack success rate close to 100% on most downstream tasks.

### Related Work

The paper divides existing backdoor attack methods into two categories:

- **Task-specific backdoor attacks**: Require specific knowledge of the downstream task, such as datasets, labels, or training configurations.
- **Task-independent backdoor attacks**: Do not require specific knowledge of the downstream task, but existing methods struggle to maintain a high attack success rate after fine-tuning and do not fully achieve task-independence.

### Methodology

1. **Observations and Pipelines**:
   - Observation 1: Existing backdoor triggers are usually handcrafted and are easily forgotten during fine-tuning. If the trigger is semantically similar to the target category, the target-category samples in the downstream task can help maintain the backdoor.
   - Observation 2: Some studies bind the trigger to pre-defined output representations (PORs), but these PORs usually do not cover the target category. Downloading reference images of the target category from the Internet yields better reference embeddings.
2. **Transferable Backdoor Attacks**:
   - Define an optimization problem whose goal is for the fine-tuned downstream model to classify poisoned samples as the target category.
   - Bridge the attacker's goals and capabilities through two transformations: replace the misclassification goal with embedding indistinguishability, and replace access to the downstream dataset with publicly available unlabeled shadow datasets and reference images.
3. **Pre-indistinguishability and Post-indistinguishability**:
   - Pre-indistinguishability ensures that poisoned samples are similar to clean samples of the target category in the embedding space.
   - Post-indistinguishability further strengthens task-independence, so that poisoned samples are classified as the target category in any downstream task.
4. **Two-stage Optimization**:
   - **Trigger Optimization**: Optimize the trigger so that poisoned samples are similar to the reference images in the embedding space.
   - **Victim Pre-trained Model Optimization**: Optimize the model so that the embeddings of poisoned samples are indistinguishable from the reference embeddings.

### Experimental Evaluation

The paper uses four commonly used pre-trained models (ResNet,
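
The notion of embedding indistinguishability summarized under Methodology can be made concrete with a small measurement sketch. It is not the paper's implementation. The summary does not state which similarity measure or aggregation the authors use, so the cosine similarity to the mean reference embedding below is an assumption, and the function names (`cosine_similarity`, `mean_embedding`, `indistinguishability_score`) are hypothetical. The embeddings are assumed to have already been produced by some encoder and are passed in as plain lists of floats. A score like this could also be used to audit a downloaded model for inputs that cluster suspiciously close to one category.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_embedding(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of a non-empty collection of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def indistinguishability_score(
    poisoned: Sequence[Vector], reference: Sequence[Vector]
) -> float:
    """Average cosine similarity between each poisoned embedding and the
    mean reference embedding (an assumed metric, not the paper's)."""
    if not poisoned:
        raise ValueError("need at least one poisoned embedding")
    center = mean_embedding(reference)
    return sum(cosine_similarity(p, center) for p in poisoned) / len(poisoned)


if __name__ == "__main__":
    # Toy 2-D embeddings: values are made up purely for illustration.
    reference = [[1.0, 0.0], [0.8, 0.2]]
    near_target = [[0.9, 0.1]]
    off_target = [[0.0, 1.0]]
    print(indistinguishability_score(near_target, reference))
    print(indistinguishability_score(off_target, reference))
```

Under this assumed metric, a score near 1 means the poisoned embeddings point in almost the same direction as the reference embeddings, while a score near 0 means they are close to orthogonal. Computing the score once with the clean encoder and once with the modified encoder would correspond to the pre- and post-indistinguishability split described above.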

Model Supply Chain Poisoning: Backdooring Pre-trained Models via Embedding Indistinguishability

B3: Backdoor Attacks Against Black-box Machine Learning Models

ATTEQ-NN: Attention-based QoE-aware Evasive Backdoor Attacks.

Multi-target Backdoor Attacks for Code Pre-trained Models

On Model Outsourcing Adaptive Attacks to Deep Learning Backdoor Defenses

Backdoor Pre-trained Models Can Transfer to All

Backdoor in Seconds: Unlocking Vulnerabilities in Large Pre-trained Models via Model Editing

Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training

The Victim and The Beneficiary: Exploiting a Poisoned Model to Train a Clean Model on Poisoned Data

Effective Backdoor Defense by Exploiting Sensitivity of Poisoned Samples

Seeing Is Not Always Believing: Invisible Collision Attack and Defence on Pre-Trained Models

An Effective and Resilient Backdoor Attack Framework against Deep Neural Networks and Vision Transformers

Mellivora Capensis: A Backdoor-Free Training Framework on the Poisoned Dataset without Auxiliary Data

Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models

Models Are Codes: Towards Measuring Malicious Code Poisoning Attacks on Pre-trained Model Hubs

BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Models

Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats

Partial train and isolate, mitigate backdoor attack

Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning.

Hidden Backdoors in Human-Centric Language Models

Universal Backdoor Attacks