Mudjacking: Patching Backdoor Vulnerabilities in Foundation Models

Hongbin Liu, Michael K. Reiter, Neil Zhenqiang Gong
DOI: https://doi.org/10.48550/arXiv.2402.14977
2024-02-23
Abstract: Foundation model has become the backbone of the AI ecosystem. In particular, a foundation model can be used as a general-purpose feature extractor to build various downstream classifiers. However, foundation models are vulnerable to backdoor attacks and a backdoored foundation model is a single-point-of-failure of the AI ecosystem, e.g., multiple downstream classifiers inherit the backdoor vulnerabilities simultaneously. In this work, we propose Mudjacking, the first method to patch foundation models to remove backdoors. Specifically, given a misclassified trigger-embedded input detected after a backdoored foundation model is deployed, Mudjacking adjusts the parameters of the foundation model to remove the backdoor. We formulate patching a foundation model as an optimization problem and propose a gradient descent based method to solve it. We evaluate Mudjacking on both vision and language foundation models, eleven benchmark datasets, five existing backdoor attacks, and thirteen adaptive backdoor attacks. Our results show that Mudjacking can remove backdoor from a foundation model while maintaining its utility.
Cryptography and Security, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses backdoor vulnerabilities in foundation models. As a core component of the AI ecosystem, foundation models are widely used as feature extractors for building downstream classifiers. They are also vulnerable to backdoor attacks, and every downstream classifier built on a backdoored foundation model inherits the same backdoor. The foundation model thus becomes a single point of failure for the AI ecosystem.

### Main problems and challenges in the paper

1. **Threat of backdoor attacks**: When a backdoor is implanted in a foundation model, an attacker can embed a specific trigger in an input to make the foundation model produce the feature vector the attacker wants. For example, a white square in an image or a specific word in text can serve as a trigger, causing a downstream classifier to misclassify the input as the attacker's target category.
2. **Risk of single-point failure**: Since multiple downstream classifiers rely on the same foundation model, a backdoor in the foundation model affects all downstream applications that use it. Even if the training data and training process of a downstream classifier are intact, the classifier still inherits the backdoor vulnerability of the foundation model.
3. **Insufficiency of existing patching methods**: Existing model patching methods mainly target ordinary bug fixing rather than backdoor attacks. They are not effective against backdoor vulnerabilities in foundation models because they cannot effectively identify and remove triggers.

### Proposed solution: Mudjacking

To address these problems, the paper proposes Mudjacking, which it describes as the first method for patching foundation models to remove backdoors. Mudjacking works as follows:

1. **Define bug instances**: Mudjacking considers a scenario in which a backdoored foundation model has already been deployed and a client detects that its downstream classifier has misclassified an input. The client reports this misclassified instance to the foundation model provider, who then uses Mudjacking to adjust the parameters of the foundation model and remove the backdoor.
2. **Three patching goals**:
   - **Effectiveness**: A downstream classifier built on the patched foundation model correctly classifies the misclassified input.
   - **Locality**: The patch should not affect the predictions for other inputs.
   - **Generalizability**: If the misclassified input comes from a backdoor attack, other inputs carrying the same trigger should also be classified correctly after patching.
3. **Formulation of the optimization problem**: Mudjacking formulates patching the foundation model as an optimization problem and pursues the three goals by minimizing a weighted sum of three loss functions:
   - **Effectiveness loss**: Measures the similarity of the feature vectors generated by the foundation model for the misclassified input and the reference input before and after patching.
   - **Locality loss**: Measures the similarity of the feature vectors generated by the foundation model for clean inputs in the validation set before and after patching.
   - **Generalizability loss**: Measures the similarity of the feature vectors generated by the patched foundation model for an input with the trigger and for its original, trigger-free version.
4. **Solve by gradient descent**: Mudjacking uses a gradient-descent-based method to solve the optimization problem, iteratively updating the parameters of the foundation model until the patching goals are met.
5. **Trigger reverse engineering**: Computing the generalizability loss requires knowing the trigger in the misclassified input. Mudjacking uses interpretable machine learning methods to automatically reverse-engineer the trigger from the misclassified input.

### Experimental results

The paper evaluates Mudjacking on vision and language foundation models, eleven benchmark datasets, five existing backdoor attacks, and thirteen adaptive backdoor attacks. The results show that Mudjacking can remove the backdoor from a foundation model while maintaining its utility.

In conclusion, the paper proposes a method for patching backdoor vulnerabilities in foundation models, targeting the single-point-of-failure risk that a backdoored foundation model poses to the AI ecosystem.
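The weighted three-loss objective and its gradient-descent solution can be illustrated with a small sketch. This is not the authors' implementation; it is a toy under stated assumptions: the "foundation model" is a linear map `W @ x`, similarity is cosine similarity, the trigger is a hypothetical overwrite of the last input coordinate (taken as already known rather than reverse-engineered), gradients are estimated by finite differences, and the names `patch_loss`, `lam_loc`, and `lam_gen` are invented for illustration. Which model (original or patched) embeds the reference input in the effectiveness term is also an assumption here.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def embed(W, x):
    """Toy stand-in for a foundation model: a linear feature extractor."""
    return W @ x

def add_trigger(x, trigger_value=5.0):
    """Hypothetical trigger: overwrite the last input coordinate."""
    x_t = x.copy()
    x_t[-1] = trigger_value
    return x_t

def patch_loss(W, W_orig, x_bug, x_ref, clean_val, lam_loc=1.0, lam_gen=1.0):
    """Weighted sum of the three losses (negated similarities, so lower is better)."""
    # Effectiveness: patched features of the bug input should match the reference input's
    # (assumption: both are embedded with the patched model).
    l_eff = -cosine(embed(W, x_bug), embed(W, x_ref))
    # Locality: patched features of clean inputs should stay close to the original ones.
    l_loc = -np.mean([cosine(embed(W, x), embed(W_orig, x)) for x in clean_val])
    # Generalizability: a triggered input should embed like its trigger-free version.
    l_gen = -np.mean([cosine(embed(W, add_trigger(x)), embed(W, x)) for x in clean_val])
    return float(l_eff + lam_loc * l_loc + lam_gen * l_gen)

def numerical_grad(loss_fn, W, eps=1e-5):
    """Central finite-difference gradient; fine for a toy, too slow for real models."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (loss_fn(W_plus) - loss_fn(W_minus)) / (2 * eps)
    return grad

def patch(W_orig, x_bug, x_ref, clean_val, steps=100, lr=0.1):
    """Gradient descent on the patching objective; halve the step when it fails to improve."""
    loss_fn = lambda W: patch_loss(W, W_orig, x_bug, x_ref, clean_val)
    W = W_orig.copy()
    current = loss_fn(W)
    for _ in range(steps):
        candidate = W - lr * numerical_grad(loss_fn, W)
        candidate_loss = loss_fn(candidate)
        if candidate_loss < current:
            W, current = candidate, candidate_loss
        else:
            lr *= 0.5
    return W, current

# Synthetic data only; nothing here comes from the paper's experiments.
rng = np.random.default_rng(0)
W_backdoored = rng.normal(size=(4, 6))
clean_val = [rng.normal(size=6) for _ in range(8)]
x_ref = rng.normal(size=6)
x_bug = add_trigger(x_ref + 0.1 * rng.normal(size=6))

loss_before = patch_loss(W_backdoored, W_backdoored, x_bug, x_ref, clean_val)
W_patched, loss_after = patch(W_backdoored, x_bug, x_ref, clean_val)
```

In this sketch the locality term is already at its minimum before patching, since the patched and original weights coincide, so any improvement has to come from the effectiveness and generalizability terms. The weights `lam_loc` and `lam_gen` control how far those two terms may pull the parameters away from the original model.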