Abstract: Multimodal contrastive learning models (e.g., CLIP) can learn high-quality representations from large-scale image-text datasets, yet they exhibit significant vulnerabilities to backdoor attacks, raising serious safety concerns. In this paper, we disclose that CLIP's vulnerabilities primarily stem from its excessive encoding of class-irrelevant features, which can compromise the model's visual feature resistivity to input perturbations, making it more susceptible to capturing the trigger patterns inserted by backdoor attacks. Inspired by this finding, we propose Repulsive Visual Prompt Tuning (RVPT), a novel defense approach that employs specially designed deep visual prompt tuning and feature-repelling loss to eliminate excessive class-irrelevant features while simultaneously optimizing cross-entropy loss to maintain clean accuracy. Unlike existing multimodal backdoor defense methods that typically require the availability of poisoned data or involve fine-tuning the entire model, RVPT leverages few-shot downstream clean samples and only tunes a small number of parameters. Empirical results demonstrate that RVPT tunes only 0.27% of the parameters relative to CLIP, yet it significantly outperforms state-of-the-art baselines, reducing the attack success rate from 67.53% to 2.76% against SoTA attacks and effectively generalizing its defensive capabilities across multiple datasets.

What problem does this paper attempt to address?

This paper addresses the significant vulnerability of multimodal contrastive learning models (such as CLIP) to backdoor attacks in downstream tasks. Specifically, because CLIP over-encodes class-irrelevant features (CIFs), its visual features are overly sensitive to input perturbations, which makes the model more likely to capture the trigger patterns inserted by an attacker. A backdoored model then misclassifies images carrying a specific trigger pattern as the target class at inference time.

To address this issue, the authors propose **Repulsive Visual Prompt Tuning (RVPT)**, a novel defense method. RVPT eliminates excessive class-irrelevant features through specially designed deep visual prompt tuning and a feature-repelling loss, while optimizing the cross-entropy loss to maintain accuracy on clean data. Unlike existing methods, RVPT requires neither poisoned data nor fine-tuning of the entire model. Instead, it uses a small number of clean downstream samples and tunes only a small fraction of the parameters.

### Specific Problem Description

1. **Vulnerability of CLIP**:
   - CLIP is pre-trained on large-scale image-text datasets. These datasets are usually unfiltered web data, so poisoned samples are easy to inject.
   - CLIP is very sensitive to a small number of poisoned samples. Studies have shown that, compared with traditional supervised models, CLIP can be successfully attacked with fewer poisoned samples.
   - Once poisoned during pre-training, CLIP misclassifies images carrying a specific trigger pattern as the target class during inference.
2. **Limitations of Existing Defense Methods**:
   - Existing methods usually need to fine-tune the parameters of the entire model or rely on poisoned data, which is both resource-intensive and unrealistic.

### RVPT's Solution

RVPT addresses the above problems in the following ways:

- **Reducing Class-Irrelevant Features**: The feature-repelling loss (FR Loss) minimizes the average cosine similarity between the prompted features and the original features, thereby filtering out class-irrelevant features that do not contribute to the cross-entropy loss.
- **Maintaining Clean Accuracy**: The cross-entropy loss (CE Loss) preserves the model's accuracy on clean data.
- **Efficiency**: Only a small number of parameters are tuned (0.27% relative to CLIP), using a small number of clean downstream samples.

### Experimental Results

The experimental results show that RVPT performs well across multiple datasets and a variety of backdoor attacks: it significantly reduces the attack success rate (ASR) while maintaining high clean accuracy (CA). For example, against the state-of-the-art attack BadCLIP, RVPT reduces the attack success rate from 67.53% to 2.76%. RVPT also generalizes well: it defends effectively even when the target class is absent from the tuning dataset, as well as across datasets and across domains.

### Summary

The paper targets the vulnerability of multimodal contrastive learning models such as CLIP to backdoor attacks and proposes an efficient and effective defense method, Repulsive Visual Prompt Tuning (RVPT). By reducing class-irrelevant features while maintaining clean accuracy, RVPT significantly improves the robustness of the model and generalizes across multiple attacks and datasets.
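
The way the two losses described under RVPT's Solution work together can be sketched on toy vectors. This is a minimal illustration, not the authors' implementation: the function names, the weighting factor `alpha`, and the `temperature` value are assumptions, and the sketch applies the repelling term to a single feature vector per image, whereas the paper's deep visual prompt tuning operates inside the visual encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, 1e-12)  # guard against zero vectors


def feature_repelling_loss(prompted, original):
    """Average cosine similarity between prompted and original features.

    Minimizing this value pushes the prompted features away from the
    original ones.
    """
    sims = [cosine(p, o) for p, o in zip(prompted, original)]
    return sum(sims) / len(sims)


def cross_entropy_loss(image_feats, text_feats, labels, temperature=0.01):
    """CLIP-style classification loss over cosine-similarity logits.

    `temperature` is an assumed value, not taken from the paper.
    """
    total = 0.0
    for feat, label in zip(image_feats, labels):
        logits = [cosine(feat, t) / temperature for t in text_feats]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[label]
    return total / len(labels)


def combined_loss(prompted, original, text_feats, labels, alpha=1.0):
    """CE keeps clean accuracy; FR repels class-irrelevant directions.

    `alpha` is a hypothetical weighting factor for this sketch.
    """
    ce = cross_entropy_loss(prompted, text_feats, labels)
    fr = feature_repelling_loss(prompted, original)
    return ce + alpha * fr


# Toy setup: axis 0 and 1 are class directions, axis 2 is class-irrelevant.
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
labels = [0]
original = [[1.0, 0.0, 1.0]]        # class signal plus irrelevant component
unchanged = [[1.0, 0.0, 1.0]]       # prompted features identical to original
repelled = [[1.0, 0.0, -1.0]]       # same class signal, repelled elsewhere

loss_unchanged = combined_loss(unchanged, original, text_feats, labels)
loss_repelled = combined_loss(repelled, original, text_feats, labels)
```

In this toy setup both candidate features are classified correctly, so the cross-entropy term does not distinguish them; the repelling term alone makes the combined loss lower for the feature that moved away from the original along the class-irrelevant axis.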

Defending Multimodal Backdoored Models by Repulsive Visual Prompt Tuning

Palette: Physically-Realizable Backdoor Attacks Against Video Recognition Models

BDetCLIP: Multimodal Prompting Contrastive Test-Time Backdoor Detection

Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats

Adversarial Prompt Tuning for Vision-Language Models

Adversarial Backdoor Defense in CLIP

BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning

Robust Contrastive Language-Image Pre-training against Data Poisoning and Backdoor Attacks

CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning

CleanerCLIP: Fine-grained Counterfactual Semantic Augmentation for Backdoor Defense in Contrastive Learning

Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP

Prompt Backdoors in Visual Prompt Learning

TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models

BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP

Unlearning Backdoor Threats: Enhancing Backdoor Defense in Multimodal Contrastive Learning via Local Token Unlearning

Better Safe than Sorry: Pre-training CLIP against Targeted Data Poisoning and Backdoor Attacks

Poisoning and Backdooring Contrastive Learning

Revisiting the Robust Generalization of Adversarial Prompt Tuning

Adversarial Prompt Distillation for Vision-Language Models

Backdoor Contrastive Learning via Bi-level Trigger Optimization

VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models