Abstract: Vision-language models (VLMs) are essential for contextual understanding of both visual and textual information. However, their vulnerability to adversarially manipulated inputs presents significant risks, leading to compromised outputs and raising concerns about reliability in VLM-integrated applications. Detecting these malicious prompts is thus crucial for maintaining trust in VLM generations. A major challenge in developing a safeguarding prompt classifier is the lack of a large amount of labeled benign and malicious data. To address the issue, we introduce VLMGuard, a novel learning framework that leverages the unlabeled user prompts in the wild for malicious prompt detection. These unlabeled prompts, which naturally arise when VLMs are deployed in the open world, consist of both benign and malicious information. To harness the unlabeled data, we present an automated maliciousness estimation score for distinguishing between benign and malicious samples within this unlabeled mixture, thereby enabling the training of a binary prompt classifier on top. Notably, our framework does not require extra human annotations, offering strong flexibility and practicality for real-world applications. Extensive experiments show that VLMGuard achieves superior detection results, significantly outperforming state-of-the-art methods. Disclaimer: This paper may contain offensive examples; reader discretion is advised.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to address the vulnerability of vision-language models (VLMs) to malicious prompts in practical applications. Specifically, the paper focuses on how to detect these malicious prompts to ensure that the content generated by VLMs is reliable and safe. Malicious prompts may manipulate VLMs through text or image inputs, resulting in harmful outputs or triggering unexpected behaviors, which is especially concerning in applications involving critical decisions such as personal assistants.

### Main challenges

1. **Difficulty in data annotation**: The main challenge in constructing a reliable malicious-prompt-detection classifier is the lack of a large amount of annotated benign and malicious sample data. Annotating this data is not only time-consuming but also difficult to scale, especially in the context of evolving generative models and diverse user inputs.
2. **Requirement for real-time detection**: To ensure the security of VLMs in actual deployment, a method that can effectively detect malicious prompts in a real-time environment without additional manual annotation is required.

### Solutions

To solve the above problems, the paper proposes a new framework named **VLMGuard**. This framework utilizes unlabeled user-prompt data to achieve malicious-prompt detection through the following steps:

1. **Latent-space extraction and maliciousness estimation**: Extract embeddings from the VLM's representation and identify the latent subspaces related to malicious prompts through singular-value decomposition (SVD). Calculate the norm of each sample projected onto these principal singular vectors as the maliciousness-estimation score. The formula is as follows:

   \[ \kappa_i = \frac{1}{k} \sum_{j=1}^{k} \lambda_j \cdot \langle f_i, v_j \rangle^2 \]

   where \( f_i \) is the embedding of the \( i \)-th sample, \( v_j \) is the \( j \)-th singular vector, \( \lambda_j \) is the corresponding singular value, and \( k \) is the dimension of the subspace.

2. **Binary-classifier training**: According to the maliciousness-estimation score, divide the samples into a potential malicious-prompt set \( M \) and a candidate-benign-prompt set \( B \), and train a binary classifier \( h_\theta \) to distinguish between these two types of samples. The training objective is to minimize the following risk function:

   \[ L_{M,B}(h_\theta) = L^+_{M}(h_\theta) + L^-_{B}(h_\theta) = \mathbb{E}_{(x_v^{\text{prompt}}, x_t^{\text{prompt}}) \in M}\left[1\{h_\theta(x_v^{\text{prompt}}, x_t^{\text{prompt}}) \leq 0\}\right] + \mathbb{E}_{(x_v^{\text{prompt}}, x_t^{\text{prompt}}) \in B}\left[1\{h_\theta(x_v^{\text{prompt}}, x_t^{\text{prompt}}) > 0\}\right] \]

### Experimental results

Experiments show that VLMGuard significantly outperforms existing methods in various types of malicious-prompt-detection tasks. In particular, on the LLaVA and Phi-3 models, VLMGuard has an average 13.21% improvement in AUROC (area under the receiver-operating-characteristic curve), demonstrating its superior performance and robustness in practical applications.

### Summary

VLMGuard provides a novel and effective solution that can use unlabeled data for malicious-prompt detection, thereby improving the reliability and security of VLMs in practical applications.
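The two steps listed under Solutions (the SVD-based score \( \kappa_i \), then a classifier trained on the resulting split) can be illustrated with a minimal NumPy sketch. Several pieces here are assumptions rather than details from the summary: the embeddings are mean-centered before the SVD, the threshold separating \( M \) from \( B \) is a simple quantile of the scores, the classifier is linear and operates on the embeddings, and the 0/1 risk is replaced by a logistic surrogate so it can be minimized by gradient descent. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def maliciousness_scores(embeddings: np.ndarray, k: int = 1) -> np.ndarray:
    """kappa_i = (1/k) * sum_j lambda_j * <f_i, v_j>^2 over the top-k singular vectors."""
    # Assumption: embeddings are mean-centered before the decomposition.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # s holds the singular values lambda_j; rows of vt are the singular vectors v_j.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:k].T               # shape (N, k), entries <f_i, v_j>
    return (proj ** 2 * s[:k]).mean(axis=1)  # average of the k weighted squared projections


def train_linear_classifier(features, labels, lr=0.5, steps=500):
    """Fit h(x) = w.x + b with a logistic surrogate of the 0/1 risk (an assumption)."""
    n, d = features.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad = probs - labels                # gradient of the logistic loss w.r.t. the logits
        w -= lr * features.T @ grad / n
        b -= lr * grad.mean()
    return w, b


# Synthetic stand-in for VLM embeddings: 80 benign and 20 malicious samples,
# where the malicious ones are shifted along one latent direction.
rng = np.random.default_rng(0)
benign = rng.normal(0.0, 0.1, size=(80, 8))
malicious = rng.normal(0.0, 0.1, size=(20, 8))
malicious[:, 0] += 3.0
feats = np.vstack([benign, malicious])

scores = maliciousness_scores(feats, k=1)
# Assumption: the threshold is the 0.8 quantile; samples above it form M, the rest form B.
in_M = scores > np.quantile(scores, 0.8)

w, b = train_linear_classifier(feats, in_M.astype(float))
flagged = feats @ w + b > 0                  # h(x) > 0 is read as "malicious"
```

On this toy mixture the malicious samples dominate the top singular direction, so their scores exceed those of the benign samples and the classifier reproduces the split. Real VLM embeddings are unlikely to separate this cleanly, so the threshold and \( k \) would need to be tuned.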

VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data

Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors

Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models

Safeguarding System Prompts for LLMs

Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models

Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information

On Prompt-Driven Safeguarding for Large Language Models

Certifying LLM Safety against Adversarial Prompting

Efficient Detection of Toxic Prompts in Large Language Models

On Evaluating Adversarial Robustness of Large Vision-Language Models

TrojVLM: Backdoor Attack Against Vision Language Models

Adversarial Prompt Tuning for Vision-Language Models

Goal-Oriented Prompt Attack and Safety Evaluation for LLMs

Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts

Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection

Refusing Safe Prompts for Multi-modal Large Language Models

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

An LLM can Fool Itself: A Prompt-Based Adversarial Attack

Automatic and Universal Prompt Injection Attacks against Large Language Models

Safeguard is a Double-edged Sword: Denial-of-service Attack on Large Language Models