Abstract: We introduce a new type of indirect injection vulnerability in language models that operate on images: hidden "meta-instructions" that influence how the model interprets the image and steer the model's outputs to express an adversary-chosen style, sentiment, or point of view. We explain how to create meta-instructions by generating images that act as soft prompts. Unlike jailbreaking attacks and adversarial examples, the outputs resulting from these images are plausible and based on the visual content of the image, yet follow the adversary's (meta-)instructions. We describe the risks of these attacks, including misinformation and spin, evaluate their efficacy for multiple visual language models and adversarial meta-objectives, and demonstrate how they can "unlock" the capabilities of the underlying language models that are unavailable via explicit text instructions. Finally, we discuss defenses against these attacks.

What problem does this paper attempt to address?

This paper studies a new type of indirect injection attack against visual language models (VLMs). The attack influences how the model interprets an image through hidden "meta-instructions" and steers the model toward output that matches the style, sentiment, or point of view chosen by the attacker. Specifically:

1. **Introducing a new type of attack**: The paper presents a new type of indirect injection vulnerability: specially generated image perturbations act as soft prompts and subtly influence the output of visual language models.
2. **Attack mechanism**: The attacker creates images that contain meta-instructions by adding small perturbations to legitimate images. When users ask about these images, the visual language model follows the meta-instructions and generates outputs that serve the attacker's intent. These outputs appear reasonable and are based on the image content, but they are the result of the attacker's manipulation.
3. **Risk assessment**: The authors evaluate the risks of these attacks, including spam dissemination, false information, and manipulation of public opinion. They also demonstrate the effectiveness of the attacks on multiple visual language models and discuss potential defenses.
4. **Technical contributions**: The paper designs, implements, and evaluates a method for creating this new type of image perturbation. The perturbations preserve the visual semantics of the image while steering the model to generate output that follows the meta-instructions. The results show that the perturbations are more effective than explicit instructions and can even unlock capabilities that text instructions could not reach.
5. **Defense and security**: Finally, the paper discusses and evaluates several methods for defending visual language models against such attacks.

### Definition of meta-instructions

In the context of this paper, **meta-instructions** are instructions that control the model's behavior so that the generated output meets specific criteria. They are delivered through the model's visual input, so that the output not only fits the context but also carries the sentiment, style, or slant chosen by the attacker. For example, meta-instructions can steer the model to express positive or negative sentiment, or to insert a specific URL in the answer. Importantly, the output produced under a meta-instruction must remain consistent with the image semantics, so that it stays plausible and keeps the user's trust.

### Formula representation

The effect of meta-instructions can be stated more precisely as follows. Here \( \theta(p, x) \) denotes the model's output for a user question \( p\in P \) about an image \( x\in X \).

- Let \( t^* \) be a meta-instruction that causes the model to generate an output \( y_z\in Y \) satisfying the meta-goal \( z\in Z \) selected by the attacker.
- Define the predicate \( \alpha: Y\times Z\rightarrow \{0, 1\} \), which equals 1 when the output \( y\in Y \) satisfies the meta-goal \( z\in Z \).
- Define the predicate \( \beta: P\times X\times Y\rightarrow \{0, 1\} \), which equals 1 when the output \( y\in Y \) is an appropriate response to the question \( p \) about the image \( x \).

The attack succeeds when the model's output both satisfies the meta-goal and responds appropriately to the user's question:

\[ \alpha(\theta(p, x), z)=\beta(p, x, \theta(p, x)) = 1 \]

In this way, the paper shows how meta-instructions can manipulate the output of visual language models without destroying the image semantics.
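
As an illustration of the success condition above, the following minimal Python sketch treats the model \( \theta \) and the predicates \( \alpha \) and \( \beta \) as plain callables. All names here (`attack_succeeds`, `toy_theta`, `toy_alpha`, `toy_beta`) and the dictionary used as a stand-in for an image are hypothetical and not taken from the paper. A real evaluation would presumably query an actual VLM and use classifiers or judges for the two predicates.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the sets P, X, Y, Z named in the text.
Prompt = str
Image = Dict[str, object]   # toy "image": a label plus a flag, not real pixels
Output = str
MetaGoal = str


def attack_succeeds(
    theta: Callable[[Prompt, Image], Output],
    alpha: Callable[[Output, MetaGoal], int],
    beta: Callable[[Prompt, Image, Output], int],
    p: Prompt,
    x: Image,
    z: MetaGoal,
) -> bool:
    """Check alpha(theta(p, x), z) = beta(p, x, theta(p, x)) = 1."""
    y = theta(p, x)
    return alpha(y, z) == 1 and beta(p, x, y) == 1


def toy_theta(p: Prompt, x: Image) -> Output:
    # Toy model (assumption): describes the label, and adds a positive
    # remark only when the image carries a hidden steering flag.
    base = f"The image shows a {x['label']}."
    return base + " What a wonderful sight!" if x["steered"] else base


def toy_alpha(y: Output, z: MetaGoal) -> int:
    # Toy meta-goal check: "positive" sentiment means a cue word is present.
    return int(z == "positive" and "wonderful" in y)


def toy_beta(p: Prompt, x: Image, y: Output) -> int:
    # Toy appropriateness check: the answer must still mention the content.
    return int(str(x["label"]) in y)


question = "Describe the image."
clean = {"label": "dog", "steered": False}
perturbed = {"label": "dog", "steered": True}

clean_result = attack_succeeds(toy_theta, toy_alpha, toy_beta, question, clean, "positive")
steered_result = attack_succeeds(toy_theta, toy_alpha, toy_beta, question, perturbed, "positive")
```

In this toy setup only the steered image meets both conditions: its answer still describes the dog, so \( \beta \) holds, and it also carries the positive sentiment, so \( \alpha \) holds. Requiring both predicates at once is what separates this setting from an attack whose output ignores the image content.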

Soft Prompts Go Hard: Steering Visual Language Models with Hidden Meta-Instructions
