Abstract: The success of Large Language Models (LLMs) has led to a parallel rise in the development of Large Multimodal Models (LMMs), which have begun to transform a variety of applications. These models interpret and analyze complex data by integrating multiple modalities, such as text and images. This paper investigates the applicability and effectiveness of prompt-engineered LMMs that process both images and text, including LLaVA, BakLLaVA, Moondream, Gemini-pro-vision, and GPT-4o, compared to fine-tuned Vision Transformer (ViT) models in addressing critical security challenges. We focus on two distinct security tasks: 1) a visually evident task of detecting simple triggers, such as small pixel variations in images that could be exploited to access potential backdoors in the models, and 2) a visually non-evident task of malware classification through visual representations. In the visually evident task, some LMMs, such as Gemini-pro-vision and GPT-4o, show the potential to achieve good performance with careful prompt engineering, with GPT-4o achieving the highest accuracy and F1-score, 91.9% and 91%, respectively. However, the fine-tuned ViT models achieve perfect performance on this task, owing to the simplicity of the task. For the visually non-evident task, the results diverge significantly: ViT models achieve F1-scores of 97.11% in predicting 25 malware classes and 97.61% in predicting 5 malware families, whereas LMMs perform suboptimally despite iterative prompt improvements. This study showcases the strengths and limitations of prompt-engineered LMMs in cybersecurity applications and emphasizes the unmatched efficacy of fine-tuned ViT models for precise and dependable tasks.

Evaluating the Efficacy of Prompt-Engineered Large Multimodal Models Versus Fine-Tuned Vision Transformers in Image-Based Security Applications
