Evaluating the Efficacy of Prompt-Engineered Large Multimodal Models Versus Fine-Tuned Vision Transformers in Image-Based Security Applications

Fouad Trad, Ali Chehab
2024-06-10
Abstract: The success of Large Language Models (LLMs) has led to a parallel rise in the development of Large Multimodal Models (LMMs), which have begun to transform a variety of applications. These sophisticated multimodal models are designed to interpret and analyze complex data by integrating multiple modalities such as text and images, thereby opening new avenues for a range of applications. This paper investigates the applicability and effectiveness of prompt-engineered LMMs that process both images and text, including models such as LLaVA, BakLLaVA, Moondream, Gemini-pro-vision, and GPT-4o, compared to fine-tuned Vision Transformer (ViT) models in addressing critical security challenges. We focus on two distinct security tasks: 1) a visually evident task of detecting simple triggers, such as small pixel variations in images that could be exploited to access potential backdoors in the models, and 2) a visually non-evident task of malware classification through visual representations. In the visually evident task, some LMMs, such as Gemini-pro-vision and GPT-4o, have demonstrated the potential to achieve good performance with careful prompt engineering, with GPT-4o achieving the highest accuracy and F1-score of 91.9% and 91%, respectively. However, the fine-tuned ViT models exhibit perfect performance in this task, owing to the task's simplicity. For the visually non-evident task, the results highlight a significant divergence in performance, with ViT models achieving F1-scores of 97.11% in predicting 25 malware classes and 97.61% in predicting 5 malware families, whereas LMMs showed suboptimal performance despite iterative prompt improvements. This study not only showcases the strengths and limitations of prompt-engineered LMMs in cybersecurity applications but also emphasizes the unmatched efficacy of fine-tuned ViT models for precise and dependable tasks.
Artificial Intelligence, Cryptography and Security, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper evaluates the performance difference between prompt-engineered large multimodal models (LMMs) and fine-tuned Vision Transformer (ViT) models in image-based security applications. Specifically, the study focuses on two distinct security tasks:

1. **Visually evident task**: Detecting simple triggers, such as small pixel changes in images, which might be exploited to access potential backdoors in the model.
2. **Visually non-evident task**: Classifying malware through visual representations, which requires analyzing complex visual patterns to accurately identify different types of malware.

By comparing the two kinds of model on these tasks, the paper aims to answer the following questions:

- Can prompt-engineered LMMs achieve performance comparable to fine-tuned ViT models in image-based security tasks?
- Which approach has the advantage in each security task?
- How much does prompt engineering improve the performance of LMMs?

Through this research, the paper aims to inform future model selection and application in the field of cybersecurity.
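
To make the two kinds of input more concrete, here is a minimal sketch in Python. It rests on assumptions, since the abstract does not say how the malware images were produced or what the triggers look like. The row-by-row byte-to-grayscale layout, the image width, and the bottom-right square trigger are all illustrative choices, and the names `bytes_to_grayscale` and `stamp_trigger` are hypothetical rather than taken from the paper.

```python
from typing import List

# A grayscale image as a list of rows, each value an intensity in 0..255.
Image = List[List[int]]


def bytes_to_grayscale(data: bytes, width: int) -> Image:
    """Lay out raw bytes row by row as a grayscale image of the given width.

    Assumed layout for illustration: one byte per pixel, with the final row
    zero-padded so that every row has the same length.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    # Number of zero bytes needed to complete the last row.
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def stamp_trigger(image: Image, size: int = 2, value: int = 255) -> Image:
    """Return a copy of the image with a size x size square in the
    bottom-right corner set to `value` (a hypothetical trigger pattern)."""
    height, width = len(image), len(image[0])
    if size > height or size > width:
        raise ValueError("trigger does not fit inside the image")
    stamped = [row[:] for row in image]  # copy rows; the input is not mutated
    for r in range(height - size, height):
        for c in range(width - size, width):
            stamped[r][c] = value
    return stamped


if __name__ == "__main__":
    # Ten placeholder bytes standing in for a binary file's contents.
    image = bytes_to_grayscale(bytes(range(10)), width=4)
    triggered = stamp_trigger(image, size=2, value=255)
    for row in triggered:
        print(row)
```

In this sketch, a trigger detector would have to distinguish `image` from `triggered`, while a malware classifier would assign a class or family label to an image like the one returned by `bytes_to_grayscale`.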