Abstract: Multimodal Language Models (MMLMs), such as LLaVA and GPT-4V, have shown zero-shot generalization capabilities for understanding images and text across various domains. However, their effectiveness in open-world visual tasks, particularly anomaly detection under challenging conditions such as low light or poor image quality, has yet to be thoroughly investigated. Assessing the robustness and limitations of MMLMs in these scenarios is essential to ensuring their reliability and safety in real-world applications, where input image quality can vary significantly. To address this gap, we propose a benchmark of 460 images captured under challenging conditions, including low light and blurring, designed specifically to evaluate the anomaly detection capabilities of MMLMs. We assess the performance of state-of-the-art MMLMs, such as Qwen-VL-Max-0809, GPT-4V, Gemini-1.5, Claude3-opus, ERNIE-Bot-4, and SparkDesk-v3.5, across six diverse scenes. Our evaluations indicate that these MMLMs struggle with error detection in adverse scenarios, highlighting the need for further investigation into the underlying causes and potential improvement strategies. To tackle these limitations, we introduce the Anomaly Detection Agent (ADAGENT), an AI agent framework that combines a "Chain of Critical Self-Reflection" (CCS), specialized toolsets, and "Heuristic Retrieval-Augmented Generation" (RAG) to enhance anomaly detection performance with MMLMs. ADAGENT sequentially evaluates abilities such as text generation, semantic understanding, contextual comprehension, key information extraction, reasoning, and logical thinking. With this framework, we demonstrate a 15% to 30% improvement in top-3 accuracy on anomaly detection tasks under adverse conditions, compared with baseline approaches.

Language Models Meet Anomaly Detection for Better Interpretability and Generalizability

A Model-Agnostic Framework for Universal Anomaly Detection of Multi-organ and Multi-modal Images

Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images

Evaluation of Language Models in the Medical Context Under Resource-Constrained Settings

Anomaly Detection by Adapting a pre-trained Vision Language Model

Feasibility of Universal Anomaly Detection without Knowing the Abnormality in Medical Images

Medical Vision-Language Pre-Training for Brain Abnormalities

Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead

Customizing Visual-Language Foundation Models for Multi-modal Anomaly Detection and Reasoning

Vision-Language Models Assisted Unsupervised Video Anomaly Detection

Brainomaly: Unsupervised Neurologic Disease Detection Utilizing Unannotated T1-weighted Brain MR Images

Fine-tuning language model embeddings to reveal domain knowledge: An explainable artificial intelligence perspective on medical decision making

VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection

Rethinking Medical Anomaly Detection in Brain MRI: An Image Quality Assessment Perspective

ADAGENT: Anomaly Detection Agent with Multimodal Large Models in Adverse Environments

Anomaly Detection for Medical Images using Heterogeneous Auto-Encoder

Towards Universal Unsupervised Anomaly Detection in Medical Imaging

Diffusion Models for Medical Anomaly Detection

Research on Anomaly Detection Methodology Combining Large Language Models