VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection

Liangyu Zhong, Joachim Sicking, Fabian Hüger, Hanno Gottschalk

2024-09-26

Abstract: Semantic segmentation networks have achieved significant success under the assumption of independent and identically distributed data. However, these networks often struggle to detect anomalies from unknown semantic classes due to the limited set of visual concepts they are typically trained on. To address this issue, anomaly segmentation often involves fine-tuning on outlier samples, necessitating additional efforts for data collection, labeling, and model retraining. Seeking to avoid this cumbersome work, we take a different approach and propose to incorporate Vision-Language (VL) encoders into existing anomaly detectors to leverage the semantically broad VL pre-training for improved outlier awareness. Additionally, we propose a new scoring function that enables data- and training-free outlier supervision via textual prompts. The resulting VL4AD model, which includes max-logit prompt ensembling and a class-merging strategy, achieves competitive performance on widely used benchmark datasets, thereby demonstrating the potential of vision-language models for pixel-wise anomaly detection.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

Existing semantic segmentation networks perform poorly at detecting anomalies from unknown semantic categories, that is, object types they never saw during training. Such out-of-distribution (OOD) objects can lead to unreliable predictions and, in safety-critical settings, to consequences such as traffic accidents. Existing methods usually address this by fine-tuning on collected and annotated outlier samples. This is resource-intensive, and the resulting model tends to recognize only OOD inputs that resemble the collected samples, so it may still fail on other types of OOD input. To solve these problems, the authors propose Vision-Language Model for Anomaly Detection (VL4AD), which incorporates vision-language models into existing anomaly detectors and uses their broad pre-trained semantic concepts to improve sensitivity to unknown anomalies without additional data collection or model retraining. The authors also propose a new scoring function that enables data-free and training-free outlier supervision through text prompts, which improves the flexibility and robustness of the model in practical applications. In summary, the paper aims to improve pixel-wise anomaly detection, especially for unknown anomalous objects, by exploiting the strengths of vision-language models.
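The idea of prompt-based, training-free scoring can be illustrated with a small sketch. It is not the authors' implementation: the summary does not give the exact scoring function, so the code assumes that every pixel already has an embedding in the same space as the text prompts, that similarity is cosine similarity, and that an optional set of outlier prompts is combined with the inlier score by a simple difference. All names (`cosine`, `anomaly_score`, `score_map`) and the toy embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def anomaly_score(pixel_emb, inlier_prompts, outlier_prompts=()):
    """Higher score = more anomalous (assumed convention).

    inlier_prompts: dict mapping a (possibly merged) class name to a list
        of text-prompt embeddings, i.e. a prompt ensemble for that class.
    outlier_prompts: optional list of text embeddings describing outliers.
    """
    # Max-logit ensembling (assumed form): each class is represented by
    # its best-matching prompt rather than by an average over prompts.
    class_logits = [
        max(cosine(pixel_emb, p) for p in prompts)
        for prompts in inlier_prompts.values()
    ]
    score = -max(class_logits)
    if outlier_prompts:
        # Assumption: outlier supervision raises the score of pixels that
        # match a textual outlier description. The paper's actual
        # combination rule may differ.
        score += max(cosine(pixel_emb, p) for p in outlier_prompts)
    return score


def score_map(pixel_embs, inlier_prompts, outlier_prompts=()):
    """Apply anomaly_score to an H x W grid of pixel embeddings."""
    return [
        [anomaly_score(e, inlier_prompts, outlier_prompts) for e in row]
        for row in pixel_embs
    ]


# Toy 3-dimensional embeddings, invented for illustration only.
inliers = {
    "road": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],  # two prompts, one class
    "car": [[0.0, 1.0, 0.0]],
}
image = [[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]]  # 1 x 2 "image"
scores = score_map(image, inliers)
```

In this toy setup the second pixel matches no inlier prompt and therefore receives a higher score than the first. Class merging could be represented in the same structure by listing the prompts of several related classes under one key, which is again an assumption about the approach rather than a description of it.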
