VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection

Liangyu Zhong, Joachim Sicking, Fabian Hüger, Hanno Gottschalk
2024-09-26
Abstract: Semantic segmentation networks have achieved significant success under the assumption of independent and identically distributed data. However, these networks often struggle to detect anomalies from unknown semantic classes due to the limited set of visual concepts they are typically trained on. To address this issue, anomaly segmentation often involves fine-tuning on outlier samples, necessitating additional efforts for data collection, labeling, and model retraining. Seeking to avoid this cumbersome work, we take a different approach and propose to incorporate Vision-Language (VL) encoders into existing anomaly detectors to leverage the semantically broad VL pre-training for improved outlier awareness. Additionally, we propose a new scoring function that enables data- and training-free outlier supervision via textual prompts. The resulting VL4AD model, which includes max-logit prompt ensembling and a class-merging strategy, achieves competitive performance on widely used benchmark datasets, thereby demonstrating the potential of vision-language models for pixel-wise anomaly detection.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the poor performance of existing semantic segmentation networks at detecting anomalies from unknown semantic classes, that is, object types never seen during training. Such out-of-distribution (OOD) objects can lead to unreliable model predictions and, in turn, to safety problems such as traffic accidents. Existing methods usually rely on fine-tuning with collected and annotated outlier samples. This consumes considerable resources, and the resulting model may only recognize OOD inputs that resemble the collected samples while remaining ineffective for other types of OOD inputs. To address these problems, the authors propose a method called Vision-Language Model for Anomaly Detection (VL4AD), which incorporates vision-language encoders into existing anomaly detectors and leverages their semantically broad pre-training to improve sensitivity to unknown anomalies without additional data collection or model retraining. The authors also propose a new scoring function that enables data-free and training-free outlier supervision via textual prompts, which is intended to make the model more flexible and robust in practical applications. In summary, the paper aims to improve pixel-wise anomaly detection, particularly for unknown anomalous objects, by exploiting the strengths of vision-language models.
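To make the idea of prompt-based scoring more concrete, the following is a minimal sketch of how a pixel-wise anomaly score could be computed from vision-language embeddings. It is not the authors' implementation: the function names, the array shapes, the `temperature` parameter, and the way the outlier prompts enter the score are all assumptions made for illustration. It takes the maximum logit over several prompts per known class (one plausible reading of "max-logit prompt ensembling") and optionally raises the score where a pixel matches an outlier prompt. The class-merging strategy is not sketched here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def max_logit_anomaly_score(pixel_emb, inlier_text_emb,
                            outlier_text_emb=None, temperature=0.01):
    """Hypothetical pixel-wise anomaly score (higher means more anomalous).

    pixel_emb:        (H, W, D) per-pixel visual embeddings
    inlier_text_emb:  (K, P, D) P prompt embeddings for each of K known classes
    outlier_text_emb: (M, D) optional embeddings of outlier prompts (assumption)
    temperature:      assumed scaling factor for the cosine similarities
    """
    pix = l2_normalize(pixel_emb)
    txt = l2_normalize(inlier_text_emb)

    # Cosine-similarity logits between every pixel and every prompt: (H, W, K, P).
    logits = np.einsum("hwd,kpd->hwkp", pix, txt) / temperature

    # Prompt ensembling: keep the best-matching prompt per class -> (H, W, K).
    class_logits = logits.max(axis=-1)

    # Max-logit score: a pixel that matches no known class well is suspicious.
    score = -class_logits.max(axis=-1)

    if outlier_text_emb is not None:
        out = l2_normalize(outlier_text_emb)
        outlier_logits = np.einsum("hwd,md->hwm", pix, out) / temperature
        # Assumed combination: add the best outlier-prompt logit to the score.
        score = score + outlier_logits.max(axis=-1)

    return score
```

With toy embeddings, a pixel aligned with a known-class prompt receives a low score, while a pixel aligned only with an outlier prompt receives a high one. In practice the embeddings would come from a pre-trained vision-language encoder, which this sketch deliberately leaves out.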