VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation

Zhen Qu, Xian Tao, Mukesh Prasad, Fei Shen, Zhengtao Zhang, Xinyi Gong, Guiguang Ding

2024-07-17

Abstract: Recently, large-scale vision-language models such as CLIP have demonstrated immense potential in the zero-shot anomaly segmentation (ZSAS) task, utilizing a unified model to directly detect anomalies on any unseen product with painstakingly crafted text prompts. However, existing methods often assume that the product category to be inspected is known and therefore set product-specific text prompts, which is difficult to achieve in data privacy scenarios. Moreover, even the same type of product exhibits significant differences due to specific components and variations in the production process, posing significant challenges to the design of text prompts. To this end, we propose a visual context prompting model (VCP-CLIP) for the ZSAS task based on CLIP. The insight behind VCP-CLIP is to employ visual context prompting to activate CLIP's anomalous semantic perception ability. Specifically, we first design a Pre-VCP module to embed global visual information into the text prompt, thus eliminating the necessity for product-specific prompts. Then, we propose a novel Post-VCP module that adjusts the text embeddings utilizing the fine-grained features of the images. In extensive experiments conducted on 10 real-world industrial anomaly segmentation datasets, VCP-CLIP achieved state-of-the-art performance in the ZSAS task. The code is available at https://github.com/xiaozhen228/VCP-CLIP.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper introduces a visual context prompting model (VCP-CLIP) designed to address the problem of zero-shot anomaly segmentation (ZSAS). The primary goal is to accurately localize and segment anomalies in previously unseen products without relying on customized training data for each product type.

The key challenges addressed by VCP-CLIP include:

1. **Unknown Product Categories**: Existing methods often assume that the category of the product being inspected is known, which is impractical in scenarios with data privacy constraints.
2. **Variability Within Product Types**: Even within the same product category, there can be significant variations due to specific components and differences in the production process.
3. **Overfitting to Specific Text Prompts**: Mapping images and text separately into a joint space without interaction can lead to overfitting to certain text prompts, limiting the model's ability to generalize.

To overcome these challenges, VCP-CLIP proposes two main components:

### Pre-VCP Module

This module integrates global visual information into the text prompt, eliminating the need for product-specific prompts. It uses a small neural network (Mini-Net) to map the global image features into the word embedding space and combines them with learnable vectors representing the product category.

### Post-VCP Module

This module adjusts the text embeddings using fine-grained features from the images. It employs a multi-head attention mechanism to compute atten
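
As a rough illustration of the two modules described above, the following NumPy sketch shows one way the data could flow. It is not the authors' implementation. The function names (`pre_vcp_tokens`, `post_vcp`, `anomaly_map`), all dimensions, the two-layer ReLU stand-in for Mini-Net, the use of element-wise addition to combine visual tokens with the learnable vectors, the residual update of the text embeddings, and the softmax-over-cosine-similarity anomaly score are assumptions made for this example. CLIP's image and text encoders are omitted, and random arrays stand in for their outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pre_vcp_tokens(global_feat, w1, w2, learnable_vecs):
    """Pre-VCP sketch: map a global image feature into the word embedding
    space and combine it with learnable category vectors.

    Assumption: Mini-Net is a two-layer MLP with ReLU, and "combine" is
    element-wise addition.
    """
    hidden = np.maximum(global_feat @ w1, 0.0)            # (d_hid,)
    visual = (hidden @ w2).reshape(learnable_vecs.shape)  # (r, d_word)
    return learnable_vecs + visual


def post_vcp(text_emb, patch_feats, wq, wk, wv, wo, n_heads):
    """Post-VCP sketch: text embeddings attend to patch features through
    multi-head attention, then receive a residual update (assumed)."""
    d = text_emb.shape[1]
    dh = d // n_heads

    def split(x):
        # (n, d) -> (n_heads, n, dh)
        return x.reshape(-1, n_heads, dh).transpose(1, 0, 2)

    q, k, v = split(text_emb @ wq), split(patch_feats @ wk), split(patch_feats @ wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)  # (h, 2, N)
    ctx = (attn @ v).transpose(1, 0, 2).reshape(-1, d)               # (2, d)
    return text_emb + ctx @ wo


def anomaly_map(text_emb, patch_feats, temperature=0.07):
    """Per-patch probability of the "abnormal" text embedding (row 1),
    computed from cosine similarities (assumed scoring rule)."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    return softmax(p @ t.T / temperature, axis=-1)[:, 1]


# Toy dimensions, chosen only for the example.
rng = np.random.default_rng(0)
d_img, d_hid, d_word, r = 16, 12, 8, 2
d, n_heads, side = 8, 2, 3

# Pre-VCP: image-conditioned tokens that replace a product-specific name.
global_feat = rng.normal(size=d_img)
w1 = rng.normal(size=(d_img, d_hid)) * 0.1
w2 = rng.normal(size=(d_hid, r * d_word)) * 0.1
learnable_vecs = rng.normal(size=(r, d_word))
category_tokens = pre_vcp_tokens(global_feat, w1, w2, learnable_vecs)

# Post-VCP: refine [normal, abnormal] text embeddings with patch features.
text_emb = rng.normal(size=(2, d))               # stand-in for encoded prompts
patch_feats = rng.normal(size=(side * side, d))  # stand-in for patch features
wq, wk, wv, wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
refined = post_vcp(text_emb, patch_feats, wq, wk, wv, wo, n_heads)

amap = anomaly_map(refined, patch_feats).reshape(side, side)
print(category_tokens.shape, refined.shape, amap.shape)
```

In this sketch the prompt tokens depend on the image rather than on a product name, which mirrors the stated motivation for Pre-VCP. The attention step lets the text embeddings interact with image features before scoring, which relates to the third challenge listed above.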

AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection

AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection

WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation

GlocalCLIP: Object-agnostic Global-Local Prompt Learning for Zero-shot Anomaly Detection

ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation

CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly Detection

Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP

Random Word Data Augmentation with CLIP for Zero-Shot Anomaly Detection

Towards Alleviating Text-to-Image Retrieval Hallucination for CLIP in Zero-shot Learning

Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection

VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection

Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection

Automatic Prompt Generation and Grounding Object Detection for Zero-Shot Image Anomaly Detection

Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

Anomaly Detection by Adapting a pre-trained Vision Language Model

MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation

CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation