VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation

Zhen Qu, Xian Tao, Mukesh Prasad, Fei Shen, Zhengtao Zhang, Xinyi Gong, Guiguang Ding
2024-07-17
Abstract: Recently, large-scale vision-language models such as CLIP have demonstrated immense potential in the zero-shot anomaly segmentation (ZSAS) task, using a unified model to directly detect anomalies on any unseen product with painstakingly crafted text prompts. However, existing methods often assume that the product category to be inspected is known and therefore set product-specific text prompts, which is difficult to achieve in data privacy scenarios. Moreover, even products of the same type exhibit significant differences due to specific components and variations in the production process, posing significant challenges to the design of text prompts. To this end, we propose a visual context prompting model (VCP-CLIP) for the ZSAS task based on CLIP. The insight behind VCP-CLIP is to employ visual context prompting to activate CLIP's anomalous semantic perception ability. Specifically, we first design a Pre-VCP module to embed global visual information into the text prompt, thus eliminating the necessity for product-specific prompts. Then, we propose a novel Post-VCP module that adjusts the text embeddings using the fine-grained features of the images. In extensive experiments conducted on 10 real-world industrial anomaly segmentation datasets, VCP-CLIP achieved state-of-the-art performance in the ZSAS task. The code is available at <a class="link-external link-https" href="https://github.com/xiaozhen228/VCP-CLIP" rel="external noopener nofollow">this https URL</a>.
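The Pre-VCP idea of embedding global visual information into the text prompt can be illustrated with a minimal sketch. It is not the authors' implementation: the two-layer network, the choice to add the projected feature to each learnable prompt vector, and all names and dimensions (`mini_net`, `build_prompt_tokens`, `feat_dim`, and so on) are assumptions made for illustration only.

```python
import random

def linear(x, weights, bias):
    """Dense layer: `weights` holds one row of input coefficients per output unit."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b for row, b in zip(weights, bias)]

def mini_net(global_feat, layer1, layer2):
    """Hypothetical two-layer MLP projecting a global image feature into word-embedding space."""
    hidden = [max(h, 0.0) for h in linear(global_feat, *layer1)]  # ReLU activation
    return linear(hidden, *layer2)

def build_prompt_tokens(global_feat, learnable_ctx, layer1, layer2):
    """Assumed combination rule: add the projected visual context to every learnable vector."""
    visual_ctx = mini_net(global_feat, layer1, layer2)
    return [[c + v for c, v in zip(ctx, visual_ctx)] for ctx in learnable_ctx]

def random_layer(rng, in_dim, out_dim):
    """Random weights and zero biases, standing in for trained parameters."""
    weights = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim

rng = random.Random(0)
feat_dim, hidden_dim, embed_dim, n_ctx = 8, 4, 6, 2  # toy sizes, not the paper's
layer1 = random_layer(rng, feat_dim, hidden_dim)
layer2 = random_layer(rng, hidden_dim, embed_dim)
global_feat = [rng.gauss(0, 1) for _ in range(feat_dim)]  # stand-in for a CLIP global image feature
learnable_ctx = [[rng.gauss(0, 1) for _ in range(embed_dim)] for _ in range(n_ctx)]

prompt_tokens = build_prompt_tokens(global_feat, learnable_ctx, layer1, layer2)
print(len(prompt_tokens), len(prompt_tokens[0]))  # 2 6
```

Because the prompt vectors depend on the image rather than on a product name, the same prompt template could in principle be reused across unseen products, which is the property the abstract highlights.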
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper introduces a visual context prompting model (VCP-CLIP) designed to address the problem of zero-shot anomaly segmentation (ZSAS). The primary goal is to accurately localize and segment anomalies in previously unseen products without relying on customized training data for each product type.

The key challenges addressed by VCP-CLIP include:

1. **Unknown Product Categories**: Existing methods often assume that the category of the product being inspected is known, which is impractical in scenarios with data privacy constraints.
2. **Variability Within Product Types**: Even within the same product category, there can be significant variations due to specific components and differences in the production process.
3. **Overfitting to Specific Text Prompts**: Mapping images and text separately into a joint space without interaction can lead to overfitting to certain text prompts, limiting the model's ability to generalize.

To overcome these challenges, VCP-CLIP proposes two main components:

### Pre-VCP Module

This module integrates global visual information into the text prompt, eliminating the need for product-specific prompts. It uses a small neural network (Mini-Net) to map the global image features into the word embedding space and combines them with learnable vectors representing the product category.

### Post-VCP Module

This module adjusts the text embeddings using fine-grained features from the images. It employs a multi-head attention mechanism to compute atten