CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly Detection

Xuhai Chen, Jiangning Zhang, Guanzhong Tian, Haoyang He, Wuhao Zhang, Yabiao Wang, Chengjie Wang, Yong Liu

2024-03-02

Abstract: This paper considers zero-shot Anomaly Detection (AD), performing AD without reference images of the test objects. We propose a framework called CLIP-AD to leverage the zero-shot capabilities of the large vision-language model CLIP. Firstly, we reinterpret the text prompts design from a distributional perspective and propose a Representative Vector Selection (RVS) paradigm to obtain improved text features. Secondly, we note opposite predictions and irrelevant highlights in the direct computation of the anomaly maps. To address these issues, we introduce a Staged Dual-Path model (SDP) that leverages features from various levels and applies architecture and feature surgery. Lastly, delving deeply into the two phenomena, we point out that the image and text features are not aligned in the joint embedding space. Thus, we introduce a fine-tuning strategy by adding linear layers and construct an extended model SDP+, further enhancing the performance. Abundant experiments demonstrate the effectiveness of our approach, e.g., on MVTec-AD, SDP outperforms the SOTA WinCLIP by +4.2/+10.7 in segmentation metrics F1-max/PRO, while SDP+ achieves +8.3/+20.5 improvements.
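For readers unfamiliar with the F1-max segmentation metric quoted in the abstract, the following is a minimal sketch of how such a score is commonly computed: the F1 score is evaluated at every candidate threshold over the pixel-level anomaly scores and the maximum is reported. The function name `f1_max` and the NumPy-only implementation are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def f1_max(scores, labels):
    """Best F1 over all score thresholds (illustrative helper, not from the paper).

    scores: anomaly scores, higher means more anomalous (any shape).
    labels: binary ground truth of the same shape, 1 = anomalous.
    """
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=int).ravel()
    n_pos = labels.sum()
    if n_pos == 0:
        return 0.0  # F1 is undefined without positives; return 0 by convention here

    # Sort pixels from most to least anomalous.
    order = np.argsort(-scores, kind="stable")
    s, y = scores[order], labels[order]

    # Predicting the top-k pixels as anomalous gives these cumulative counts.
    tp = np.cumsum(y)
    fp = np.cumsum(1 - y)

    # Only cut between distinct score values, so tied scores share one threshold.
    last_of_run = np.r_[s[1:] != s[:-1], True]
    tp, fp = tp[last_of_run], fp[last_of_run]

    # F1 = 2TP / (2TP + FP + FN), with FN = n_pos - TP.
    f1 = 2 * tp / (tp + fp + n_pos)
    return float(f1.max())

# Usage: four pixels, two of which are truly anomalous.
print(f1_max([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Reported numbers in the literature are usually scaled to percentages; this sketch returns a value in [0, 1].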

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses zero-shot Anomaly Detection (AD): detecting anomalies without any reference images of the test objects. It proposes a framework named CLIP-AD, built on the large vision-language model CLIP, that uses CLIP's zero-shot capabilities for both anomaly classification and anomaly segmentation. The main contributions of the paper include:

1. **A new perspective on text prompt design**: The design of text prompts is reinterpreted from a distributional perspective, and a paradigm called Representative Vector Selection (RVS) is proposed to obtain improved text features for anomaly classification.

2. **Addressing two unexpected phenomena in anomaly segmentation**: The authors identify two issues that arise when anomaly maps are computed directly, namely opposite predictions and irrelevant highlights, and introduce a Staged Dual-Path model (SDP) that uses features from several levels and applies architecture and feature surgery to address them.

3. **Solving the feature alignment problem**: The paper points out that image feature maps and text features are not aligned in CLIP's joint embedding space, and proposes an extended model, SDP+, that is fine-tuned with added linear layers to align them.

4. **Experimental validation**: Experiments show that CLIP-AD outperforms existing methods across multiple datasets, particularly in the pixel-level AUROC, F1-max, and PRO metrics. On MVTec-AD, SDP improves on WinCLIP by +4.2/+10.7 in F1-max/PRO, and SDP+ by +8.3/+20.5.

In summary, the paper proposes a zero-shot anomaly detection method that improves performance through better text prompt design, fixes for two failure modes in anomaly segmentation, and explicit alignment of image and text features.
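To make the idea of directly computing an anomaly map concrete, here is a minimal sketch, assuming per-patch image features and two text features (one for "normal", one for "anomalous") are already available. Each patch is compared to both text features by cosine similarity, and a softmax over the two similarities gives the per-patch anomaly probability. The optional `proj` matrix stands in for an added linear layer that maps image features into the text space. The function names, the temperature value of 0.07, and the use of plain NumPy are assumptions for illustration; this is not the SDP or SDP+ architecture itself.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def anomaly_map(patch_feats, text_feats, proj=None, temperature=0.07):
    """Per-patch anomaly probability from image/text similarity (illustrative).

    patch_feats: (H, W, D_img) patch-level image features.
    text_feats:  (2, D) text features, row 0 = normal, row 1 = anomalous.
    proj:        optional (D_img, D) linear map applied to the patch features
                 (hypothetical stand-in for a learned alignment layer).
    temperature: softmax temperature (assumed value, not taken from the paper).
    """
    if proj is not None:
        patch_feats = patch_feats @ proj

    p = l2_normalize(patch_feats)
    t = l2_normalize(text_feats)

    # Cosine similarity of every patch with both text features: (H, W, 2).
    logits = (p @ t.T) / temperature

    # Numerically stable softmax over the normal/anomalous axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)

    return probs[..., 1]  # probability of the "anomalous" text

# Usage with toy features: one patch matches the anomalous text feature.
text = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
patches = np.tile(text[0], (2, 2, 1))
patches[0, 0] = text[1]
print(anomaly_map(patches, text).round(3))
```

If the image and text features are poorly aligned, the similarities above can rank normal patches higher than anomalous ones, which is one way to picture the opposite-prediction phenomenon the summary describes.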

AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection

AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection

WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation

Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection

Anomaly Detection by Adapting a pre-trained Vision Language Model

Random Word Data Augmentation with CLIP for Zero-Shot Anomaly Detection

PTMNet: Pixel-Text Matching Network for Zero-Shot Anomaly Detection

CLIP3D-AD: Extending CLIP for 3D Few-Shot Anomaly Detection with Multi-View Images Generation

VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

FADE: Few-shot/zero-shot Anomaly Detection Engine using Large Vision-Language Model

GlocalCLIP: Object-agnostic Global-Local Prompt Learning for Zero-shot Anomaly Detection

VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection

APRIL-GAN: A Zero-/Few-Shot Anomaly Classification and Segmentation Method for CVPR 2023 VAND Workshop Challenge Tracks 1&2: 1st Place on Zero-shot AD and 4th Place on Few-shot AD

VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation

PointAD: Comprehending 3D Anomalies from Points and Pixels for Zero-shot 3D Anomaly Detection

Exploring Zero-Shot Anomaly Detection with CLIP in Medical Imaging: Are We There Yet?

A Diffusion-Based Framework for Multi-Class Anomaly Detection

CLIP-FSAC++: Few-Shot Anomaly Classification with Anomaly Descriptor Based on CLIP

Efficient Feature Distillation for Zero-shot Annotation Object Detection