CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly Detection

Xuhai Chen, Jiangning Zhang, Guanzhong Tian, Haoyang He, Wuhao Zhang, Yabiao Wang, Chengjie Wang, Yong Liu
2024-03-02
Abstract: This paper considers zero-shot Anomaly Detection (AD), performing AD without reference images of the test objects. We propose a framework called CLIP-AD to leverage the zero-shot capabilities of the large vision-language model CLIP. Firstly, we reinterpret the text prompts design from a distributional perspective and propose a Representative Vector Selection (RVS) paradigm to obtain improved text features. Secondly, we note opposite predictions and irrelevant highlights in the direct computation of the anomaly maps. To address these issues, we introduce a Staged Dual-Path model (SDP) that leverages features from various levels and applies architecture and feature surgery. Lastly, delving deeply into the two phenomena, we point out that the image and text features are not aligned in the joint embedding space. Thus, we introduce a fine-tuning strategy by adding linear layers and construct an extended model SDP+, further enhancing the performance. Abundant experiments demonstrate the effectiveness of our approach, e.g., on MVTec-AD, SDP outperforms the SOTA WinCLIP by +4.2/+10.7 in segmentation metrics F1-max/PRO, while SDP+ achieves +8.3/+20.5 improvements.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses zero-shot Anomaly Detection (AD): detecting anomalies without any reference images of the test objects. It proposes a framework named CLIP-AD, built on the large vision-language model CLIP, that uses CLIP's zero-shot capabilities for both anomaly classification and anomaly segmentation. The main contributions of the paper are:

1. **A new perspective on text prompt design**: The authors reinterpret text prompt design from a distributional perspective and propose the Representative Vector Selection (RVS) paradigm to obtain improved text features.
2. **Addressing two unexpected phenomena in anomaly segmentation**: When anomaly maps are computed directly, the authors observe two issues, opposite predictions and irrelevant highlights. They introduce the Staged Dual-Path model (SDP), which leverages features from various levels and applies architecture and feature surgery, to address them.
3. **Solving the feature alignment problem**: The authors trace both phenomena to image and text features that are not aligned in the joint embedding space. They propose a fine-tuning strategy that adds linear layers, yielding the extended model SDP+.
4. **Experimental validation**: On MVTec-AD, SDP outperforms the previous state of the art, WinCLIP, by +4.2/+10.7 in the segmentation metrics F1-max/PRO, and SDP+ improves on it by +8.3/+20.5.

In summary, the paper proposes a zero-shot anomaly detection method that improves performance through better text prompt design, fixes for two failure modes in anomaly segmentation, and fine-tuned alignment between image and text features.
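To make the "direct computation" of an anomaly map concrete, the following is a minimal sketch of the baseline idea, not the SDP or SDP+ architecture. It scores each image patch by a softmax over its cosine similarity to a "normal" and an "anomalous" text feature. The function names, the temperature value of 0.07, and the random toy features standing in for real CLIP embeddings are all assumptions made for illustration; none of them are taken from the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def direct_anomaly_map(patch_feats, text_normal, text_anomalous, temperature=0.07):
    """Hypothetical helper: per-patch anomaly probability from text similarity.

    patch_feats:    (H, W, D) patch-level image features
    text_normal:    (D,) text feature describing the normal state
    text_anomalous: (D,) text feature describing the anomalous state
    temperature:    softmax temperature (assumed value, not from the paper)
    Returns an (H, W) map of probabilities assigned to the anomalous text.
    """
    patches = l2_normalize(patch_feats)
    texts = l2_normalize(np.stack([text_normal, text_anomalous]))  # (2, D)
    logits = patches @ texts.T / temperature                       # (H, W, 2)
    logits = logits - logits.max(axis=-1, keepdims=True)           # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs[..., 1]


# Toy data: random vectors stand in for CLIP text and patch embeddings.
rng = np.random.default_rng(0)
D = 8
text_normal = rng.normal(size=D)
text_anomalous = rng.normal(size=D)

# Every patch looks "normal" except one that matches the anomalous text.
patch_feats = np.tile(text_normal, (4, 4, 1))
patch_feats[1, 2] = text_anomalous

amap = direct_anomaly_map(patch_feats, text_normal, text_anomalous)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(amap), amap.shape))
print(amap.shape, peak)
```

In this toy setup the features are aligned with the text by construction, so the map peaks at the planted patch. The summary above notes that with real CLIP patch features this direct computation can produce opposite predictions and irrelevant highlights, which is the motivation given for SDP and SDP+.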