ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation

Shengze Li, Jianjian Cao, Peng Ye, Yuhan Ding, Chongjun Tu, Tao Chen
2024-01-29
Abstract: Recently, foundation models such as CLIP and SAM have shown promising performance on the task of Zero-Shot Anomaly Segmentation (ZSAS). However, both CLIP-based and SAM-based ZSAS methods still suffer from non-negligible drawbacks: 1) CLIP primarily focuses on global feature alignment across different inputs, leading to imprecise segmentation of local anomalous parts; 2) SAM tends to generate numerous redundant masks without proper prompt constraints, resulting in complex post-processing requirements. In this work, we propose a CLIP and SAM collaboration framework called ClipSAM for ZSAS. The insight behind ClipSAM is to employ CLIP's semantic understanding capability for anomaly localization and rough segmentation, the output of which is then used as prompt constraints for SAM to refine the anomaly segmentation results. In detail, we introduce a Unified Multi-scale Cross-modal Interaction (UMCI) module that lets language features interact with visual features at multiple scales of CLIP to reason about anomaly positions. We then design a Multi-level Mask Refinement (MMR) module, which uses the positional information as multi-level prompts for SAM to acquire hierarchical levels of masks and merges them. Extensive experiments validate the effectiveness of our approach, which achieves the best segmentation performance on the MVTec-AD and VisA datasets.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper proposes a new framework called ClipSAM for the task of Zero-Shot Anomaly Segmentation (ZSAS). ZSAS aims to accurately locate and segment anomalous regions in images without training samples of the specific target class, a setting common in image analysis and industrial quality inspection.
Current methods based on CLIP or SAM each have limitations. CLIP focuses on global feature alignment, which leads to imprecise segmentation of local anomalous parts. SAM, when not constrained by suitable prompts, generates a large number of redundant masks that require complex post-processing.
ClipSAM addresses these limitations by combining CLIP's semantic understanding with SAM's fine-grained segmentation ability. Specifically, it uses CLIP for anomaly localization and rough segmentation, and then passes this information to SAM as prompts to refine the segmentation. The key innovations described in the paper are as follows:
1. Unified Multi-scale Cross-modal Interaction (UMCI) module: lets language and visual features interact at multiple scales of CLIP to infer anomaly positions.
2. Multi-level Mask Refinement (MMR) module: uses the localization information from CLIP as multi-level prompts for SAM, which generates masks at different levels that are then merged.
Experimental results show that ClipSAM achieves the best segmentation performance on the MVTec-AD and VisA datasets. Compared with existing CLIP-based and SAM-based baselines, ClipSAM shows significant improvements on various metrics.
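The two-stage flow described above (rough localization from CLIP, then prompts for SAM, then merging of masks) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the learned UMCI module is replaced here by a plain cosine-similarity map between patch features and two text embeddings, the SAM call itself is omitted, and the merging step is assumed to be a simple score-weighted average. The function names (`anomaly_map`, `prompts_from_map`, `fuse_masks`), the logit scale of 100, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def anomaly_map(patch_feats, text_feats, grid_hw, logit_scale=100.0):
    """Rough per-patch anomaly probability (stand-in for CLIP-side localization).

    patch_feats: (N, D) patch embeddings, N = H * W.
    text_feats:  (2, D) text embeddings ordered as [normal, anomalous].
    Returns an (H, W) map of softmax probabilities for the anomalous prompt.
    """
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = logit_scale * (p @ t.T)             # (N, 2) scaled cosine similarity
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, 1].reshape(grid_hw)


def prompts_from_map(score_map, threshold=0.5):
    """Turn a rough anomaly map into one box prompt and one point prompt.

    Returns (box, point) with box = (x_min, y_min, x_max, y_max) and
    point = (x, y) at the highest-scoring location, or (None, None)
    if no pixel exceeds the threshold.
    """
    mask = score_map > threshold
    if not mask.any():
        return None, None
    ys, xs = np.nonzero(mask)
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    y, x = np.unravel_index(np.argmax(score_map), score_map.shape)
    return box, (int(x), int(y))


def fuse_masks(masks, scores):
    """Merge candidate masks with a score-weighted average (assumed rule)."""
    w = np.asarray(scores, dtype=float)
    w = w / w.sum()
    stacked = np.stack(masks).astype(float)      # (K, H, W)
    return np.tensordot(w, stacked, axes=1)      # (H, W)
```

In a full pipeline, the box and point returned by `prompts_from_map` would be handed to a promptable segmenter such as SAM, and the masks it returns for the different prompts would be passed to `fuse_masks` together with their confidence scores.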