ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation

Shengze Li, Jianjian Cao, Peng Ye, Yuhan Ding, Chongjun Tu, Tao Chen

2024-01-29

Abstract: Recently, foundation models such as CLIP and SAM have shown promising performance on the task of Zero-Shot Anomaly Segmentation (ZSAS). However, both CLIP-based and SAM-based ZSAS methods still suffer from non-negligible drawbacks: 1) CLIP primarily focuses on global feature alignment across different inputs, leading to imprecise segmentation of local anomalous parts; 2) SAM tends to generate numerous redundant masks without proper prompt constraints, resulting in complex post-processing requirements. In this work, we propose a CLIP and SAM collaboration framework called ClipSAM for ZSAS. The insight behind ClipSAM is to employ CLIP's semantic understanding capability for anomaly localization and rough segmentation, and to use the result as prompt constraints for SAM, which refines the anomaly segmentation. In detail, we introduce a Unified Multi-scale Cross-modal Interaction (UMCI) module that lets language features interact with visual features at multiple scales of CLIP to reason about anomaly positions. Then, we design a Multi-level Mask Refinement (MMR) module, which uses this positional information as multi-level prompts for SAM to acquire hierarchical levels of masks and merges them. Extensive experiments validate the effectiveness of our approach, which achieves the best segmentation performance on the MVTec-AD and VisA datasets.
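The CLIP-side idea of localizing anomalies by comparing visual and language features can be illustrated with a much simpler stand-in than UMCI. The sketch below is not the UMCI module, which is a learned multi-scale component; it only scores each image patch against two text embeddings ("normal" and "anomalous") by cosine similarity and turns the result into a coarse anomaly map. The function name `anomaly_map`, the two-row text layout, and the `temperature` value are assumptions made for illustration, and the inputs are plain arrays rather than real CLIP outputs.

```python
import numpy as np

def anomaly_map(patch_feats, text_feats, grid_hw, temperature=0.07):
    """Coarse per-patch anomaly probabilities from feature similarity.

    patch_feats: (N, D) patch embeddings, N = h * w (hypothetical input).
    text_feats:  (2, D) text embeddings, row 0 = normal, row 1 = anomalous.
    grid_hw:     (h, w) layout of the patches in the image.
    """
    h, w = grid_hw
    # L2-normalize so the dot product is a cosine similarity.
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = p @ t.T / temperature            # shape (N, 2)
    # Numerically stable softmax over the two text classes.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    # Keep the "anomalous" probability and restore the spatial layout.
    return probs[:, 1].reshape(h, w)
```

Thresholding such a map gives a rough segmentation; in the paper that role is played by the output of UMCI, which is then handed to SAM as a prompt constraint.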

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper proposes a new framework called ClipSAM for the task of Zero-Shot Anomaly Segmentation (ZSAS). ZSAS aims to accurately locate and segment anomalous regions in images without any training samples from the target classes, a setting common in image analysis and industrial quality inspection.

Current methods based on CLIP or SAM each have limitations. CLIP focuses on global feature alignment, which leads to imprecise segmentation of local anomalous parts. SAM, without proper prompt constraints, may generate a large number of redundant masks that require complex post-processing.

The ClipSAM framework addresses these limitations by combining CLIP's semantic understanding with SAM's fine-grained segmentation ability. Specifically, it uses CLIP for anomaly localization and rough segmentation, and then passes this information to SAM as prompts to refine the segmentation. The key innovations described in the paper are as follows:

1. Unified Multi-scale Cross-modal Interaction (UMCI) module: lets language and visual features interact at multiple scales of CLIP to infer anomaly positions.
2. Multi-level Mask Refinement (MMR) module: uses the localization information from CLIP as multi-level prompts for SAM, which generates masks at different levels that are then merged.

Experimental results show that ClipSAM achieves the best segmentation performance on the MVTec-AD and VisA datasets, validating the effectiveness of the approach. Compared to existing CLIP-based and SAM-based baselines, ClipSAM shows significant improvements across multiple metrics.
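The hand-off from a rough CLIP segmentation to SAM prompts, and the final merging of masks, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's MMR module: the helper names `prompts_from_rough_mask` and `merge_masks`, the single box and single point per image, the threshold of 0.5, and the weighted-average fusion are all hypothetical choices. No SAM call is made; the code only prepares prompt-like arrays in (x, y) order and fuses masks that a segmenter would return.

```python
import numpy as np

def prompts_from_rough_mask(score_map, threshold=0.5):
    """Derive one box prompt and one point prompt from a rough anomaly map.

    Returns None when no pixel reaches the threshold (nothing to prompt with).
    """
    mask = score_map >= threshold
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)
    # Tight bounding box around all thresholded pixels: [x0, y0, x1, y1].
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])
    # Point prompt at the highest-scoring pixel, in (x, y) order.
    peak_y, peak_x = np.unravel_index(np.argmax(score_map), score_map.shape)
    point = np.array([peak_x, peak_y])
    return {"box": box, "point": point}

def merge_masks(masks, weights):
    """Fuse K candidate masks of shape (H, W) by a weighted average.

    The weights stand in for per-mask confidence scores (an assumption).
    """
    masks = np.asarray(masks, dtype=float)      # shape (K, H, W)
    weights = np.asarray(weights, dtype=float)  # shape (K,)
    return np.tensordot(weights, masks, axes=1) / weights.sum()
```

A usage pattern would be to call `prompts_from_rough_mask` on the CLIP-derived map, pass the resulting box and point to a promptable segmenter, and give the returned masks with their confidences to `merge_masks`. Constraining the segmenter with such prompts is what avoids the large set of redundant masks described above.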

SAM Fails to Segment Anything? – SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More

MedCLIP-SAMv2: Towards Universal Text-Driven Medical Image Segmentation

AM-SAM: Automated Prompting and Mask Calibration for Segment Anything Model

Evaluation Study on SAM 2 for Class-agnostic Instance-level Segmentation

VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation

MedCLIP-SAM: Bridging Text and Image Towards Universal Medical Image Segmentation

PosSAM: Panoptic Open-vocabulary Segment Anything

AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection

WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation

SimSAM: Zero-shot Medical Image Segmentation via Simulated Interaction

Adaptive Prompt Learning with SAM for Few-shot Scanning Probe Microscope Image Segmentation

Segment Anything with Multiple Modalities

MeSAM: Multiscale Enhanced Segment Anything Model for Optical Remote Sensing Images

Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

SAM-LAD: Segment Anything Model Meets Zero-Shot Logic Anomaly Detection

When SAM2 Meets Video Camouflaged Object Segmentation: A Comprehensive Evaluation and Adaptation

FocSAM: Delving Deeply into Focused Objects in Segmenting Anything

BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model

Augmenting Efficient Real-time Surgical Instrument Segmentation in Video with Point Tracking and Segment Anything