Abstract: Recent advancements in large-scale visual-language pre-trained models have led to significant progress in zero-/few-shot anomaly detection within natural image domains. However, the substantial domain divergence between natural and medical images limits the effectiveness of these methodologies in medical anomaly detection. This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection. Our approach integrates multiple residual adapters into the pre-trained visual encoder, enabling a stepwise enhancement of visual features across different levels. This multi-level adaptation is guided by multi-level, pixel-wise visual-language feature alignment loss functions, which recalibrate the model's focus from object semantics in natural imagery to anomaly identification in medical images. The adapted features exhibit improved generalization across various medical data types, even in zero-shot scenarios where the model encounters unseen medical modalities and anatomical regions during training. Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models, with an average AUC improvement of 6.24% and 7.33% for anomaly classification, 2.03% and 2.37% for anomaly segmentation, under the zero-shot and few-shot settings, respectively. Source code is available at:

What problem does this paper attempt to address?

The paper addresses anomaly detection (AD) in medical images, with the goal of a model that generalizes under zero-shot and few-shot conditions. Its research objectives are:

1. **Addressing Domain Differences**: Natural and medical images differ substantially, which limits how well large-scale vision-language pre-trained models perform on medical anomaly detection.
2. **Proposing a Lightweight Multi-level Adaptation and Comparison Framework**: The framework repurposes the CLIP model for anomaly detection in medical images, so that the model can adapt to unseen medical modalities and anatomical regions.
3. **Improving Model Generalization**: The model should perform well on known data and also detect anomalies reliably in medical image modalities and anatomical regions it has not seen before.

To meet these objectives, the proposed method has four key components:

- **Multi-level Visual Feature Adapter (MVFA)**: Multiple residual adapters are integrated into the pre-trained visual encoder to enhance visual features progressively at different levels. The adaptation is guided by multi-level, pixel-wise visual-language feature alignment loss functions.
- **Language Feature Formatting**: Text prompts are designed at two levels, state-level and template-level, to describe normal and abnormal states explicitly.
- **Visual-Language Feature Alignment**: Optimizing the alignment loss brings the adapted visual features into agreement with the text features, which improves detection performance.
- **Multi-level Feature Comparison in the Testing Phase**: At test time, the model compares features at multiple levels through zero-shot and few-shot branches to predict image-level anomaly classification and pixel-level anomaly segmentation.

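
The components listed above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the prompt vocabulary, the residual ratio `gamma`, the `temperature`, the single linear layer standing in for each adapter, and the max-over-patches image score are all assumptions chosen for illustration, and the text features are passed in as plain vectors rather than produced by a CLIP text encoder.

```python
import math
from itertools import product

# Hypothetical vocabulary: the summary does not list the paper's actual prompts.
STATE_PHRASES = {
    "normal": ["healthy {}", "normal {}"],
    "abnormal": ["abnormal {}", "{} with lesion"],
}
TEMPLATES = ["a photo of a {}.", "a scan of a {}."]


def build_prompts(organ):
    """Cross every state-level phrase with every template-level wrapper."""
    return {
        label: [t.format(s.format(organ)) for s, t in product(phrases, TEMPLATES)]
        for label, phrases in STATE_PHRASES.items()
    }


def linear(weight, x):
    """Matrix-vector product; stands in for one learned adapter layer."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def residual_mix(feature, adapted, gamma=0.1):
    """Blend adapter output with the frozen feature (gamma is an assumed ratio)."""
    return [gamma * a + (1.0 - gamma) * f for f, a in zip(feature, adapted)]


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def abnormal_probability(patch, text_normal, text_abnormal, temperature=0.07):
    """Softmax over similarities to the [normal, abnormal] text features."""
    logits = [cosine(patch, text_normal) / temperature,
              cosine(patch, text_abnormal) / temperature]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    return exps[1] / sum(exps)


def anomaly_map(levels, adapters, text_normal, text_abnormal, gamma=0.1):
    """Average per-patch abnormal probabilities over encoder levels.

    levels:   one list of patch feature vectors per encoder level
    adapters: one weight matrix per level (hypothetical single-layer adapters)
    """
    scores = [0.0] * len(levels[0])
    for patches, weight in zip(levels, adapters):
        for i, patch in enumerate(patches):
            adapted = residual_mix(patch, linear(weight, patch), gamma)
            prob = abnormal_probability(adapted, text_normal, text_abnormal)
            scores[i] += prob / len(levels)
    return scores


def image_score(scores):
    """Image-level score taken as the max patch score (an assumption)."""
    return max(scores)
```

Calling `build_prompts("brain")` yields four prompts per state, and `anomaly_map` returns one score per patch that could be reshaped into a segmentation map. In training, the adapter weights would be the only parameters updated by the alignment loss, while the encoder stays frozen; that training loop is omitted here.
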
Experiments on medical anomaly detection benchmarks show that the framework outperforms prior state-of-the-art methods, with average AUC gains of 6.24% (zero-shot) and 7.33% (few-shot) for anomaly classification, and 2.03% (zero-shot) and 2.37% (few-shot) for anomaly segmentation. These results suggest that multi-level adaptation of CLIP is an effective route to generalizable medical anomaly detection.
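
The abstract reports its gains as AUC, the area under the ROC curve. As a reminder of what that metric measures, the sketch below computes AUC as the probability that a randomly chosen abnormal sample receives a higher score than a randomly chosen normal one, with ties counted as half. The function name `roc_auc` is hypothetical, and the pairwise form is chosen for clarity rather than speed; the paper's evaluation code may compute it differently. For segmentation, a common convention (assumed here, not confirmed by the summary) is to apply the same computation to flattened per-pixel scores and mask labels.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of abnormal (1) and normal (0) scores."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one normal and one abnormal sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # abnormal sample correctly ranked higher
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

An AUC of 1.0 means every abnormal sample outranks every normal one, and 0.5 corresponds to random ranking.
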

Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images
