Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images

Chaoqin Huang, Aofan Jiang, Jinghao Feng, Ya Zhang, Xinchao Wang, Yanfeng Wang
2024-03-19
Abstract: Recent advancements in large-scale visual-language pre-trained models have led to significant progress in zero-/few-shot anomaly detection within natural image domains. However, the substantial domain divergence between natural and medical images limits the effectiveness of these methodologies in medical anomaly detection. This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection. Our approach integrates multiple residual adapters into the pre-trained visual encoder, enabling a stepwise enhancement of visual features across different levels. This multi-level adaptation is guided by multi-level, pixel-wise visual-language feature alignment loss functions, which recalibrate the model's focus from object semantics in natural imagery to anomaly identification in medical images. The adapted features exhibit improved generalization across various medical data types, even in zero-shot scenarios where the model encounters unseen medical modalities and anatomical regions during training. Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models, with an average AUC improvement of 6.24% and 7.33% for anomaly classification, 2.03% and 2.37% for anomaly segmentation, under the zero-shot and few-shot settings, respectively. Source code is available at:
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses anomaly detection (AD) in medical images, with a focus on building a model that generalizes under zero-shot and few-shot settings. Its research objectives are:

1. **Addressing Domain Differences**: Natural and medical images differ substantially, which limits how well large-scale vision-language pre-trained models transfer to medical anomaly detection.
2. **Proposing a Lightweight Multi-level Adaptation and Comparison Framework**: The framework repurposes the CLIP model for anomaly detection in medical images, so that the model can adapt to unseen medical modalities and anatomical regions.
3. **Improving Model Generalization**: The model should perform well not only on data types seen during training but also on previously unseen medical imaging modalities and anatomical regions.

To achieve these objectives, the paper proposes a method with the following key components:

- **Multi-level Visual Feature Adapter (MVFA)**: Multiple residual adapters are integrated into the pre-trained visual encoder to enhance visual features step by step at different levels. The adaptation is guided by multi-level, pixel-wise visual-language feature alignment loss functions.
- **Language Feature Formatting**: Text prompts are designed at two levels, the state level and the template level, to describe normal and abnormal states clearly.
- **Visual-Language Feature Alignment**: Optimizing the alignment loss brings the adapted visual features into agreement with the text features, which improves detection performance.
- **Multi-level Feature Comparison in the Testing Phase**: At test time, the model compares features at multiple levels through a zero-shot branch and a few-shot branch to predict image-level anomaly classification and pixel-level anomaly segmentation.
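The components above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the bottleneck adapter shape, the blending ratio `gamma`, the `temperature` value, and the use of the map maximum as an image-level score are all assumptions made for illustration. It shows a residual adapter blended with frozen encoder features, and a patch-wise comparison against two text embeddings (normal, abnormal) that yields a per-level anomaly map, averaged across levels.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def residual_adapter(features, w_down, w_up, gamma=0.1):
    """Hypothetical bottleneck adapter mixed with the frozen features.

    features: (N, D) patch tokens from one encoder level.
    w_down:   (D, H) and w_up: (H, D) are the only trainable weights.
    gamma:    assumed residual ratio; gamma=0 returns the frozen features.
    """
    adapted = np.maximum(features @ w_down, 0.0) @ w_up  # linear, ReLU, linear
    return gamma * adapted + (1.0 - gamma) * features


def anomaly_map(patch_features, text_features, grid, temperature=0.07):
    """Compare each patch with the [normal, abnormal] text embeddings.

    patch_features: (N, D); text_features: (2, D); grid: (rows, cols), rows*cols == N.
    Returns the per-patch probability of the abnormal prompt, shaped like grid.
    """
    logits = l2_normalize(patch_features) @ l2_normalize(text_features).T
    logits = logits / temperature
    logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs[:, 1].reshape(grid)


def multi_level_scores(level_features, adapters, text_features, grid):
    """Average the anomaly maps from several adapted levels.

    adapters: one (w_down, w_up, gamma) tuple per level.
    The image-level score is taken as the map maximum (a simplification).
    """
    maps = [
        anomaly_map(residual_adapter(f, *a), text_features, grid)
        for f, a in zip(level_features, adapters)
    ]
    seg = np.mean(maps, axis=0)
    return seg, float(seg.max())
```

In a full pipeline the low-resolution map would be upsampled to the input image size before computing a pixel-wise loss or segmentation metric; that step is omitted here to keep the sketch short.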
Experiments on medical anomaly detection benchmarks show that the proposed framework outperforms existing state-of-the-art methods under both zero-shot and few-shot settings: the average AUC improves by 6.24% (zero-shot) and 7.33% (few-shot) for anomaly classification, and by 2.03% and 2.37% for anomaly segmentation. These results indicate that adapting a vision-language model at multiple feature levels is an effective approach to medical image anomaly detection, including for modalities and anatomical regions not seen during training.