Abstract: A vision-language model (VLM) pre-trained on natural image-text pairs faces a significant barrier when applied to medical contexts because of domain shift. Adapting or fine-tuning these VLMs for medical use is itself difficult, owing to domain misalignment, limited access to large datasets, and severe class imbalance. Strategies that adapt these VLMs effectively to the medical domain are therefore needed and would be valuable in healthcare applications. In this study, we propose a framework for adapting VLMs to the medical domain that uses selective sampling and hard-negative mining to improve performance on retrieval tasks. We validate the approach on two VLMs: an in-domain VLM (MedCLIP) and an out-of-domain VLM (ALBEF). We assess both models in their original off-the-shelf state and after training with our proposed strategies, using two large datasets of mammograms and their corresponding reports. Our evaluation spans zero-shot, few-shot, and supervised scenarios. With our approach, we observe a notable improvement in Recall@K for the image-text retrieval task.

What problem does this paper attempt to address?

This paper examines how to adapt pre-trained vision-language models (VLMs) to the medical domain, specifically for screening mammograms. Applying VLMs pre-trained on natural images and text directly to medical image-text datasets degrades performance because of domain shift: the vocabulary and image features of the medical domain differ from those of natural images, the data are imbalanced, and rare cases are few in number.

To address these issues, the paper proposes a knowledge-based adaptation strategy that combines selective sampling and hard-negative mining to improve retrieval performance. The strategy ensures that negative samples are true negatives and balances the representation of rare cases during training. Specifically, the authors group cases by key concepts in breast imaging reports and ensure that negatives are sampled from image-text pairs in different groups.

The approach is validated on two VLMs, MedCLIP (an in-domain model) and ALBEF (an out-of-domain model), in zero-shot, few-shot, and supervised settings. The experiments show that the proposed strategy improves Recall@K for both image-to-text and text-to-image retrieval. The paper also discusses adjusting the sampling strategy to reduce the batch size, which is useful in resource-constrained environments. Significant improvements were not observed in every case (for example, MedCLIP's image-to-text retrieval on the external dataset), but overall the knowledge-guided adaptation strategy offers new insights for training multimodal networks in the medical domain.
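The group-based sampling idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `group_aware_batches`, the case identifiers, and the group labels such as "mass" and "calcification" are hypothetical, and the sketch assumes each case has already been assigned to exactly one concept group. It builds each batch from distinct groups, so every in-batch negative comes from a different group, and it picks groups uniformly, so rare groups appear as often as common ones.

```python
import random
from collections import defaultdict


def group_aware_batches(case_groups, batch_size, num_batches, seed=0):
    """Build batches of case ids in which every case comes from a different group.

    case_groups: dict mapping a case id to its (hypothetical) concept group.
    Because no two cases in a batch share a group, any other case in the
    batch can serve as a negative without being a hidden positive.
    """
    rng = random.Random(seed)

    # Index case ids by group.
    by_group = defaultdict(list)
    for case_id, group in case_groups.items():
        by_group[group].append(case_id)

    groups = sorted(by_group)
    if batch_size > len(groups):
        raise ValueError("batch_size cannot exceed the number of groups")

    batches = []
    for _ in range(num_batches):
        # Choose distinct groups uniformly, so rare groups are not
        # under-represented, then draw one case from each chosen group.
        chosen = rng.sample(groups, batch_size)
        batches.append([rng.choice(by_group[g]) for g in chosen])
    return batches


# Toy example with made-up cases and group labels.
case_groups = {
    "c1": "mass", "c2": "mass", "c3": "mass",
    "c4": "calcification",
    "c5": "normal", "c6": "normal",
    "c7": "asymmetry",
}
batches = group_aware_batches(case_groups, batch_size=3, num_batches=4)
```

Under this assumption, limiting `batch_size` to the number of groups is what guarantees distinct groups per batch; a real pipeline would also need a rule for cases whose reports match several concepts.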

Knowledge-grounded Adaptation Strategy for Vision-language Models: Building Unique Case-set for Screening Mammograms for Residents Training

VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge

Few-shot Adaptation of Medical Vision-Language Models

MedUnA: Language guided Unsupervised Adaptation of Vision-Language Models for Medical Image Classification

Medical Vision-Language Pre-Training for Brain Abnormalities

The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models

Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?

Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images

Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering

Disease-informed Adaptation of Vision-Language Models

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training for X-ray Diagnosis

A vision-language model with multi-granular knowledge fusion in medical imaging

Specialist vision-language models for clinical ophthalmology

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training in Radiology

A Textbook Remedy for Domain Shifts: Knowledge Priors for Medical Image Analysis

Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains

Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography

Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review

Towards General Purpose Medical AI: Continual Learning Medical Foundation Model