Knowledge-grounded Adaptation Strategy for Vision-language Models: Building Unique Case-set for Screening Mammograms for Residents Training

Aisha Urooj Khan, John Garrett, Tyler Bradshaw, Lonie Salkowski, Jiwoong Jason Jeong, Amara Tariq, Imon Banerjee
2024-05-30
Abstract: Vision-language models (VLMs) pre-trained on natural image and text pairs face a significant barrier when applied to medical contexts due to domain shift. Yet adapting or fine-tuning these VLMs for medical use presents considerable hurdles, including domain misalignment, limited access to extensive datasets, and severe class imbalance. Hence, there is a pressing need for strategies that effectively adapt these VLMs to the medical domain, as such adaptations would prove immensely valuable in healthcare applications. In this study, we propose a framework for tailoring VLMs to the medical domain that employs selective sampling and hard-negative mining to enhance performance in retrieval tasks. We validate the efficacy of our proposed approach by applying it to two distinct VLMs: an in-domain VLM (MedCLIP) and an out-of-domain VLM (ALBEF). We assess the performance of these models both in their original off-the-shelf state and after undergoing our proposed training strategies, using two extensive datasets containing mammograms and their corresponding reports. Our evaluation spans zero-shot, few-shot, and supervised scenarios. With our approach, we observe a notable improvement in Recall@K for the image-text retrieval task.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper examines how to adapt pre-trained vision-language models (VLMs) to the medical domain, specifically to screening mammograms. Applying VLMs pre-trained on natural images and text directly to medical image and text datasets degrades performance because of domain shift: the vocabulary and image features of the medical field differ from those of natural images, the data are imbalanced, and only a small number of rare cases are available.
To address these issues, the paper proposes a knowledge-grounded adaptation strategy that combines selective sampling with hard-negative mining to improve performance in retrieval tasks. The strategy improves model learning by ensuring that negative samples are true negatives and by balancing the representation of rare cases during training. Specifically, cases are grouped by key concepts in breast imaging reports, and image-text pairs from different groups are sampled as negatives.
The approach is validated on two VLMs, MedCLIP (in-domain) and ALBEF (out-of-domain), in zero-shot, few-shot, and supervised learning scenarios. The experimental results show that the proposed strategy significantly improves Recall@K for both image-to-text and text-to-image retrieval. The paper also discusses adjusting the sampling strategy to work with smaller batch sizes, which is particularly useful in resource-constrained environments. Although no significant improvement was observed in some cases, such as image-to-text retrieval with MedCLIP on external datasets, the knowledge-guided adaptation strategy overall provides new insights for training multimodal networks in the medical field.
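The group-based negative sampling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `group_balanced_batches`, its parameters, and the example group labels are all hypothetical, and it assumes every case has already been assigned to exactly one concept group. Each batch draws at most one case per group, so any two cases in a batch belong to different groups and can serve as negatives for each other, and groups are chosen uniformly, so rare groups appear as often as common ones.

```python
import random
from collections import defaultdict


def group_balanced_batches(group_labels, batch_size, num_batches, seed=0):
    """Build batches of case indices with at most one case per concept group.

    group_labels: list where group_labels[i] is the (hypothetical) concept
    group of case i. Returns a list of batches, each a list of indices.
    """
    rng = random.Random(seed)

    # Index the cases by their concept group.
    by_group = defaultdict(list)
    for idx, group in enumerate(group_labels):
        by_group[group].append(idx)

    groups = sorted(by_group)
    if batch_size > len(groups):
        raise ValueError("batch_size cannot exceed the number of groups")

    batches = []
    for _ in range(num_batches):
        # Pick distinct groups uniformly, so rare groups are not under-sampled.
        chosen_groups = rng.sample(groups, batch_size)
        # Pick one case from each chosen group.
        batches.append([rng.choice(by_group[g]) for g in chosen_groups])
    return batches


# Hypothetical, heavily imbalanced labels used only for illustration.
labels = ["group_a"] * 6 + ["group_b"] * 2 + ["group_c"] + ["group_d"]
batches = group_balanced_batches(labels, batch_size=3, num_batches=4)
for batch in batches:
    print(batch, [labels[i] for i in batch])
```

In a contrastive training loop, the indices in each batch would select the image-report pairs to load. Because the batch size is bounded by the number of groups, a sampler of this kind favors small batches.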