Language Augmentation in CLIP for Improved Anatomy Detection on Multi-modal Medical Images

Mansi Kakkar, Dattesh Shanbhag, Chandan Aladahalli, Gurunath Reddy M
2024-05-31
Abstract: Vision-language models have emerged as a powerful tool for previously challenging multi-modal classification problems in the medical domain. This development has led to the exploration of automated image description generation for multi-modal clinical scans, particularly for radiology report generation. Existing research has focused on clinical descriptions for specific modalities or body regions, leaving a gap for a model that provides entire-body multi-modal descriptions. In this paper, we address this gap by automating the generation of standardized body station(s) and a list of organ(s) across the whole body in multi-modal MR and CT radiological images. Leveraging the versatility of Contrastive Language-Image Pre-training (CLIP), we refine and augment the existing approach through multiple experiments, including baseline model fine-tuning, adding station(s) as a superset for better correlation between organs, and image and language augmentations. Our proposed approach demonstrates a 47.6% performance improvement over the baseline PubMedCLIP.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses anatomical structure detection in multi-modal medical imaging. Existing studies describe specific modalities or body regions; the authors aim to fill the gap for whole-body descriptions by automatically generating standardized lists of body stations and organs for MR and CT images. The paper builds on the Contrastive Language-Image Pre-training (CLIP) model and refines it through several experiments: fine-tuning the baseline model, adding body stations as a superset of organs to better capture correlations between organs, and applying image and language augmentation. With these changes, the proposed approach achieves a 47.6% performance improvement over the baseline PubMedCLIP model.
The main contributions of the paper are:
1. Analyzing the performance of the PubMedCLIP model on multi-modal, multi-label classification tasks, particularly in describing anatomical structures.
2. Fine-tuning PubMedCLIP on approximately 4,000 clinical scans to improve multi-label anatomy detection.
3. Demonstrating that augmenting both the images and the text phrases further improves model performance.
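The two ideas named above, language augmentation and multi-label detection with a CLIP-style model, can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt templates, the function names `augment_caption` and `predict_labels`, the toy embeddings, and the 0.5 threshold are all assumptions made for illustration. In practice the embeddings would come from the fine-tuned image and text encoders.

```python
import math
import random

# Hypothetical prompt templates; the paper's actual phrasings are not given here.
TEMPLATES = [
    "a {modality} scan of the {station} showing the {organs}",
    "{modality} image, body station: {station}, organs: {organs}",
    "the {organs} are visible in this {station} {modality} image",
]


def augment_caption(modality, station, organs, rng):
    """Return one randomly phrased caption that encodes the same labels."""
    organs = list(organs)
    rng.shuffle(organs)  # organ order carries no meaning, so vary it
    template = rng.choice(TEMPLATES)
    return template.format(
        modality=modality, station=station, organs=", ".join(organs)
    )


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def predict_labels(image_emb, label_embs, threshold=0.5):
    """Multi-label prediction: keep every label whose text embedding is
    similar enough to the image embedding (threshold is an assumption)."""
    return sorted(
        label
        for label, emb in label_embs.items()
        if cosine(image_emb, emb) >= threshold
    )


if __name__ == "__main__":
    rng = random.Random(0)
    # Several text variants for one image/label pair (language augmentation).
    for _ in range(3):
        print(augment_caption("CT", "abdomen", ["liver", "spleen"], rng))

    # Toy 2-D embeddings standing in for encoder outputs.
    label_embs = {"liver": [1.0, 0.0], "heart": [0.0, 1.0], "spleen": [1.0, 1.0]}
    print(predict_labels([1.0, 0.1], label_embs))
```

Because each label is thresholded independently rather than chosen by an argmax, an image can be assigned several organs at once, which is what distinguishes the multi-label setting from standard single-label zero-shot CLIP classification.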