Abstract: The integration of artificial intelligence (AI) with radiology marks a transformative era in medicine. Vision foundation models have been adopted to enhance radiologic imaging analysis. However, the distinct complexities of 2D and 3D radiologic data pose unique challenges that existing models, pre-trained on general non-medical images, fail to address adequately. To bridge this gap and capitalize on the diagnostic precision required in radiologic imaging, we introduce Radiologic Contrastive Language-Image Pre-training (RadCLIP): a cross-modal vision-language foundational model that harnesses the Vision Language Pre-training (VLP) framework to improve radiologic image analysis. Building upon Contrastive Language-Image Pre-training (CLIP), RadCLIP incorporates a slice pooling mechanism tailored for volumetric image analysis and is pre-trained using a large and diverse dataset of radiologic image-text pairs. RadCLIP was pre-trained to effectively align radiologic images with their corresponding text annotations, creating a robust vision backbone for radiologic images. Extensive experiments demonstrate RadCLIP's superior performance in both uni-modal radiologic image classification and cross-modal image-text matching, highlighting its significant promise for improving diagnostic accuracy and efficiency in clinical settings. Our key contributions include curating a large dataset with diverse 2D/3D radiologic image-text pairs, a slice pooling adapter using an attention mechanism for integrating 2D images, and comprehensive evaluations of RadCLIP on various radiologic downstream tasks.
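
The abstract says RadCLIP builds on CLIP and is pre-trained to align images with their text annotations. As a rough illustration of what that alignment objective usually looks like, here is a minimal sketch of the symmetric contrastive loss used by the original CLIP, written in plain Python to stay dependency-free. The function names, the toy embeddings, and the temperature of 0.07 are assumptions for illustration; the abstract does not state RadCLIP's exact loss or hyperparameters.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Hypothetical helper: pair k of image_embs and text_embs is the matching pair.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Similarity matrix: rows are images, columns are texts.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: row k should select column k.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(logits_t[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy 2-D embeddings: correctly paired versus deliberately swapped captions.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the correctly paired batch yields a much smaller loss than the swapped one, which is the signal that pushes matching image and text embeddings together during pre-training.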

What problem does this paper attempt to address?

The problem this paper attempts to solve is the deficiency of existing vision-based models when processing radiological images. Specifically, most existing vision-based models are pre-trained on general non-medical images and cannot fully cope with the unique complexity of 2D and 3D radiological data, which results in poor performance on radiological image analysis tasks. To bridge this gap and improve the accuracy of radiological image diagnosis, the authors propose the Radiologic Contrastive Language-Image Pre-training (RadCLIP) model.

### Main problems:

1. **Limitations of existing models**: Existing vision-based models have deficiencies when processing radiological images, especially in understanding the subtle pathological features of 2D and 3D radiological images.
2. **Lack of data diversity**: Most existing vision-language models are trained mainly on 2D chest X-rays or CT scan slices and lack sufficient 3D radiological image data, which limits their ability to understand complex anatomical structures.
3. **Challenges of cross-modal tasks**: Aligning radiological images with text annotations requires more powerful models that can handle complex medical terms and image features.

### Solutions:

1. **RadCLIP model**: A contrastive language-image pre-training model built specifically for radiological images (RadCLIP) is introduced and combined with the vision-language pre-training (VLP) framework to improve the accuracy of radiological image analysis.
2. **Dataset**: A large and diverse dataset of radiological image-text pairs has been constructed, covering multiple 2D and 3D imaging modalities, anatomical regions, diseases, and medical conditions.
3. **Slice pooling mechanism**: An attention-based slice pooling adapter has been introduced to integrate different 2D image slices, so that 3D spatial information is better captured.

### Contributions:

1. **Dataset construction**: A dataset containing a large number of 2D and 3D radiological image-text pairs has been collected and organized.
2. **Model training**: The RadCLIP model has been extensively trained using the VLP framework, with the slice pooling mechanism introduced for 3D images.
3. **Performance evaluation**: The performance of RadCLIP on unimodal radiological image classification and cross-modal image-text matching has been evaluated through a series of experiments, demonstrating its potential in clinical applications.

In conclusion, this paper aims to address the deficiencies of existing vision-based models in radiological image analysis and to improve the accuracy and efficiency of diagnosis by proposing the RadCLIP model.
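
The summary describes the slice pooling adapter only as attention-based, so the exact design is not given here. The following is a minimal sketch of one common form of attention pooling, assuming each 2D slice of a volume has already been encoded into an embedding and that a single learned query vector scores the slices. The names `attention_slice_pool` and `query`, and the scaled dot-product scoring, are assumptions for illustration rather than the authors' implementation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_slice_pool(slice_embs, query):
    # slice_embs: one embedding per 2D slice of the volume (all of equal length).
    # query: hypothetical learned vector that scores how relevant each slice is.
    dim = len(query)
    scores = [sum(q * x for q, x in zip(query, emb)) / math.sqrt(dim)
              for emb in slice_embs]
    weights = softmax(scores)
    # Weighted sum of slice embeddings gives one volume-level embedding.
    pooled = [sum(w * emb[d] for w, emb in zip(weights, slice_embs))
              for d in range(dim)]
    return pooled, weights

# Three toy slice embeddings from one volume.
slices = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# A zero query scores every slice equally, so pooling reduces to a plain mean.
mean_pooled, uniform_weights = attention_slice_pool(slices, [0.0, 0.0])
# A non-zero query shifts weight toward slices that align with it.
focused_pooled, focused_weights = attention_slice_pool(slices, [2.0, 0.0])
```

With a zero query the adapter behaves like average pooling across slices; a trained query lets the model emphasize the slices that carry the relevant findings, which is the motivation for using attention instead of a fixed mean or max over the volume.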

RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training

CLIP in Medical Imaging: A Comprehensive Survey

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training in Radiology

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training for X-ray Diagnosis

CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training

PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents

MedCLIP: Contrastive Learning from Unpaired Medical Images and Text

Improving Medical Multi-modal Contrastive Learning with Expert Annotations

A Closer Look at the Explainability of Contrastive Language-Image Pre-training

Improving CLIP Training with Language Rewrites

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

Iclip: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

Language Augmentation in CLIP for Improved Anatomy Detection on Multi-modal Medical Images

Mammo-CLIP: Leveraging Contrastive Language-Image Pre-training (CLIP) for Enhanced Breast Cancer Diagnosis with Multi-view Mammography

CLEFT: Language-Image Contrastive Learning with Efficient Large Language Model and Prompt Fine-Tuning

Diffusion Feedback Helps CLIP See Better