Abstract: Contrastive Language-Image Pre-training (CLIP) shows promise in medical image analysis but requires substantial data and computational resources. Due to these restrictions, existing CLIP applications in medical imaging focus mainly on modalities like chest X-rays that have abundant image-report data available, leaving many other important modalities under-explored. Here, we propose the first adaptation of the full CLIP model to mammography, which presents significant challenges due to labeled data scarcity, high-resolution images with small regions of interest, and data imbalance. We first develop a specialized supervision framework for mammography that leverages its multi-view nature. Furthermore, we design a symmetric local alignment module to better focus on detailed features in high-resolution images. Lastly, we incorporate a parameter-efficient fine-tuning approach for large language models pre-trained with medical knowledge to address data limitations. Our multi-view and multi-scale alignment (MaMA) method outperforms state-of-the-art baselines for three different tasks on two large real-world mammography datasets, EMBED and RSNA-Mammo, with only 52% model size compared with the largest baseline.

What problem does this paper attempt to address?

The main problem this paper addresses is the difficulty of applying Contrastive Language-Image Pre-training (CLIP) to mammography. Specifically, the paper focuses on the following issues:

1. **Data and Annotation Limitations**: Mammography datasets typically lack corresponding clinical reports, making traditional vision-language pre-training methods difficult to apply directly. Moreover, even when datasets provide image and tabular annotations, these annotations are often insufficient to generate detailed clinical reports.
2. **Multi-view Nature**: Unlike single-view natural images or chest X-rays, each mammography examination usually includes four high-resolution (approximately 2000x2000 pixels) views: the craniocaudal (CC) and mediolateral oblique (MLO) views of the left and right breasts. This multi-view nature introduces bilateral asymmetry and ipsilateral correspondence, both of which the model must handle.
3. **Small Region of Interest**: Lesion areas in mammography are usually relatively small, requiring the model to focus on local details in high-resolution images rather than only on global features.
4. **Data Imbalance**: The vast majority of mammography images do not contain cancer, leading to an image-level data imbalance that further exacerbates the pixel-level imbalance.

To address these issues, the paper proposes a novel Multi-view and Multi-scale Alignment (MaMA) contrastive language-image pre-training framework, whose main innovations are:

- **Structured Report Construction**: Generating structured clinical reports from tabular data using a template method to address the lack of clinical reports.
- **Multi-view Contrastive Learning**: Leveraging the multi-view nature of mammography to optimize image-image and image-text contrastive losses, learning the correspondence between multi-view images.
- **Symmetric Local Alignment Module**: Actively learning the sentence-patch relationship by calculating the similarity score of each image-text pair, enhancing the model's ability to focus on local details.
- **Parameter-Efficient Fine-Tuning**: Combining pre-trained large language models (LLMs) with parameter-efficient fine-tuning (PEFT) methods to improve the model's understanding of reports while reducing the number of trainable parameters and the GPU memory cost.

The paper validates MaMA on two large-scale mammography datasets (EMBED and RSNA-Mammo), showing that it significantly outperforms existing baseline methods on three tasks while using a model only 52% the size of the largest baseline.
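
As a rough illustration of the multi-view contrastive idea summarized above, the following sketch combines an image-image term (two views of the same study) with image-text terms (each view against its report embedding), using a symmetric InfoNCE loss. It is not the paper's implementation: the function names, the equal weighting of the terms, the temperature of 0.07, and the toy two-dimensional embeddings are all assumptions chosen for readability, and a real system would compute the same quantities on encoder outputs with a tensor library.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(queries, keys, temperature=0.07):
    """One-directional InfoNCE: queries[i] should match keys[i] within the batch."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, q_i in enumerate(q):
        logits = [sum(a * b for a, b in zip(q_i, k_j)) / temperature for k_j in k]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def symmetric_info_nce(a, b, temperature=0.07):
    """Average the a->b and b->a directions, as in CLIP-style training."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def multi_view_loss(view1, view2, text, temperature=0.07):
    """Hypothetical combined objective: image-image plus image-text terms.

    The equal weighting of the two terms is an assumption for illustration.
    """
    image_image = symmetric_info_nce(view1, view2, temperature)
    image_text = 0.5 * (
        symmetric_info_nce(view1, text, temperature)
        + symmetric_info_nce(view2, text, temperature)
    )
    return image_image + image_text


# Toy batch of two studies: two views and one report embedding per study.
view1 = [[1.0, 0.1], [0.1, 1.0]]
view2 = [[0.9, 0.2], [0.2, 0.9]]
text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [text[1], text[0]]  # reports deliberately paired with the wrong study

aligned = multi_view_loss(view1, view2, text)
mismatched = multi_view_loss(view1, view2, swapped_text)
print(f"aligned loss: {aligned:.4f}, mismatched loss: {mismatched:.4f}")
```

In this toy setup the loss is lower when each report is paired with the views of its own study than when the reports are swapped, which is the behavior the contrastive objective is meant to reward.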

Multi-View and Multi-Scale Alignment for Contrastive Language-Image Pre-training in Mammography

Mammo-CLIP: Leveraging Contrastive Language-Image Pre-training (CLIP) for Enhanced Breast Cancer Diagnosis with Multi-view Mammography

Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography

PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents

CLIP in Medical Imaging: A Comprehensive Survey

Multi-view Local Co-occurrence and Global Consistency Learning Improve Mammogram Classification Generalisation

RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training

CLEFT: Language-Image Contrastive Learning with Efficient Large Language Model and Prompt Fine-Tuning

UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities

CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training

Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training

Improving Medical Multi-modal Contrastive Learning with Expert Annotations

MedCLIP: Contrastive Learning from Unpaired Medical Images and Text

Language Augmentation in CLIP for Improved Anatomy Detection on Multi-modal Medical Images

MGI: Multimodal Contrastive pre-training of Genomic and Medical Imaging

BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Contrastive learning-guided multi-meta attention network for breast ultrasound video diagnosis

Multi-level Asymmetric Contrastive Learning for Volumetric Medical Image Segmentation Pre-training

MeDSLIP: Medical Dual-Stream Language-Image Pre-training for Fine-grained Alignment

Contrastive Cross-Modal Pre-Training: A General Strategy for Small Sample Medical Imaging

Multi-View Convolutional Neural Networks for Mammographic Image Classification