Multi-View and Multi-Scale Alignment for Contrastive Language-Image Pre-training in Mammography

Yuexi Du, John Onofrey, Nicha C. Dvornek
2024-09-27
Abstract: Contrastive Language-Image Pre-training (CLIP) shows promise in medical image analysis but requires substantial data and computational resources. Due to these restrictions, existing CLIP applications in medical imaging focus mainly on modalities like chest X-rays that have abundant image-report data available, leaving many other important modalities under-explored. Here, we propose the first adaptation of the full CLIP model to mammography, which presents significant challenges due to labeled data scarcity, high-resolution images with small regions of interest, and data imbalance. We first develop a specialized supervision framework for mammography that leverages its multi-view nature. Furthermore, we design a symmetric local alignment module to better focus on detailed features in high-resolution images. Lastly, we incorporate a parameter-efficient fine-tuning approach for large language models pre-trained with medical knowledge to address data limitations. Our multi-view and multi-scale alignment (MaMA) method outperforms state-of-the-art baselines for three different tasks on two large real-world mammography datasets, EMBED and RSNA-Mammo, with only 52% model size compared with the largest baseline.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The main problem this paper attempts to address is the challenges of applying Contrastive Language-Image Pre-training (CLIP) techniques to mammography. Specifically, the paper focuses on the following issues:

1. **Data and Annotation Limitations**: Mammography datasets typically lack corresponding clinical reports, making traditional vision-language pre-training methods difficult to apply directly. Moreover, even if datasets provide image and tabular annotations, these annotations are often insufficient to generate detailed clinical reports.
2. **Multi-view Nature**: Unlike single-view natural images or chest X-rays, each mammography examination usually includes four high-resolution (approximately 2000x2000 pixels) views, corresponding to the left and right craniocaudal (CC) and mediolateral oblique (MLO) views. This multi-view nature introduces issues of bilateral asymmetry and ipsilateral correspondence, requiring the model to handle these characteristics.
3. **Small Region of Interest**: Lesion areas in mammography are usually relatively small, requiring the model to focus on local details in high-resolution images rather than just overall features.
4. **Data Imbalance**: The vast majority of mammography images do not contain cancer, leading to an image-level data imbalance problem, which further exacerbates pixel-level imbalance.

To address these issues, the paper proposes a novel Multi-view and Multi-scale Alignment (MaMA) contrastive language-image pre-training framework, with the main innovations including:

- **Structured Report Construction**: Generating structured clinical reports from tabular data using a template method to address the lack of clinical reports.
- **Multi-view Contrastive Learning**: Leveraging the multi-view nature of mammography to optimize image-image and image-text contrastive losses, learning the correspondence between multi-view images.
- **Symmetric Local Alignment Module**: Actively learning the sentence-patch relationship by calculating the similarity score of each image-text pair, enhancing the model's ability to focus on local details.
- **Parameter-Efficient Fine-Tuning**: Combining pre-trained large language models (LLMs) with parameter-efficient fine-tuning (PEFT) methods to improve the model's understanding of reports while reducing training parameters and GPU memory costs.

The paper validates the MaMA method on two large-scale mammography datasets (EMBED and RSNA-Mammo), showing that the MaMA method significantly outperforms existing baseline methods on multiple tasks, with the model size being only 52% of the largest baseline method.
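
To make the idea of an image-text contrastive loss more concrete, the following is a minimal sketch of the generic CLIP-style symmetric (InfoNCE) objective in plain Python. It is not the paper's implementation: the function names, the toy 2-D embeddings, and the temperature of 0.07 are all assumptions chosen for illustration, and the multi-view and local alignment terms described above are not modeled here.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Generic CLIP-style loss: average of image-to-text and text-to-image terms.

    Pair i in image_embs is assumed to match pair i in text_embs; every other
    combination in the batch is treated as a negative. The temperature value
    is an illustrative assumption, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch (hypothetical embeddings): two images and their two reports.
matched_images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[0.9, 0.1], [0.1, 0.9]]
shuffled_texts = matched_texts[::-1]  # deliberately mismatched pairing

loss_matched = symmetric_contrastive_loss(matched_images, matched_texts)
loss_shuffled = symmetric_contrastive_loss(matched_images, shuffled_texts)
print(f"matched pairs:  {loss_matched:.6f}")
print(f"shuffled pairs: {loss_shuffled:.6f}")
```

In this toy batch the correctly paired embeddings yield a much lower loss than the shuffled ones, which is the behavior a contrastive objective rewards. An image-image variant for multi-view learning could reuse the same function by passing embeddings of two views in place of the image and text embeddings, though how the paper combines these terms is not specified in this summary.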