Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography

Shantanu Ghosh, Clare B. Poynton, Shyam Visweswaran, Kayhan Batmanghelich
2024-05-22
Abstract: The lack of large and diverse training data on Computer-Aided Diagnosis (CAD) in breast cancer detection has been one of the concerns that impedes the adoption of the system. Recently, pre-training with large-scale image text datasets via Vision-Language models (VLM) (e.g., CLIP) partially addresses the issue of robustness and data efficiency in computer vision (CV). This paper proposes Mammo-CLIP, the first VLM pre-trained on a substantial amount of screening mammogram-report pairs, addressing the challenges of dataset diversity and size. Our experiments on two public datasets demonstrate strong performance in classifying and localizing various mammographic attributes crucial for breast cancer detection, showcasing data efficiency and robustness similar to CLIP in CV. We also propose Mammo-FActOR, a novel feature attribution method, to provide spatial interpretation of representation with sentence-level granularity within mammography reports. Code is available publicly: \url{
Image and Video Processing, Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses two major problems faced by Computer-Aided Diagnosis (CAD) systems in breast cancer detection:

1. **Insufficient dataset size and diversity**: Existing breast cancer detection datasets are typically small and lack diversity, which limits the performance and generalization ability of CAD systems. Large-scale, diverse training data are crucial for improving model robustness and data efficiency.
2. **Processing high-resolution images**: General-purpose vision-language models such as CLIP usually reduce the image resolution to fit their input dimensions. For high-resolution medical images, this downsampling loses critical semantic visual cues.

To tackle these problems, the paper proposes two components:

- **Mammo-CLIP**: The first vision-language model (VLM) pre-trained on a large number of screening mammogram-report pairs. It aims to improve generalization ability and data efficiency through multi-view supervision (MVS) and data augmentation strategies.
- **Mammo-FActOR**: A new feature attribution method that improves interpretability by aligning visual representations with the textual attributes described in the reports.

The paper reports strong performance for Mammo-CLIP in classifying and localizing various mammographic attributes, with particular gains in data efficiency and robustness. In addition, Mammo-FActOR localizes attributes without relying on ground-truth bounding boxes, which makes the model more practical and easier to interpret.
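To make the idea of CLIP-style pre-training on image-report pairs more concrete, here is a minimal sketch of a symmetric contrastive loss in plain Python. It is not the Mammo-CLIP implementation: the function names, the toy 3-dimensional embeddings, and the temperature of 0.07 are assumptions chosen for illustration, and the multi-view supervision and augmentation terms mentioned above are not shown. The sketch only illustrates the general principle that matched image and report embeddings are pulled together while mismatched ones are pushed apart.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Cross-entropy of one list of logits against the index of the true match."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to come from the same
    image-report pair; every other combination is treated as a negative.
    The temperature value is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: each row should peak on its diagonal entry.
    image_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should peak on its diagonal entry.
    text_to_image = sum(cross_entropy_row([logits[i][j] for i in range(n)], j)
                        for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

# Toy batch: hypothetical embeddings for three image-report pairs.
images = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
reports = [[0.9, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

aligned = clip_style_loss(images, reports)
shuffled = clip_style_loss(images, reports[1:] + reports[:1])  # mismatched pairs
print(f"aligned pairs: {aligned:.4f}, shuffled pairs: {shuffled:.4f}")
```

With correctly paired embeddings the loss is lower than with the shuffled pairing, which is the signal a contrastive objective of this kind uses to align the image and text encoders.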