Abstract: Mammography is the primary imaging tool for breast cancer diagnosis. Despite significant strides in applying deep learning to interpret mammography images, efforts that focus predominantly on visual features often struggle to generalize across datasets. We hypothesize that integrating additional modalities from radiology practice, notably the linguistic features of reports and manifestation features embodying radiological insights, offers a more powerful, interpretable and generalizable representation. In this paper, we introduce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports. Based on this dataset, we focus on the challenging task of unsupervised pretraining and propose ViKL, an innovative framework that synergizes Visual, Knowledge, and Linguistic features. This framework relies solely on pairing information and does not require pathology labels, which are often challenging to acquire. ViKL employs a triple contrastive learning approach to merge linguistic and knowledge-based insights with visual data, enabling both inter-modality and intra-modality feature enhancement. Our research yields four findings: 1) By integrating reports and manifestations into unsupervised visual pretraining, ViKL substantially improves pathological classification and fosters multimodal interactions. 2) Manifestations enable a novel hard negative sample selection mechanism. 3) The multimodal features transfer across different datasets. 4) The multimodal pretraining approach curbs miscalibration and yields a high-quality representation space. The MVKL dataset and ViKL code are publicly available at https://github.com/wxwxwwxxx/ViKL to support a broad spectrum of future research.
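
To make the inter-modality part of the triple contrastive idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes each modality has already been encoded into a fixed-size embedding, that row i of every matrix comes from the same exam (the pairing information), and that a symmetric InfoNCE loss is summed over the three modality pairs. The function names, the temperature value and the equal weighting of the three terms are illustrative assumptions. Intra-modality enhancement and manifestation-based hard negative selection are not modeled here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE loss; row i of `a` and row i of `b` form a positive pair."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix

    def diagonal_cross_entropy(z):
        # Numerically stable log-softmax over each row; the target is the diagonal.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a-to-b and b-to-a directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def triple_contrastive_loss(visual, knowledge, linguistic, temperature=0.1):
    """Sum of pairwise losses over the three modality pairs (assumed equal weights)."""
    return (
        info_nce(visual, knowledge, temperature)
        + info_nce(visual, linguistic, temperature)
        + info_nce(knowledge, linguistic, temperature)
    )
```

In this sketch only the row pairing drives the loss, which mirrors the abstract's point that no pathology labels are needed: embeddings from the same exam are pulled together, and all other rows in the batch act as negatives.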

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training for X-ray Diagnosis

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training in Radiology

Grounded Knowledge-Enhanced Medical VLP for Chest X-Ray

Knowledge-enhanced visual-language pre-training on chest radiology images

Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge

RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training

DeViDe: Faceted medical knowledge for improved medical vision-language pre-training

CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training

MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning

XLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training

K-Diag: Knowledge-enhanced Disease Diagnosis in Radiographic Imaging

Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity

CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios

MKCL: Medical Knowledge with Contrastive Learning model for radiology report generation

ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features

Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework

MeDSLIP: Medical Dual-Stream Language-Image Pre-training for Fine-grained Alignment

Unified Medical Image Pre-training in Language-Guided Common Semantic Space

MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts

VPL: Visual Proxy Learning Framework for Zero-Shot Medical Image Diagnosis