Abstract: 3D medical image analysis is pivotal in numerous clinical applications. However, the scarcity of labeled data and limited generalization capabilities hinder the advancement of AI-empowered models. Radiology reports are easily accessible and can serve as weakly supervised signals, yet large-scale vision-language pre-training (VLP) remains underexplored in 3D medical image analysis. In particular, multi-grained radiology semantics and their correlations across patients have not been sufficiently investigated, which leaves large-scale volume-report data underutilized. Considering intra-patient cross-modal semantic consistency and inter-patient semantic correlations, we propose a multi-task VLP method, MG-3D, pre-trained on large-scale data (47.1K), which addresses these challenges in two ways: 1) establishing the correspondence between volume semantics and the multi-grained medical knowledge of each patient through cross-modal global alignment and complementary modality-guided local reconstruction, so that intra-patient features of different modalities cohesively represent the same semantic content; 2) correlating inter-patient visual semantics based on fine-grained report correlations across patients, while retaining sensitivity to global individual differences via contrastive learning, which enhances the discriminative feature representation. Furthermore, we delve into the scaling law to explore potential performance improvements. Comprehensive evaluations across nine uni- and cross-modal clinical tasks are carried out to assess model efficacy. Extensive experiments on both internal and external datasets demonstrate the superior transferability, scalability, and generalization of MG-3D, showcasing its potential in advancing feature representation for 3D medical image analysis. Code will be available at https://github.com/Xuefeng-Ni/MG-3D.
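The cross-modal global alignment mentioned in the abstract is commonly realized as a symmetric contrastive (InfoNCE) objective over paired volume and report embeddings. The following is a minimal sketch of that general technique, not the MG-3D implementation: the function names `l2_normalize` and `info_nce`, the temperature value 0.07, and the toy embeddings are all assumptions made for illustration, and the local reconstruction and inter-patient correlation objectives are not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(volume_embs, report_embs, temperature=0.07):
    """Symmetric InfoNCE loss over paired embeddings.

    Row i of volume_embs and row i of report_embs are assumed to come
    from the same patient; every other pairing is treated as a negative.
    The temperature is an illustrative default, not a value from the paper.
    """
    v = [l2_normalize(e) for e in volume_embs]
    r = [l2_normalize(e) for e in report_embs]
    n = len(v)

    # Temperature-scaled cosine similarities: logits[i][j] = <v_i, r_j> / T.
    logits = [
        [sum(a * b for a, b in zip(v[i], r[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    volume_to_report = cross_entropy_on_diagonal(logits)
    report_to_volume = cross_entropy_on_diagonal([list(c) for c in zip(*logits)])
    return 0.5 * (volume_to_report + report_to_volume)


# Toy example: two patients with 2-D embeddings (hypothetical values).
volumes = [[1.0, 0.0], [0.0, 1.0]]
matched_reports = [[1.0, 0.0], [0.0, 1.0]]
swapped_reports = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = info_nce(volumes, matched_reports)
loss_swapped = info_nce(volumes, swapped_reports)
```

In this toy setting the loss is close to zero when each volume embedding matches its own report and grows large when the reports are swapped, which is the behavior a global alignment objective of this kind is meant to reward.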

MGI: Multimodal Contrastive pre-training of Genomic and Medical Imaging

Mmformer: Multimodal Medical Transformer for Incomplete Multimodal Learning of Brain Tumor Segmentation

Gene-Induced Multimodal Pre-training for Image-Omic Classification

BrainMVP: Multi-modal Vision Pre-training for Brain Image Analysis using Multi-parametric MRI

Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training

MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning

Using Multi-Task Learning to Improve Diagnostic Performance of Convolutional Neural Networks

MITER: Medical Image–TExt joint adaptive pretRaining with multi-level contrastive learning

Advancing Efficient Brain Tumor Multi-Class Classification -- New Insights from the Vision Mamba Model in Transfer Learning

MoRE: Multi-Modal Contrastive Pre-training with Transformers on X-Rays, ECGs, and Diagnostic Report

ContIG: Self-supervised Multimodal Contrastive Learning for Medical Imaging with Genetics

Medical Vision-Language Pre-Training for Brain Abnormalities

Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training

MGCT: Mutual-Guided Cross-Modality Transformer for Survival Outcome Prediction using Integrative Histopathology-Genomic Features

Contrastive Cross-Modal Pre-Training: A General Strategy for Small Sample Medical Imaging

MBFusion: Multi-modal balanced fusion and multi-task learning for cancer diagnosis and prognosis

Cross‐Modal Graph Contrastive Learning with Cellular Images

MoVL: Exploring Fusion Strategies for the Domain-Adaptive Application of Pretrained Models in Medical Imaging Tasks

MG-3D: Multi-Grained Knowledge-Enhanced 3D Medical Vision-Language Pre-training