Abstract: 3D medical image analysis is pivotal in numerous clinical applications, but the scarcity of labeled data and limited generalization hinder the advancement of AI-empowered models. Radiology reports are easily accessible and can serve as weakly supervised signals, yet large-scale vision-language pre-training (VLP) remains underexplored in 3D medical image analysis. Specifically, multi-grained radiology semantics and their correlations across patients have been insufficiently investigated, which leads to underutilization of large-scale volume-report data. Considering intra-patient cross-modal semantic consistency and inter-patient semantic correlations, we propose MG-3D, a multi-task VLP method pre-trained on large-scale data (47.1K), which addresses these challenges in two ways: 1) establishing the correspondence between volume semantics and the multi-grained medical knowledge of each patient through cross-modal global alignment and complementary modality-guided local reconstruction, so that intra-patient features of different modalities cohesively represent the same semantic content; 2) correlating inter-patient visual semantics based on fine-grained report correlations across patients, while keeping sensitivity to global individual differences via contrastive learning, which enhances the discriminative feature representation. Furthermore, we delve into the scaling law to explore potential performance improvements. Comprehensive evaluations across nine uni- and cross-modal clinical tasks are carried out to assess model efficacy. Extensive experiments on both internal and external datasets demonstrate the superior transferability, scalability, and generalization of MG-3D, showcasing its potential in advancing feature representation for 3D medical image analysis. Code will be available at https://github.com/Xuefeng-Ni/MG-3D.
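
To make the idea of cross-modal global alignment via contrastive learning concrete, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss between volume and report embeddings. It is not the MG-3D implementation: the function names, the `temperature` value of 0.07, and the toy 2-dimensional features are all assumptions made for illustration, and the abstract does not specify the exact loss used.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(logits, k):
    """Numerically stable log-softmax of `logits`, evaluated at index k."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[k] - lse


def symmetric_contrastive_loss(volume_feats, report_feats, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, hypothetical helper).

    Row i of `volume_feats` and row i of `report_feats` are assumed to come
    from the same patient (positive pair); all other rows act as negatives.
    """
    v = [l2_normalize(x) for x in volume_feats]
    r = [l2_normalize(x) for x in report_feats]
    n = len(v)
    # Temperature-scaled cosine similarity between every volume and report.
    sim = [
        [sum(a * b for a, b in zip(v[i], r[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Volume-to-report direction: each row should peak on its diagonal entry.
    v2r = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Report-to-volume direction: each column should peak on its diagonal entry.
    r2v = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (v2r + r2v)


# Toy example: two patients with 2-D features (made-up numbers).
volumes = [[1.0, 0.0], [0.0, 1.0]]
matched_reports = [[1.0, 0.0], [0.0, 1.0]]
swapped_reports = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_contrastive_loss(volumes, matched_reports)
misaligned_loss = symmetric_contrastive_loss(volumes, swapped_reports)
```

In this toy setting the loss is close to zero when each volume is paired with its own report and grows large when the reports are swapped, which is the behaviour a global alignment objective of this kind relies on.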

Harnessing the Power of Pre-trained Vision-Language Models for Efficient Medical Report Generation

VMEKNet: Visual Memory and External Knowledge Based Network for Medical Report Generation

PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging

MG-3D: Multi-Grained Knowledge-Enhanced 3D Medical Vision-Language Pre-training

Utilizing Synthetic Data for Medical Vision-Language Pre-training: Bypassing the Need for Real Images

Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation

MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

ViT3D Alignment of LLaMA3: 3D Medical Image Report Generation

Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data?

Medical Vision-Language Pre-Training for Brain Abnormalities

Customizing General-Purpose Foundation Models for Medical Report Generation

Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding

Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training

Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge

MISS: A Generative Pretraining and Finetuning Approach for Med-VQA

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training in Radiology

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity

Improving Medical Report Generation with Adapter Tuning and Knowledge Enhancement in Vision-Language Foundation Models