T3D: Towards 3D Medical Image Understanding through Vision-Language Pre-training

Che Liu, Cheng Ouyang, Yinda Chen, César Quilodrán-Casas, Lei Ma, Jie Fu, Yike Guo, Anand Shah, Wenjia Bai, Rossella Arcucci

2023-12-05

Abstract: Expert annotation of 3D medical images for downstream analysis is resource-intensive, posing challenges in clinical applications. Visual self-supervised learning (vSSL), though effective for learning visual invariance, neglects domain knowledge from medicine. Vision-language pre-training (VLP), which incorporates medical knowledge into visual representation learning, has shown promising results on 2D images. However, existing VLP approaches become generally impractical when applied to high-resolution 3D medical images, both because of GPU hardware constraints and because downsampling, the intuitive workaround for those constraints, can discard critical details. To address these limitations, we introduce T3D, the first VLP framework designed for high-resolution 3D medical images. T3D incorporates two text-informed pretext tasks: (i) text-informed contrastive learning; (ii) text-informed image restoration. These tasks focus on learning 3D visual representations from high-resolution 3D medical images and integrating clinical knowledge from radiology reports, without distorting information through forced alignment of downsampled volumes with detailed anatomical text. Trained on a newly curated large-scale dataset of 3D medical images and radiology reports, T3D significantly outperforms current vSSL methods in tasks such as organ and tumor segmentation, as well as disease classification. This underlines T3D's potential in representation learning for 3D medical image analysis. All data and code will be available upon acceptance.

Computer Vision and Pattern Recognition, Computation and Language, Machine Learning, Image and Video Processing

What problem does this paper attempt to address?

The paper targets three problems in 3D medical image analysis. First, the expert annotation of 3D medical images needed for downstream analysis is resource-intensive, which hinders clinical applications. Second, although visual self-supervised learning (vSSL) excels at learning visual invariance, it does not incorporate domain knowledge from medicine. Third, existing vision-language pre-training (VLP) methods are constrained by GPU hardware when applied to high-resolution 3D medical images, and downsampling to work around that constraint may discard critical details, which makes these methods impractical. To address these issues, the paper proposes T3D, the first VLP framework specifically designed for high-resolution 3D medical images. T3D learns visual representations through two text-informed pretext tasks: (1) text-informed contrastive learning; (2) text-informed image restoration. These tasks aim to learn visual representations from high-resolution 3D medical images and to incorporate clinical knowledge from radiology reports without distorting information through forced alignment between downsampled volumes and detailed anatomical text. Trained on a newly curated large-scale dataset of 3D medical images and radiology reports, T3D significantly outperforms current vSSL methods in organ segmentation, tumor segmentation, and disease classification, demonstrating its potential for 3D medical image analysis.
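
To make the two kinds of pretext objective concrete, the following is a minimal sketch of generic versions of each: a CLIP-style symmetric InfoNCE loss over paired image and report embeddings, and a mean-squared error computed only over masked voxels. It is an illustration under stated assumptions, not T3D's actual losses, which this page does not specify. The function names, the temperature value of 0.07, and the use of masking as the corruption for restoration are all hypothetical choices made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Generic image-text contrastive loss (assumed form, not the paper's).

    Row i of image_emb and row i of text_emb are treated as a matched pair;
    every other row in the batch acts as a negative.
    """
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    logits = [
        [sum(a * b for a, b in zip(i_vec, t_vec)) / temperature for t_vec in txt]
        for i_vec in img
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


def masked_restoration_loss(original, restored, mask):
    """Mean squared error over masked voxels only (flattened volumes).

    Masking is an assumed corruption; the source only says "image restoration".
    """
    errors = [(o - r) ** 2 for o, r, m in zip(original, restored, mask) if m]
    if not errors:
        raise ValueError("mask selects no voxels")
    return sum(errors) / len(errors)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_reports = [[1.0, 0.0], [0.0, 1.0]]
    swapped_reports = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", symmetric_info_nce(images, matched_reports))
    print("swapped pairs:", symmetric_info_nce(images, swapped_reports))
```

In this toy setup the loss is near zero when each volume embedding points the same way as its own report embedding and grows large when the pairing is swapped, which is the behaviour a contrastive objective relies on. A real pipeline would compute the embeddings with a 3D image encoder and a text encoder operating on sub-volumes rather than on hand-written vectors.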
