T3D: Towards 3D Medical Image Understanding through Vision-Language Pre-training

Che Liu, Cheng Ouyang, Yinda Chen, César Quilodrán-Casas, Lei Ma, Jie Fu, Yike Guo, Anand Shah, Wenjia Bai, Rossella Arcucci
2023-12-05
Abstract: Expert annotation of 3D medical images for downstream analysis is resource-intensive, posing challenges in clinical applications. Visual self-supervised learning (vSSL), though effective for learning visual invariance, neglects the incorporation of domain knowledge from medicine. To incorporate medical knowledge into visual representation learning, vision-language pre-training (VLP) has shown promising results on 2D images. However, existing VLP approaches become generally impractical when applied to high-resolution 3D medical images, both because of GPU hardware constraints and because downsampling, the intuitive workaround for those constraints, can lose critical details. To address these limitations, we introduce T3D, the first VLP framework designed for high-resolution 3D medical images. T3D incorporates two text-informed pretext tasks: (i) text-informed contrastive learning; (ii) text-informed image restoration. These tasks focus on learning 3D visual representations from high-resolution 3D medical images and integrating clinical knowledge from radiology reports, without distorting information through forced alignment of downsampled volumes with detailed anatomical text. Trained on a newly curated large-scale dataset of 3D medical images and radiology reports, T3D significantly outperforms current vSSL methods in tasks like organ and tumor segmentation, as well as disease classification. This underlines T3D's potential in representation learning for 3D medical image analysis. All data and code will be available upon acceptance.
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning, Image and Video Processing
What problem does this paper attempt to address?
The paper addresses three related problems in 3D medical image analysis. First, expert annotation of 3D medical images for downstream analysis is resource-intensive, which poses challenges for clinical applications. Second, although visual self-supervised learning (vSSL) is effective at learning visual invariance, it does not incorporate domain knowledge from medicine. Third, existing vision-language pre-training (VLP) methods become impractical on high-resolution 3D medical images: they are constrained by GPU hardware, and downsampling the volumes to fit can discard critical details. To address these issues, the paper proposes T3D, the first VLP framework designed specifically for high-resolution 3D medical images. T3D learns visual representations through two text-informed pretext tasks: (1) text-informed contrastive learning; (2) text-informed image restoration. These tasks aim to learn visual representations from high-resolution 3D medical images and to incorporate clinical knowledge from radiology reports, without distorting information through forced alignment between downsampled volumes and detailed anatomical text. Trained on a newly curated large-scale dataset of 3D medical images and radiology reports, T3D significantly outperforms current vSSL methods on organ segmentation, tumor segmentation, and disease classification, demonstrating its potential for representation learning in 3D medical image analysis.
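To make the idea of a contrastive objective between images and reports concrete, here is a minimal sketch of a generic symmetric InfoNCE loss of the kind widely used in vision-language pre-training. It is an illustration under stated assumptions, not T3D's actual objective: the abstract does not give the loss, and the function names, the temperature of 0.07, and the toy 2-D embeddings are all hypothetical. Embeddings are assumed to be non-zero vectors, with row i of the image batch paired with row i of the text batch.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Generic image-text contrastive loss (hypothetical stand-in, not T3D's).

    Matched pairs share an index; every other pairing in the batch
    serves as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy embeddings: correctly paired vs. deliberately mismatched reports.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is lower when each image embedding points the same way as its paired text embedding than when the pairs are swapped, which is the behavior a contrastive objective is meant to reward. How T3D conditions this kind of objective on report text for high-resolution volumes is described in the paper itself and is not reproduced here.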