Large-Scale 3D Medical Image Pre-training with Geometric Context Priors

Linshan Wu, Jiaxin Zhuang, Hao Chen
2024-10-13
Abstract: The scarcity of annotations poses a significant challenge in medical image analysis. Large-scale pre-training has emerged as a promising label-efficient solution, owing to the utilization of large-scale data, large models, and advanced pre-training techniques. However, its development in medical images remains underexplored. The primary challenge lies in harnessing large-scale unlabeled data and learning high-level semantics without annotations. We observe that 3D medical images exhibit consistent geometric context, i.e., consistent geometric relations between different organs, which leads to a promising way for learning consistent representations. Motivated by this, we introduce a simple-yet-effective Volume Contrast (VoCo) framework to leverage geometric context priors for self-supervision. Given an input volume, we extract base crops from different regions to construct positive and negative pairs for contrastive learning. Then we predict the contextual position of a random crop by contrasting its similarity to the base crops. In this way, VoCo encodes the inherent geometric context into model representations, facilitating high-level semantic learning without annotations. Specifically, we (1) introduce the largest medical pre-training dataset PreCT-160K; (2) investigate scaling laws and propose guidelines for tailoring different model sizes to various medical tasks; (3) build a benchmark encompassing 48 medical tasks. Extensive experiments highlight the superiority of VoCo. Code is available at https://github.com/Luffy03/Large-Scale-Medical.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the scarcity of annotations in 3D medical image analysis. Specifically, it uses large-scale pre-training to tackle the following challenges:

1. **Annotation Scarcity**: Medical image analysis requires a large amount of annotated data. These annotations are usually provided by radiologists, which is both time-consuming and expensive, especially for high-dimensional 3D medical images.
2. **Utilization of Large-scale Unlabeled Data**: Learning high-level semantic representations from large amounts of unlabeled data remains an open challenge. Existing self-supervised learning methods mainly rely on reconstructing low-level information and struggle to capture high-level semantics.
3. **Utilization of Geometric Context Priors**: Different organs in 3D medical images exhibit consistent geometric relationships, which can serve as effective prior knowledge for learning consistent representations.

To solve these problems, the authors propose a new framework named Volume Contrast (VoCo). The main contributions of VoCo include:

- **Introducing Geometric Context Priors**: The model predicts the contextual position of a random crop by contrasting its similarity to a set of base crops, which implicitly encodes the geometric context into the model representation.
- **Large-scale Datasets and Models**: PreCT-160K, described as the largest medical image pre-training dataset to date, is constructed. It contains 160K CT volumes (42M slices) and covers a variety of anatomical structures. The scaling laws of models of different sizes are also investigated, and guidelines are provided for tailoring the model size to different tasks.
- **Comprehensive Benchmarking**: A benchmark covering 48 downstream tasks (including segmentation, classification, registration, and vision-language) is established to evaluate the effectiveness of the pre-trained models.
- **Omni-supervised Learning Framework**: An omni-supervised pre-training framework is proposed that combines self-supervised and semi-supervised learning to make full use of both labeled and unlabeled data.

With these components, VoCo significantly improves performance on datasets with limited annotated data and accelerates fine-tuning convergence.
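
The position-prediction idea summarized above (contrast a random crop against base crops and predict where it sits) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a 2D unit square in place of a 3D volume, a regular grid of non-overlapping base crops, overlap ratios as position labels, clamped cosine similarity as the prediction, and a plain mean absolute error as the loss. The function names `overlap_labels`, `predict_position`, and `position_loss` are hypothetical, and the feature vectors stand in for encoder outputs.

```python
import math


def overlap_labels(crop_xy, crop_size, grid=(2, 2)):
    """Fraction of a square random crop that falls inside each base cell.

    Assumption: the volume is a unit square split into grid[0] x grid[1]
    non-overlapping base crops (a 2D stand-in for the 3D case).
    """
    x0, y0 = crop_xy
    x1, y1 = x0 + crop_size, y0 + crop_size
    rows, cols = grid
    cell_h, cell_w = 1.0 / rows, 1.0 / cols
    labels = []
    for r in range(rows):
        for c in range(cols):
            # Side lengths of the intersection between the crop and cell (r, c).
            ix = max(0.0, min(x1, (c + 1) * cell_w) - max(x0, c * cell_w))
            iy = max(0.0, min(y1, (r + 1) * cell_h) - max(y0, r * cell_h))
            labels.append(ix * iy / (crop_size * crop_size))
    return labels


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def predict_position(crop_feat, base_feats):
    """Similarity of the random crop to every base crop, clamped to [0, 1]."""
    return [max(0.0, cosine(crop_feat, b)) for b in base_feats]


def position_loss(pred, labels):
    """Mean absolute gap between predicted similarities and overlap labels
    (an illustrative choice, not necessarily the loss used in the paper)."""
    return sum(abs(p - y) for p, y in zip(pred, labels)) / len(labels)


# A crop centred on the grid overlaps all four base cells equally.
labels = overlap_labels((0.25, 0.25), 0.5)
print(labels)  # [0.25, 0.25, 0.25, 0.25]
```

Under these assumptions, a crop lying entirely inside one base cell gets a label of 1 for that cell and 0 elsewhere, so minimizing the loss pushes its feature toward that base crop's feature and away from the others.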