Abstract: The scarcity of annotations poses a significant challenge in medical image analysis. Large-scale pre-training has emerged as a promising label-efficient solution, owing to the utilization of large-scale data, large models, and advanced pre-training techniques. However, its development in medical images remains underexplored. The primary challenge lies in harnessing large-scale unlabeled data and learning high-level semantics without annotations. We observe that 3D medical images exhibit consistent geometric context, i.e., consistent geometric relations between different organs, which leads to a promising way for learning consistent representations. Motivated by this, we introduce a simple-yet-effective Volume Contrast (VoCo) framework to leverage geometric context priors for self-supervision. Given an input volume, we extract base crops from different regions to construct positive and negative pairs for contrastive learning. Then we predict the contextual position of a random crop by contrasting its similarity to the base crops. In this way, VoCo encodes the inherent geometric context into model representations, facilitating high-level semantic learning without annotations. Specifically, we (1) introduce the largest medical pre-training dataset PreCT-160K; (2) investigate scaling laws and propose guidelines for tailoring different model sizes to various medical tasks; (3) build a benchmark encompassing 48 medical tasks. Extensive experiments highlight the superiority of VoCo. Code is available at https://github.com/Luffy03/Large-Scale-Medical.

What problem does this paper attempt to address?

This paper addresses the problem of annotation scarcity in 3D medical image analysis. Specifically, it uses large-scale pre-training to tackle the following challenges:

1. **Annotation Scarcity**: Medical image analysis requires a large amount of annotated data. These annotations are usually provided by radiologists, which is both time-consuming and expensive, especially for high-dimensional 3D medical images.
2. **Utilization of Large-scale Unlabeled Data**: Learning high-level semantic representations from a large amount of unlabeled data, without annotations, remains an open challenge. Existing self-supervised learning methods mainly rely on reconstructing low-level information and struggle to capture high-level semantics.
3. **Utilization of Geometric Context Priors**: Different organs in 3D medical images have consistent geometric relationships, and these relationships can serve as effective prior knowledge for learning consistent representations.

To solve these problems, the authors propose a framework named Volume Contrast (VoCo). Its main contributions include:

- **Introducing Geometric Context Priors**: The model compares the similarity between a randomly cropped region and a set of base crops and predicts the random crop's spatial position, which implicitly encodes geometric context into the model representation.
- **Large-scale Datasets and Models**: The authors construct PreCT-160K, the largest medical image pre-training dataset to date, containing 160K CT volumes (42M slices) and covering a variety of anatomical structures. They also explore scaling laws across model sizes and provide guidelines for choosing the model size for different tasks.
- **Comprehensive Benchmarking**: A benchmark covering 48 downstream tasks (including segmentation, classification, registration, and vision-language) is established to evaluate the effectiveness of the pre-trained models.
- **Omni-supervised Pre-training Framework**: A pre-training framework that combines self-supervised and semi-supervised learning is proposed, making full use of both labeled and unlabeled data.

Through these methods, VoCo significantly improves performance on datasets with limited annotated data and accelerates fine-tuning convergence.
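
The position-prediction idea summarized above (contrasting a random crop against a set of base crops) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a 2D grid of non-overlapping base crops instead of 3D volumes, uses the fraction of overlapping area as the position target, clips cosine similarity to [0, 1] as the prediction, and uses a plain L1 distance as the loss; the paper's actual targets and loss may differ. The names `overlap_targets`, `cosine`, and `position_loss` are hypothetical.

```python
import math


def overlap_targets(crop_xy, crop_size, grid, cell):
    """Fraction of the random crop's area that falls inside each base crop.

    Assumption: base crops tile a (grid * cell) x (grid * cell) image
    without overlap, in row-major order. 2D is used only for brevity.
    """
    x0, y0 = crop_xy
    targets = []
    for gy in range(grid):
        for gx in range(grid):
            bx0, by0 = gx * cell, gy * cell
            # Width and height of the intersection rectangle (0 if disjoint).
            w = max(0, min(x0 + crop_size, bx0 + cell) - max(x0, bx0))
            h = max(0, min(y0 + crop_size, by0 + cell) - max(y0, by0))
            targets.append(w * h / (crop_size * crop_size))
    return targets


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def position_loss(crop_feat, base_feats, targets):
    """Mean L1 distance between clipped similarities and overlap targets.

    Assumption: an L1 loss stands in for whatever loss the paper uses.
    """
    preds = [max(0.0, cosine(crop_feat, b)) for b in base_feats]
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


# A 64x64 crop centered in a 128x128 image split into 2x2 base crops
# overlaps each base crop equally.
centered = overlap_targets((32, 32), crop_size=64, grid=2, cell=64)

# A crop that coincides with the first base crop, with features to match.
aligned = overlap_targets((0, 0), crop_size=64, grid=2, cell=64)
base_feats = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
loss = position_loss([1, 0, 0, 0], base_feats, aligned)
```

In this sketch the loss is near zero when the random crop's features resemble the base crops in proportion to how much it overlaps each of them, which is the sense in which position prediction ties representation similarity to geometric context.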

Large-Scale 3D Medical Image Pre-training with Geometric Context Priors
