Abstract: Masked Autoencoders (MAEs) have been shown to be effective in pre-training Vision Transformers (ViTs) for natural and medical image analysis problems. By reconstructing missing pixel/voxel information in visible patches, a ViT encoder can aggregate contextual information for downstream tasks. However, existing MAE pre-training methods, which were specifically developed with the ViT architecture, lack the ability to capture geometric shape and spatial information, which is critical for medical image segmentation tasks. In this paper, we propose a novel extension of known MAEs for self pre-training (i.e., models pre-trained on the same target dataset) for 3D medical image segmentation. (1) We propose a new topological loss to preserve geometric shape information by computing topological signatures of both the input and reconstructed volumes, learning geometric shape information. (2) We introduce a pre-text task that predicts the positions of the centers and eight corners of 3D crops, enabling the MAE to aggregate spatial information. (3) We extend the MAE pre-training strategy to a hybrid state-of-the-art (SOTA) medical image segmentation architecture and co-pretrain it alongside the ViT. (4) We develop a fine-tuned model for downstream segmentation tasks by complementing the pre-trained ViT encoder with our pre-trained SOTA model. Extensive experiments on five public 3D segmentation datasets show the effectiveness of our new approach.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to address several key challenges in 3D medical image segmentation:

1. **Capturing geometric shape information**: Existing Masked Autoencoder (MAE) methods do not capture geometric shape information well during pre-training, even though it is crucial for medical image segmentation. For example, existing methods often ignore the overall shape of the object when reconstructing missing pixel/voxel information (as shown in Figure 1).
2. **Exploring global spatial information**: Existing MAE methods mainly focus on reconstructing information from locally occluded sub-volumes and may overlook the global context of the target object.
3. **Compatibility with other common medical image segmentation architectures**: Existing MAE pre-training strategies were mainly developed for the Vision Transformer (ViT) architecture, which limits their adaptability and effectiveness for other architectures, such as those based on Convolutional Neural Networks (CNNs) or hybrid models.

To solve these problems, the authors propose a new extended MAE method for self-supervised pre-training for 3D medical image segmentation. Specifically, the method includes the following components:

1. **Topological loss**: Extract geometric shape information by computing topological features of the input and reconstructed volumes. The method uses cubical complexes to compute topological signatures and adopts an optimal transport distance (the 2-Wasserstein distance) to define a new topological loss.
2. **Pre-text task**: Introduce a pre-text task that predicts the positions of the center and eight corner points of the 3D cropped region, enabling the model to aggregate spatial information.
3. **Extension to hybrid architectures**: Extend the MAE pre-training strategy to hybrid state-of-the-art (SOTA) medical image segmentation architectures (such as UNETR++) and co-pre-train them with the ViT.
4. **Fine-tuning the model for downstream tasks**: Build a fine-tuning model by combining the pre-trained ViT encoder with the pre-trained UNETR++ model to improve performance on downstream segmentation tasks.

With these improvements in place, the authors conducted extensive experiments on five publicly available 3D segmentation datasets to verify the effectiveness of the new method.
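To make the topological loss more concrete, the following is a minimal sketch of a 2-Wasserstein distance between two persistence diagrams, the kind of topological signature a cubical complex produces. It is not the authors' implementation. The function names are hypothetical, the Euclidean ground metric is an assumption (the summary does not state which one the paper uses), and the brute-force search over matchings is only practical for a handful of points. The diagrams are taken as given here; computing them from a volume would require a persistent homology library.

```python
from itertools import permutations
from math import sqrt
from typing import List, Tuple

Point = Tuple[float, float]  # one topological feature as (birth, death)


def diagonal_projection(p: Point) -> Point:
    """Closest point to p on the diagonal birth == death."""
    mid = (p[0] + p[1]) / 2.0
    return (mid, mid)


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance (assumed ground metric)."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def wasserstein2(diag_a: List[Point], diag_b: List[Point]) -> float:
    """2-Wasserstein distance between two small persistence diagrams.

    Each diagram is padded with diagonal slots so that any feature may be
    matched either to a feature of the other diagram or to the diagonal.
    """
    n, m = len(diag_a), len(diag_b)
    size = n + m
    # Rows: n points of A, then m diagonal slots.
    # Columns: m points of B, then n diagonal slots.
    cost = [[0.0] * size for _ in range(size)]
    for i in range(n):
        to_diag = sq_dist(diag_a[i], diagonal_projection(diag_a[i]))
        for j in range(m):
            cost[i][j] = sq_dist(diag_a[i], diag_b[j])
        for j in range(m, size):
            cost[i][j] = to_diag
    for j in range(m):
        to_diag = sq_dist(diag_b[j], diagonal_projection(diag_b[j]))
        for i in range(n, size):
            cost[i][j] = to_diag
    # Diagonal-to-diagonal entries stay at 0.
    # Brute force over all matchings: illustration only, factorial cost.
    best = min(
        sum(cost[i][perm[i]] for i in range(size))
        for perm in permutations(range(size))
    )
    return sqrt(best)


if __name__ == "__main__":
    input_diagram = [(0.0, 1.0), (0.2, 0.9)]
    recon_diagram = [(0.0, 1.2)]
    print(wasserstein2(input_diagram, recon_diagram))
```

In a training setting this quantity would need to be differentiable with respect to the reconstructed volume, which this sketch does not attempt; it only shows how the matching-based distance is defined.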
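The regression targets for the position-prediction pre-text task can be sketched in the same spirit. The example below assumes the targets are the crop's center and eight corners, expressed as coordinates normalized by the size of the full volume. That parameterization, the function name `crop_landmarks`, and the `(z, y, x)` axis order are illustrative assumptions, not details confirmed by the summary.

```python
from itertools import product
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def crop_landmarks(
    origin: Sequence[int],
    crop_size: Sequence[int],
    volume_size: Sequence[int],
) -> List[Coord]:
    """Return nine normalized (z, y, x) points for a 3D crop.

    The first point is the crop center; the remaining eight are its corners.
    Coordinates are divided by the full volume size, so they lie in [0, 1].
    """
    ends = [o + s for o, s in zip(origin, crop_size)]
    for o, e, v in zip(origin, ends, volume_size):
        if o < 0 or e > v:
            raise ValueError("crop must lie inside the volume")
    center = tuple((o + e) / 2.0 / v for o, e, v in zip(origin, ends, volume_size))
    # product over (start, end) per axis enumerates all eight corners.
    corners = [
        tuple(c / v for c, v in zip(corner, volume_size))
        for corner in product(*zip(origin, ends))
    ]
    return [center] + corners


if __name__ == "__main__":
    for point in crop_landmarks((0, 16, 32), (32, 32, 32), (64, 64, 64)):
        print(point)
```

A network head would then regress these 27 numbers (nine points with three coordinates each) from the encoded crop, which is one way to encourage the encoder to learn where a crop sits within the whole volume.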

Self Pre-training with Topology- and Spatiality-aware Masked Autoencoders for 3D Medical Image Segmentation

Advancing Volumetric Medical Image Segmentation via Global-Local Masked Autoencoder

MiM: Mask in Mask Self-Supervised Pre-Training for 3D Medical Image Analysis

Medical supervised masked autoencoders: Crafting a better masking strategy and efficient fine-tuning schedule for medical image classification

GMIM: Self-supervised pre-training for 3D medical image segmentation with adaptive and hierarchical masked image modeling

Revisiting MAE pre-training for 3D medical image segmentation

GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds

Tissue-Contrastive Semi-Masked Autoencoders for Segmentation Pretraining on Chest CT

Tube Masking-Based MAE Pre-Training for Three-Dimensional Lumbar Vertebrae Segmentation

Mapping medical image-text to a joint space via masked modeling

MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder

Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

HybridMIM: A Hybrid Masked Image Modeling Framework for 3D Medical Image Segmentation

3D Masked Autoencoders with Application to Anomaly Detection in Non-Contrast Enhanced Breast MRI

SAM-Med3D-MoE: Towards a Non-Forgetting Segment Anything Model via Mixture of Experts for 3D Medical Image Segmentation

GeoMAE: Masked Geometric Target Prediction for Self-supervised Point Cloud Pre-Training

Unleashing the Potential of Vision-Language Pre-Training for 3D Zero-Shot Lesion Segmentation via Mask-Attribute Alignment

Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training

AMAES: Augmented Masked Autoencoder Pretraining on Public Brain MRI Data for 3D-Native Segmentation

Masked autoencoders are effective solution to transformer data-hungry