Abstract: Generative self-supervised learning exhibits remarkable representation learning capabilities. However, end-to-end pre-training methods based on a hybrid architecture of CNN and Transformer, which can learn strong local and global representations simultaneously, have received limited attention. To address this issue, we propose a generative pre-training strategy called Hybrid Sparse masKing (HySparK) based on masked image modeling and apply it to large-scale pre-training on medical images. First, we perform a bottom-up 3D hybrid masking strategy on the encoder to keep the masking consistent: we utilize sparse convolution for the top CNNs and encode only the unmasked patches for the bottom vision Transformers. Second, we employ a simple hierarchical decoder with skip-connections to achieve dense multi-scale feature reconstruction. Third, we implement our pre-training method on a collection of multiple large-scale 3D medical imaging datasets. Extensive experiments indicate that our proposed pre-training strategy demonstrates robust transferability in supervised downstream tasks and sheds light on HySparK's promising prospects. The code is available at https://github.com/FengheTan9/HySparK

What problem does this paper attempt to address?

The paper asks how to pre-train a hybrid CNN and Transformer architecture end-to-end with self-supervised learning on large-scale unlabeled medical images, so that the model learns strong local and global representations simultaneously and transfers better to downstream tasks. Existing generative self-supervised methods (such as MAE) mainly target a single architecture (such as a pure Transformer) and ignore the learning potential of the hybrid architecture (CNN + Transformer). In addition, mask consistency and multi-scale feature learning in the hybrid architecture have not yet been effectively solved. Left unaddressed, these problems lead to "pixel distribution shift" and "mask pattern disappearance", which degrade the model's performance and transfer ability. The paper therefore proposes HySparK (Hybrid Sparse masKing), a generative pre-training strategy based on masked image modeling that aims to solve these problems and make full use of the advantages of the hybrid architecture. The main contributions of HySparK include:

1. **Proposing a generative self-supervised learning method**, which realizes end-to-end pre-training of the hybrid architecture for the first time and integrates local and global representations simultaneously.
2. **Designing a bottom-up 3D hybrid masking strategy**, which ensures mask consistency between the two architectures and avoids the data distribution shift problem.
3. **Utilizing sparse convolution and skip connections**: sparse convolution maintains mask consistency in the encoding stage, and skip connections realize multi-scale feature reconstruction in the decoding stage, enhancing the transfer ability of the model.

With these innovations, the paper reports that HySparK significantly outperforms existing methods on multiple downstream segmentation tasks, especially small-organ segmentation, which supports its strong multi-scale representation ability and transfer ability.
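To make the idea of mask consistency across stages concrete, here is a minimal sketch of one way a bottom-up 3D mask could be built: a random mask is drawn on the coarsest patch grid (the resolution the Transformer sees) and then copied to the finer CNN-stage grids by nearest-neighbour upsampling, so every stage hides the same regions. This is an illustration under stated assumptions, not the authors' implementation: the function names, the grid size, the mask ratio, and the upsampling factors are all hypothetical.

```python
import random
from itertools import product


def coarse_mask(grid, mask_ratio, seed=0):
    """Randomly mark a fraction of coarse 3D patches as masked (True = masked)."""
    rng = random.Random(seed)
    cells = list(product(range(grid[0]), range(grid[1]), range(grid[2])))
    n_masked = round(mask_ratio * len(cells))
    masked = set(rng.sample(cells, n_masked))
    return {cell: (cell in masked) for cell in cells}


def upsample_mask(mask, grid, factor):
    """Nearest-neighbour upsampling: each coarse cell covers factor**3 fine cells."""
    fine_grid = tuple(g * factor for g in grid)
    return {
        (z, y, x): mask[(z // factor, y // factor, x // factor)]
        for z, y, x in product(*(range(g) for g in fine_grid))
    }


def hybrid_masks(grid, mask_ratio, factors=(1, 2, 4), seed=0):
    """Build one mask per stage, all derived from the same coarse mask.

    `factors` are assumed (hypothetical) resolution ratios between the
    coarsest stage and each finer stage.
    """
    base = coarse_mask(grid, mask_ratio, seed)
    return {f: upsample_mask(base, grid, f) for f in factors}


# Example: a 2x2x2 coarse grid with half of the patches masked.
masks = hybrid_masks((2, 2, 2), 0.5)
```

Because every finer mask is a pure copy of the coarse one, a voxel is hidden at a fine stage exactly when its enclosing coarse patch is hidden. In a real encoder, these per-stage masks would be the ones used to zero out masked positions around each convolution and to select the unmasked tokens passed to the Transformer.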

HySparK: Hybrid Sparse Masking for Large Scale Medical Image Pre-Training

Sparse and Hierarchical Masked Modeling for Convolutional Representation Learning

HybridMIM: A Hybrid Masked Image Modeling Framework for 3D Medical Image Segmentation

Self-supervised pre-training with contrastive and masked autoencoder methods for dealing with small datasets in deep learning for medical imaging

Representation Recovering for Self-Supervised Pre-training on Medical Images.

Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling

GMIM: Self-supervised pre-training for 3D medical image segmentation with adaptive and hierarchical masked image modeling

Keypoint-Augmented Self-Supervised Learning for Medical Image Segmentation with Limited Annotation

MiM: Mask in Mask Self-Supervised Pre-Training for 3D Medical Image Analysis

Multi-level Asymmetric Contrastive Learning for Volumetric Medical Image Segmentation Pre-training

Pre-training on High Definition X-ray Images: An Experimental Study

Masked Feature Modeling for Generative Self-Supervised Representation Learning of High-Resolution Remote Sensing Images

AnatoMask: Enhancing Medical Image Segmentation with Reconstruction-guided Self-masking

A Unified Visual Information Preservation Framework for Self-supervised Pre-training in Medical Image Analysis

MCPA: Multi-scale Cross Perceptron Attention Network for 2D Medical Image Segmentation

MambaMIM: Pre-training Mamba with State Space Token-interpolation

Large-Scale 3D Medical Image Pre-training with Geometric Context Priors

MFHARFNet: Multi-branch feature hybrid and adaptive receptive field network for image segmentation

Advancing Volumetric Medical Image Segmentation via Global-Local Masked Autoencoder

Multi-modal Masked Siamese Network Improves Chest X-Ray Representation Learning

Multi-scale Transformer Network with Edge-aware Pre-training for Cross-Modality MR Image Synthesis