HySparK: Hybrid Sparse Masking for Large Scale Medical Image Pre-Training

Fenghe Tang, Ronghao Xu, Qingsong Yao, Xueming Fu, Quan Quan, Heqin Zhu, Zaiyi Liu, S. Kevin Zhou
2024-08-12
Abstract: The generative self-supervised learning strategy exhibits remarkable representation learning capabilities. However, there is limited attention to end-to-end pre-training methods based on a hybrid architecture of CNN and Transformer, which can learn strong local and global representations simultaneously. To address this issue, we propose a generative pre-training strategy called Hybrid Sparse masKing (HySparK) based on masked image modeling and apply it to large-scale pre-training on medical images. First, we perform a bottom-up 3D hybrid masking strategy on the encoder to keep the masking consistent. Then we utilize sparse convolution for the top CNNs and encode unmasked patches for the bottom vision Transformers. Second, we employ a simple hierarchical decoder with skip-connections to achieve dense multi-scale feature reconstruction. Third, we implement our pre-training method on a collection of multiple large-scale 3D medical imaging datasets. Extensive experiments indicate that our proposed pre-training strategy demonstrates robust transferability in supervised downstream tasks and sheds light on HySparK's promising prospects. The code is available at https://github.com/FengheTan9/HySparK
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how to pre-train a hybrid architecture of convolutional neural networks (CNN) and Transformer end-to-end on large-scale unlabeled medical images using self-supervised learning, so that the model learns strong local and global representations simultaneously and transfers better to downstream tasks.

Existing generative self-supervised learning methods (such as MAE) mainly focus on a single architecture (such as a pure Transformer) and overlook the learning potential of the hybrid architecture (CNN + Transformer). In addition, mask consistency and multi-scale feature learning in the hybrid architecture have not yet been effectively solved. These problems lead to "pixel distribution shift" and "mask pattern disappearance", which degrade the performance and transfer ability of the model.

To address this, the paper proposes HySparK (Hybrid Sparse masKing), a generative pre-training strategy based on masked image modeling that aims to make full use of the advantages of the hybrid architecture. The main contributions of HySparK are:

1. **Proposing a generative self-supervised learning method** that, according to the paper, realizes end-to-end pre-training of the hybrid architecture for the first time and integrates local and global representations simultaneously.
2. **Designing a bottom-up 3D hybrid masking strategy** that ensures mask consistency between the different architectures and avoids the data distribution shift problem.
3. **Utilizing sparse convolution and skip connections**: sparse convolution maintains mask consistency in the encoding stage, and skip connections realize multi-scale feature reconstruction in the decoding stage, enhancing the transfer ability of the model.

With these innovations, the paper reports that HySparK significantly outperforms existing methods on multiple downstream segmentation tasks, especially small-organ segmentation, which it takes as evidence of strong multi-scale representation and transfer ability.
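
The bottom-up masking and mask-consistency ideas summarized here can be illustrated with a small, plain-Python sketch. It is not the authors' implementation: the function names (`sample_coarse_mask`, `upsample_mask`, `masked_conv1d`), the 2×2×2 grid, the 50% mask ratio, and the scale factors are hypothetical choices for illustration, and sparse convolution is approximated here by a dense 1D filter followed by re-masking. The sketch assumes the mask is drawn once on the coarsest (Transformer-token) grid and then expanded to the finer CNN scales, so every scale hides the same region.

```python
import random


def sample_coarse_mask(grid, mask_ratio, seed=0):
    """Pick masked cells on the coarsest grid; returns a set of (z, y, x)."""
    d, h, w = grid
    cells = [(z, y, x) for z in range(d) for y in range(h) for x in range(w)]
    rng = random.Random(seed)
    n_masked = round(mask_ratio * len(cells))
    return set(rng.sample(cells, n_masked))


def upsample_mask(coarse_mask, grid, factor):
    """Expand the coarse mask to a grid `factor` times finer along each axis.

    A fine voxel is masked exactly when its parent coarse cell is masked,
    which keeps the masked region identical at every scale.
    """
    d, h, w = (size * factor for size in grid)
    return {
        (z, y, x)
        for z in range(d)
        for y in range(h)
        for x in range(w)
        if (z // factor, y // factor, x // factor) in coarse_mask
    }


def masked_conv1d(signal, mask, kernel):
    """Dense 1D filter (zero padding) with and without re-masking.

    `mask[i]` is True where position i is masked. Returns (dense, sparse):
    `dense` lets visible values leak into masked positions, while `sparse`
    zeroes masked outputs again so the mask pattern survives the layer.
    """
    visible = [0.0 if m else s for s, m in zip(signal, mask)]
    pad = len(kernel) // 2
    padded = [0.0] * pad + visible + [0.0] * pad
    dense = [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]
    sparse = [0.0 if m else v for v, m in zip(dense, mask)]
    return dense, sparse


# Hypothetical setup: a 2x2x2 token grid, half of the cells masked.
grid = (2, 2, 2)
coarse = sample_coarse_mask(grid, mask_ratio=0.5)
# The same mask viewed at 1x, 2x and 4x finer resolution.
masks = {f: upsample_mask(coarse, grid, f) for f in (1, 2, 4)}

# 1D toy example: positions 2 and 3 are masked.
dense, sparse = masked_conv1d(
    [1.0] * 6, [False, False, True, True, False, False], [1.0, 1.0, 1.0]
)
```

In this toy setup the masked fraction is the same at every scale, and the `dense` output is non-zero at the masked positions while the `sparse` output is not. That is a simplified picture of why an ordinary convolution erodes the mask pattern and a mask-preserving one does not.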