SwinUNETR-V2: Stronger Swin Transformers with Stagewise Convolutions for 3D Medical Image Segmentation
Yufan He, Vishwesh Nath, Dong Yang, Yucheng Tang, Andriy Myronenko, Daguang Xu
DOI: https://doi.org/10.1007/978-3-031-43901-8_40
2023-01-01
Abstract: Transformers for medical image segmentation have attracted broad interest. Unlike convolutional neural networks (CNNs), transformers use self-attention, which does not have a strong inductive bias. This gives transformers the ability to learn long-range dependencies and a stronger modeling capacity. Although transformers such as SwinUNETR achieve state-of-the-art (SOTA) results on some benchmarks, the lack of inductive bias makes them harder to train, more demanding of training data, and sensitive to training recipes. In many clinical scenarios and challenges, transformers still perform worse than SOTA CNNs such as nnUNet. A transformer backbone and corresponding training recipe that achieve top performance across different medical image segmentation scenarios still need to be developed. In this paper, we enhance SwinUNETR with convolutions, which results in a surprisingly stronger backbone, SwinUNETR-V2, for 3D medical image segmentation. It achieves top performance on a variety of benchmarks of different sizes and modalities, including the Whole abdominal ORgan Dataset (WORD), the MICCAI FLARE2021 dataset, the MSD pancreas dataset, the MSD prostate dataset, and the MSD lung cancer dataset, all using the same training recipe (https://github.com/Project-MONAI/researchcontributions/tree/main/SwinUNETR/BTCV; our training recipe is the same as that of SwinUNETR) with minimal changes across tasks.