A More Design-Flexible Medical Transformer for Volumetric Image Segmentation.

Xin You, Yun Gu, Junjun He, Hui Sun, Jie Yang
DOI: https://doi.org/10.1007/978-3-031-21014-3_7
2022-01-01
Abstract: UNet-based encoder-decoder networks have dominated volumetric medical image segmentation over the past several years. Many improvements focus on the design of encoders, decoders, and skip connections. Because of the intrinsic properties of convolutional kernels, convolution-based encoders suffer from limited receptive fields. To address this, recently proposed Transformer-based networks use the self-attention mechanism to build long-range dependencies. However, they rely heavily on weights pretrained on natural images. In our work, we find that the performance of ViT-based (Vision Transformer) models does not decrease significantly without pretrained weights, even when training data are limited. We therefore flexibly design a 3D medical Transformer for image segmentation and train it from scratch. Specifically, we introduce Multi-Scale Dynamic Positional Embeddings into ViT to dynamically acquire the positional information of each 3D patch. The positional bias can also enrich attention diversity. Moreover, we give detailed reasons for choosing a convolution-based decoder over recently proposed Swin Transformer blocks, based on preliminary experiments on the decoder design. Finally, we propose the Context Enhancement Module to refine skipped features by merging low- and high-frequency information via a combination of convolutional kernels and self-attention modules. Experiments show that, when trained from scratch, our model is comparable to nnUNet in segmentation performance on the Medical Segmentation Decathlon (Liver) and VerSe'20 datasets.
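To make the idea of attention over 3D patches with a positional bias concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the Multi-Scale Dynamic Positional Embeddings are computed, so the random `bias` matrix here is only a stand-in for a learned positional term, and the helper names (`patchify_3d`, `attention_with_bias`), patch size, and projection width are all illustrative assumptions.

```python
import numpy as np

def patchify_3d(volume, patch):
    """Split a (D, H, W) volume into non-overlapping cubic patches (one token each)."""
    d, h, w = volume.shape
    p = patch
    assert d % p == 0 and h % p == 0 and w % p == 0, "volume must be divisible by patch size"
    x = volume.reshape(d // p, p, h // p, p, w // p, p)
    x = x.transpose(0, 2, 4, 1, 3, 5)      # (grid_d, grid_h, grid_w, p, p, p)
    return x.reshape(-1, p * p * p)        # (num_patches, p**3)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_bias(tokens, w_q, w_k, w_v, bias):
    """Single-head self-attention with an additive bias on the attention scores."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 8, 8))    # toy volume, not real CT data
tokens = patchify_3d(volume, patch=4)      # 2 x 2 x 2 = 8 tokens of length 64
dim = 16                                   # hypothetical projection width
w_q, w_k, w_v = (rng.standard_normal((tokens.shape[1], dim)) * 0.1 for _ in range(3))
# Assumption: a random matrix stands in for a learned, position-dependent bias.
bias = rng.standard_normal((tokens.shape[0], tokens.shape[0])) * 0.1
out, weights = attention_with_bias(tokens, w_q, w_k, w_v, bias)
```

Because every token attends to every other token, each output row can draw on the whole volume, which is the long-range dependency that convolutional encoders with small kernels lack. The additive bias shifts the attention scores per pair of positions, so two patches with similar content can still receive different attention weights.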