Abstract: Automatic medical image segmentation has made great progress thanks to the development of deep learning. However, most existing methods are based on convolutional neural networks (CNNs), which fail to build long-range dependencies and global context connections because of the limited receptive field of the convolution operation. Inspired by the success of Transformers in modeling long-range contextual information, some researchers have expended considerable effort in designing robust Transformer-based variants of U-Net. Moreover, the patch division used in vision transformers usually ignores the pixel-level intrinsic structural features inside each patch. To alleviate these problems, we propose a novel deep medical image segmentation framework called Dual Swin Transformer U-Net (DS-TransUNet), which might be the first attempt to incorporate the advantages of the hierarchical Swin Transformer into both the encoder and the decoder of the standard U-shaped architecture to enhance the semantic segmentation quality of varying medical images. Unlike many prior Transformer-based solutions, the proposed DS-TransUNet first adopts dual-scale encoder subnetworks based on Swin Transformer to extract coarse and fine-grained feature representations at different semantic scales. As the core component of DS-TransUNet, a well-designed Transformer Interactive Fusion (TIF) module is proposed to effectively establish global dependencies between features of different scales through the self-attention mechanism. Furthermore, we introduce the Swin Transformer block into the decoder to further explore long-range contextual information during the up-sampling process. Extensive experiments across four typical medical image segmentation tasks demonstrate the effectiveness of DS-TransUNet and show that our approach significantly outperforms state-of-the-art methods.
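
The idea of relating features of different scales through self-attention can be illustrated with a small sketch. The abstract does not specify the internals of the TIF module, so everything below is an assumption made for illustration only: the function names (`fuse_scales`, `mean_token`, `self_attention`) are hypothetical, the coarse branch is summarized by simple average pooling, attention is single-head with identity projections (no learned weights), and both branches are assumed to share the same channel dimension.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Simplification (assumption): Q = K = V = tokens, i.e. no learned
    projection matrices, so only the mixing pattern is shown.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    return out


def mean_token(tokens):
    """Average-pool a token sequence into one summary token."""
    n = len(tokens)
    return [sum(t[d] for t in tokens) / n for d in range(len(tokens[0]))]


def fuse_scales(fine_tokens, coarse_tokens):
    """Hypothetical cross-scale fusion step (not the paper's exact module).

    The coarse branch is pooled into one summary token, prepended to the
    fine-scale tokens, and the combined sequence goes through
    self-attention so every fine token can attend to the coarse summary.
    The summary token is dropped from the output.
    """
    summary = mean_token(coarse_tokens)
    mixed = self_attention([summary] + fine_tokens)
    return mixed[1:]


# Toy input: four fine-scale tokens and one coarse-scale token, 2 channels each.
fine = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
coarse = [[2.0, 2.0]]
fused = fuse_scales(fine, coarse)
print(len(fused), len(fused[0]))  # prints "4 2"
```

The fused output keeps the shape of the fine-scale sequence, while each token now carries some information from the other scale. A real implementation would add learned projections, multiple heads, normalization, and a symmetric pass in the other direction.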

FDR-TransUNet: A novel encoder-decoder architecture with vision transformer for improved medical image segmentation

FTUNet: A Feature-Enhanced Network for Medical Image Segmentation Based on the Combination of U-Shaped Network and Vision Transformer

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

ViT-UperNet: a hybrid vision transformer with unified-perceptual-parsing network for medical image segmentation

Sfe-Transunet: A Transformer-Based U-Net With Skipped Features Enhancer For Medical Image Segmentation

LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation

A novel full-convolution UNet-transformer for medical image segmentation

UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation

FCTrans UNet: A Hybrid CNN and Transformer Model for Medical Image Segmentations

DA-TransUNet: Integrating Spatial and Channel Dual Attention with Transformer U-Net for Medical Image Segmentation

3D TransUNet: Advancing Medical Image Segmentation through Vision Transformers

A Novel Deep Learning Model for Medical Image Segmentation with Convolutional Neural Network and Transformer

Focal-UNet: UNet-like Focal Modulation for Medical Image Segmentation

TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation

Context-aware and local-aware fusion with transformer for medical image segmentation

UNETR: Transformers for 3D Medical Image Segmentation

TSCA-Net: Transformer based spatial-channel attention segmentation network for medical images

DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation

ConvWin-UNet: UNet-like hierarchical vision Transformer combined with convolution for medical image segmentation

DECTNet: Dual Encoder Network combined convolution and Transformer architecture for medical image segmentation

MCV-UNet: a modified convolution & transformer hybrid encoder-decoder network with multi-scale information fusion for ultrasound image semantic segmentation