Abstract: Background and Objective: The Transformer, notable for its ability to model global context, has been used to remedy the shortcomings of convolutional neural networks (CNNs) and to break their dominance in medical image segmentation. However, the self-attention module is inefficient in both memory and computation, so many methods have to build their Transformer branch on heavily downsampled feature maps or adopt tokenized image patches to fit their model onto accessible GPUs. This patch-wise operation restricts the network from extracting pixel-level intrinsic structures or dependencies inside each patch, hurting the performance of pixel-level classification tasks. Methods: To tackle these issues, we propose a memory- and computation-efficient self-attention module that enables reasoning on relatively high-resolution features, promoting the efficiency of learning global information while effectively grasping fine spatial details. Furthermore, we design a novel Multi-Branch Transformer (MultiTrans) architecture to provide hierarchical features for handling objects with variable shapes and sizes in medical images. By building four parallel Transformer branches on different levels of the CNN, our hybrid network aggregates both multi-scale global contexts and multi-scale local features. Results: MultiTrans achieves the highest segmentation accuracy on three medical image datasets with different modalities: Synapse, ACDC and M&Ms. Compared to the Standard Self-Attention (SSA), the proposed Efficient Self-Attention (ESA) largely reduces the training memory and computational complexity while even slightly improving the accuracy. Specifically, the training memory cost, FLOPs and Params of our ESA are 18.77%, 20.68% and 74.07% of those of the SSA, respectively. Conclusions: Experiments on three medical image datasets demonstrate the generality and robustness of the designed network. The ablation study shows the efficiency and effectiveness of our proposed ESA.
Code is available at: https://github.com/Yanhua-Zhang/MultiTrans-extension .
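
To make the memory argument in the abstract concrete, the following minimal sketch estimates how large a dense N x N attention map becomes as the feature-map resolution grows, and converts the reported ESA-to-SSA ratios into relative savings. It is an illustration of standard self-attention cost only, not the ESA module itself; the 224 x 224 input size, the single head, the float32 storage, and the helper names `attention_map_bytes` and `relative_saving` are assumptions introduced here.

```python
def attention_map_bytes(height, width, heads=1, bytes_per_value=4):
    """Bytes needed to store one dense N x N attention map per head, N = height * width."""
    tokens = height * width
    return heads * tokens * tokens * bytes_per_value


def relative_saving(ratio_percent):
    """Convert 'X% of the baseline' into 'percent saved versus the baseline'."""
    return round(100.0 - ratio_percent, 2)


if __name__ == "__main__":
    input_size = 224  # assumed input resolution, not taken from the abstract
    for stride in (4, 8, 16, 32):  # assumed downsampling factors of CNN levels
        side = input_size // stride
        mib = attention_map_bytes(side, side) / 2**20
        print(f"stride {stride:2d}: {side * side:5d} tokens, {mib:8.3f} MiB per attention map")

    # Ratios reported in the abstract: ESA cost as a percentage of SSA cost.
    for name, ratio in (("training memory", 18.77), ("FLOPs", 20.68), ("Params", 74.07)):
        print(f"{name}: {relative_saving(ratio)}% lower than SSA")
```

Because the attention map grows with the square of the token count, moving from a stride-16 to a stride-4 feature map multiplies its size by 256 under these assumptions, which is why dense self-attention is usually applied only after heavy downsampling.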

MS-Twins: Multi-Scale Deep Self-Attention Networks for Medical Image Segmentation

MixFormer: a Mixed CNN-Transformer Backbone for Medical Image Segmentation

Mixed Transformer U-Net for Medical Image Segmentation

MS-TCNet: An effective Transformer–CNN combined network using multi-scale feature learning for 3D medical image segmentation

MSCT-UNET: multi-scale contrastive transformer within U-shaped network for medical image segmentation

DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation

MSR-UNet: enhancing multi-scale and long-range dependencies in medical image segmentation

MultiTrans: Multi-branch transformer network for medical image segmentation

MSMHSA-DeepLab V3+: An Effective Multi-Scale, Multi-Head Self-Attention Network for Dual-Modality Cardiac Medical Image Segmentation

TSCA-Net: Transformer based spatial-channel attention segmentation network for medical images

A Novel Deep Learning Model for Medical Image Segmentation with Convolutional Neural Network and Transformer

SCANeXt: Enhancing 3D Medical Image Segmentation with Dual Attention Network and Depth-Wise Convolution

UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation

HMDA: A Hybrid Model with Multi-scale Deformable Attention for Medical Image Segmentation

DMSA-UNet: Dual Multi-Scale Attention makes UNet more strong for medical image segmentation

Sfe-Transunet: A Transformer-Based U-Net With Skipped Features Enhancer For Medical Image Segmentation

Isc-Transunet: Medical Image Segmentation Network Based On The Integration Of Self-Attention And Convolution

STA-Former: enhancing medical image segmentation with Shrinkage Triplet Attention in a hybrid CNN-Transformer model

ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation

MSA$^2$Net: Multi-scale Adaptive Attention-guided Network for Medical Image Segmentation