Abstract:
Purpose: Vision Transformers have recently achieved performance competitive with CNNs thanks to their strong capability for learning global representations. However, two major challenges arise when applying them to 3D image segmentation: i) because 3D medical images are large, comprehensive global information is hard to capture without enormous computational cost; ii) the insufficient local inductive bias of Transformers impairs their ability to segment detailed features such as ambiguous and subtly defined boundaries. Both challenges must be overcome before the Vision Transformer mechanism can be applied effectively to medical image segmentation.
Methods: We propose a hybrid paradigm, called Variable-Shape Mixed Transformer (VSmTrans), that integrates self-attention and convolution, combining the ability of self-attention to freely learn complex relationships with the local prior knowledge provided by convolution. Specifically, we design a Variable-Shape self-attention mechanism that rapidly expands the receptive field without extra computing cost and achieves a good trade-off between global awareness and local details. In addition, a parallel convolution branch introduces a strong local inductive bias that improves the ability to capture details. A pair of learnable parameters automatically adjusts the relative importance of the two branches. Extensive experiments were conducted on two public medical image datasets with different modalities: the AMOS CT dataset and the BraTS2021 MRI dataset.
Results: Our method achieves the best average Dice scores of 88.3% and 89.7% on these two datasets, respectively, outperforming the previous state-of-the-art Swin Transformer-based and CNN-based architectures. A series of ablation experiments were also conducted to verify the efficiency of the proposed hybrid mechanism and its components and to explore the effect of the key parameters in VSmTrans.
Conclusions: The proposed hybrid Transformer-based backbone network for 3D medical image segmentation tightly integrates self-attention and convolution to exploit the advantages of both paradigms. The experimental results demonstrate the superiority of our method over other state-of-the-art methods, and the hybrid paradigm appears to be the most appropriate for the medical image segmentation field. The ablation experiments also demonstrate that the proposed hybrid mechanism effectively balances large receptive fields with local inductive biases, yielding highly accurate segmentation results, especially in capturing details. Our code is available at https://github.com/qingze-bai/VSmTrans.
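
For readers unfamiliar with the reported metric, the following is a minimal sketch of how a per-label Dice score and its average over labels can be computed. It is not taken from the VSmTrans repository: the function names `dice_score` and `mean_dice`, the flat-list input format, and the convention of returning 1.0 when a label is absent from both prediction and ground truth are all assumptions made for illustration, and the paper's own evaluation protocol may differ.

```python
def dice_score(pred, target, label):
    """Dice coefficient for one label over two flat sequences of voxel labels.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    pred_mask = [p == label for p in pred]
    target_mask = [t == label for t in target]
    # Voxels assigned to `label` in both prediction and ground truth.
    intersection = sum(p and t for p, t in zip(pred_mask, target_mask))
    total = sum(pred_mask) + sum(target_mask)
    if total == 0:
        # Assumed convention: a label absent from both volumes counts as a perfect match.
        return 1.0
    return 2.0 * intersection / total


def mean_dice(pred, target, labels):
    """Unweighted average of the per-label Dice scores."""
    return sum(dice_score(pred, target, label) for label in labels) / len(labels)


# Toy example: 8 voxels, background 0 and two foreground labels.
pred = [0, 1, 1, 2, 2, 2, 0, 0]
target = [0, 1, 1, 1, 2, 2, 2, 0]
print(dice_score(pred, target, 1))
print(mean_dice(pred, target, [1, 2]))
```

In this toy example, label 1 overlaps on 2 voxels out of 2 predicted and 3 true voxels, giving 2·2/(2+3) = 0.8; a real 3D volume would first be flattened or processed with an array library.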

MAGIC: Rethinking Dynamic Convolution Design for Medical Image Segmentation

MixFormer: a Mixed CNN-Transformer Backbone for Medical Image Segmentation

D2-MLP: Dynamic Decomposed MLP Mixer for Medical Image Segmentation

HC-Mamba: Vision MAMBA with Hybrid Convolutional Techniques for Medical Image Segmentation

CMUNeXt: An Efficient Medical Image Segmentation Network based on Large Kernel and Skip Fusion

MDC-RHT: Multi-Modal Medical Image Fusion via Multi-Dimensional Dynamic Convolution and Residual Hybrid Transformer

ConvMedSegNet: A multi-receptive field depthwise convolutional neural network for medical image segmentation

MambaClinix: Hierarchical Gated Convolution and Mamba-Based U-Net for Enhanced 3D Medical Image Segmentation

CiT-Net: Convolutional Neural Networks Hand in Hand with Vision Transformers for Medical Image Segmentation

Dynamic Group Convolution for Accelerating Convolutional Neural Networks

MAGnitude-Image-to-Complex K-space (MAGIC-K) Net: A Data Augmentation Network for Image Reconstruction

Scaling Up 3D Kernels with Bayesian Frequency Re-parameterization for Medical Image Segmentation

TEC-Net: Vision Transformer Embrace Convolutional Neural Networks for Medical Image Segmentation

How to Win a Dashed Line Detection Contest

CascadeMedSeg: integrating pyramid vision transformer with multi-scale fusion for precise medical image segmentation

D-Net: Dynamic Large Kernel with Dynamic Feature Fusion for Volumetric Medical Image Segmentation

MSA$^2$Net: Multi-scale Adaptive Attention-guided Network for Medical Image Segmentation

MagicNet: Semi-Supervised Multi-Organ Segmentation via Magic-Cube Partition and Recovery

VSmTrans: A Hybrid Paradigm Integrating Self-attention and Convolution for 3D Medical Image Segmentation

PIS-Net: Efficient Medical Image Segmentation Network with Multivariate Downsampling for Point-of-Care

MpMsCFMA-Net: Multi-path Multi-scale Context Feature Mixup and Aggregation Network for medical image segmentation