Abstract: Background and Objectives: Transformers using self-attention mechanisms have recently advanced medical imaging by modeling long-range semantic dependencies, though they lack CNNs’ ability to capture local spatial details. This study introduced a novel segmentation network built on a mixed CNN-Transformer (MixFormer) feature extraction backbone to enhance medical image segmentation. Method: The MixFormer network integrates global information from the Transformer and local information from the CNN during the downsampling process. To capture the inter-scale perspective, we introduced a Multi-scale Spatial-aware Fusion (MSAF) module, which enables interaction between coarse and fine feature representations. Additionally, we proposed a Mixed Multi-branch Dilated Attention (MMDA) module to bridge the semantic gap between the encoding and decoding stages while emphasizing specific regions. Lastly, we implemented a CNN-based upsampling approach to recover low-level features, substantially improving segmentation accuracy. Results: Experimental validation on widely used medical image datasets demonstrated the superior performance of MixFormer. On the Synapse dataset, our approach achieved a mean Dice Similarity Coefficient (DSC) of 82.64% and a mean Hausdorff Distance (HD) of 12.67 mm. On the ACDC dataset, the DSC was 91.01%. On the ISIC 2018 dataset, the model achieved a mean Intersection over Union (mIOU) of 0.841, Accuracy of 0.958, Precision of 0.910, Recall of 0.934, and an F1 score of 0.913. On the Kvasir-SEG dataset, we recorded a mean Dice of 0.9247, mIOU of 0.8615, Precision of 0.9181, and Recall of 0.9463. On the CVC-ClinicDB dataset, the results were a mean Dice of 0.9441, mIOU of 0.8922, Precision of 0.9437, and Recall of 0.9458. Conclusion: These findings underscore the superior segmentation performance of MixFormer compared with most mainstream segmentation networks, including CNN-based and other Transformer-based architectures.
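For readers less familiar with the overlap metrics reported in the Results, the following is a minimal sketch of their standard textbook definitions for a single pair of binary masks. It is not the authors' evaluation code: the function names `confusion_counts` and `overlap_metrics`, the nested-list mask format, and the `eps` smoothing term are assumptions made for illustration. The abstract does not state how scores are averaged across classes or images, and the Hausdorff Distance is omitted here.

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN, TN over two equally sized binary masks.

    Masks are nested lists of 0/1 values (an assumed format for this sketch).
    """
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def overlap_metrics(pred, target, eps=1e-8):
    """Standard pixel-wise metrics for one predicted mask and one reference mask.

    eps (hypothetical smoothing term) avoids division by zero on empty masks.
    """
    tp, fp, fn, tn = confusion_counts(pred, target)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn + eps),      # DSC, equal to F1 for binary masks
        "iou": tp / (tp + fp + fn + eps),               # Intersection over Union
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
        "accuracy": (tp + tn) / (tp + fp + fn + tn + eps),
    }


# Toy 2x3 example: 2 true positives, 1 false positive, 1 false negative.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
metrics = overlap_metrics(pred, target)
```

On a single mask pair, Dice and IoU are linked by Dice = 2·IoU / (1 + IoU); dataset-level means need not satisfy this identity exactly, because each metric is averaged separately.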

MedFCT: A Frequency Domain Joint CNN-Transformer Network for Semi-supervised Medical Image Segmentation

Combinatorial CNN-Transformer Learning with Manifold Constraints for Semi-supervised Medical Image Segmentation

Semi-Supervised Convolutional Vision Transformer with Bi-Level Uncertainty Estimation for Medical Image Segmentation

SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation

MixFormer: a Mixed CNN-Transformer Backbone for Medical Image Segmentation

SEMI-CONTRANS: Semi-Supervised Medical Image Segmentation via Multi-Scale Feature Fusion and Cross Teaching of CNN and Transformer

Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students

Semi-Supervised Medical Image Segmentation Based on Deep Consistent Collaborative Learning

CFATransUnet: Channel-wise cross fusion attention and transformer for 2D medical image segmentation

MSCT-UNET: multi-scale contrastive transformer within U-shaped network for medical image segmentation

DCFNet: An Effective Dual-Branch Cross-Attention Fusion Network for Medical Image Segmentation

TFCNs: A CNN-Transformer Hybrid Network for Medical Image Segmentation

Sub-pixel multi-scale fusion network for medical image segmentation

FCTrans UNet: A Hybrid CNN and Transformer Model for Medical Image Segmentations

FCSU-Net: A novel full-scale Cross-dimension Self-attention U-Net with collaborative fusion of multi-scale feature for medical image segmentation

TC-Net: A joint learning framework based on CNN and vision transformer for multi-lesion medical images segmentation

Multi-dimensional Fusion and Consistency for Semi-supervised Medical Image Segmentation

UCTNet: Uncertainty-guided CNN-Transformer hybrid networks for medical image segmentation

MFH‐Net: A Hybrid CNN‐Transformer Network Based Multi‐Scale Fusion for Medical Image Segmentation

HTC-Net: A hybrid CNN-transformer framework for medical image segmentation

CASF-Net: Cross-attention and Cross-scale Fusion Network for Medical Image Segmentation