Abstract: Background and Objectives: Transformers using self-attention mechanisms have recently advanced medical imaging by modeling long-range semantic dependencies, but they lack the ability of CNNs to capture local spatial details. This study introduced a novel segmentation network built on a mixed CNN-Transformer (MixFormer) feature extraction backbone to enhance medical image segmentation. Method: The MixFormer network integrates global information from the Transformer and local information from the CNN during the downsampling process. To capture inter-scale context, we introduced a Multi-scale Spatial-aware Fusion (MSAF) module, which enables effective interaction between coarse and fine feature representations. Additionally, we proposed a Mixed Multi-branch Dilated Attention (MMDA) module to bridge the semantic gap between the encoding and decoding stages while emphasizing specific regions. Finally, we implemented a CNN-based upsampling approach to recover low-level features, substantially improving segmentation accuracy. Results: Experiments on widely used medical image datasets demonstrated the superior performance of MixFormer. On the Synapse dataset, our approach achieved a mean Dice Similarity Coefficient (DSC) of 82.64% and a mean Hausdorff Distance (HD) of 12.67 mm. On the ACDC dataset, the DSC was 91.01%. On the ISIC 2018 dataset, the model achieved a mean Intersection over Union (mIoU) of 0.841, Accuracy of 0.958, Precision of 0.910, Recall of 0.934, and an F1 score of 0.913. On the Kvasir-SEG dataset, we recorded a mean Dice of 0.9247, mIoU of 0.8615, Precision of 0.9181, and Recall of 0.9463. On the CVC-ClinicDB dataset, the results were a mean Dice of 0.9441, mIoU of 0.8922, Precision of 0.9437, and Recall of 0.9458. Conclusion: These findings underscore the superior segmentation performance of MixFormer compared with most mainstream segmentation networks, including CNN-based and other Transformer-based architectures.
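The overlap metrics reported above (Dice, IoU, Precision, Recall, Accuracy) can be illustrated with a minimal sketch for a single pair of binary masks. This is not the authors' evaluation code: the function names `confusion_counts` and `overlap_metrics`, the flat-list mask representation, and the `eps` smoothing term are assumptions made for illustration. The abstract does not state how scores are averaged over classes or images, and the Hausdorff Distance is omitted here.

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN, TN for two equal-length flat binary masks (0/1 values)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    tn = sum(1 for p, t in zip(pred, target) if not p and not t)
    return tp, fp, fn, tn


def overlap_metrics(pred, target, eps=1e-8):
    """Standard overlap metrics for one binary mask pair.

    eps is a hypothetical smoothing term that avoids division by zero
    when both masks are empty; it is not taken from the paper.
    """
    tp, fp, fn, tn = confusion_counts(pred, target)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn + eps),   # Dice similarity coefficient
        "iou": tp / (tp + fp + fn + eps),            # intersection over union
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
        "accuracy": (tp + tn) / (tp + fp + fn + tn + eps),
    }


# Toy example: 4-pixel masks with one TP, one FP, one FN, and one TN.
scores = overlap_metrics([1, 1, 0, 0], [1, 0, 1, 0])
print(scores)
```

For a single binary mask, Dice is algebraically identical to the F1 score, since both equal 2TP / (2TP + FP + FN). Dataset-level figures can still differ between the two depending on how per-image scores are averaged.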

SF-SegFormer: Stepped-Fusion Segmentation Transformer for Brain Tissue Image Via Inter-Group Correlation and Enhanced Multi-layer Perceptron

mmFormer: Multimodal Medical Transformer for Incomplete Multimodal Learning of Brain Tumor Segmentation

SegCoFusion: an Integrative Multimodal Volumetric Segmentation Cooperating with Fusion Pipeline to Enhance Lesion Awareness

MixFormer: a Mixed CNN-Transformer Backbone for Medical Image Segmentation

STF-Net: sparsification transformer coding guided network for subcortical brain structure segmentation

A multi-path adaptive fusion network for multimodal brain tumor segmentation

3D Brainformer: 3D Fusion Transformer for Brain Tumor Segmentation

TransSea: Hybrid CNN-Transformer with Semantic Awareness for 3D Brain Tumor Segmentation

A conflict-free multi-modal fusion network with spatial reinforcement transformers for brain tumor segmentation

Dual encoder network with transformer-CNN for multi-organ segmentation

ConvFormer: Combining CNN and Transformer for Medical Image Segmentation

Bi-Fusion of Structure and Deformation at Multi-Scale for Joint Segmentation and Registration

Hybrid-Fusion Transformer for Multisequence MRI

DCFNet: An Effective Dual-Branch Cross-Attention Fusion Network for Medical Image Segmentation

Sub-pixel multi-scale fusion network for medical image segmentation

SACNet: A Spatially Adaptive Convolution Network for 2D Multi-organ Medical Segmentation

nnSegNeXt: A 3D Convolutional Network for Brain Tissue Segmentation Based on Quality Evaluation

FD-FCN: 3D Fully Dense and Fully Convolutional Network for Semantic Segmentation of Brain Anatomy

Feature fusion and latent feature learning guided brain tumor segmentation and missing modality recovery network

MM-BiFPN: Multi-Modality Fusion Network With Bi-FPN for MRI Brain Tumor Segmentation