Abstract: Background and Objectives: Transformers using self-attention mechanisms have recently advanced medical imaging by modeling long-range semantic dependencies, though they lack CNNs’ ability to capture local spatial details. This study introduces a novel segmentation network built on a mixed CNN-Transformer (MixFormer) feature extraction backbone to enhance medical image segmentation. Method: The MixFormer network integrates global information from the Transformer and local information from the CNN during the downsampling process. To capture the inter-scale perspective, we introduce a Multi-scale Spatial-aware Fusion (MSAF) module, which enables interaction between coarse and fine feature representations. Additionally, we propose a Mixed Multi-branch Dilated Attention (MMDA) module to bridge the semantic gap between the encoding and decoding stages while emphasizing specific regions. Lastly, we implement a CNN-based upsampling approach to recover low-level features, substantially improving segmentation accuracy. Results: Experimental validation on widely used medical image datasets demonstrated the strong performance of MixFormer. On the Synapse dataset, our approach achieved a mean Dice Similarity Coefficient (DSC) of 82.64% and a mean Hausdorff Distance (HD) of 12.67 mm. On the ACDC dataset, the DSC was 91.01%. On the ISIC 2018 dataset, the model achieved a mean Intersection over Union (mIOU) of 0.841, Accuracy of 0.958, Precision of 0.910, Recall of 0.934, and an F1 score of 0.913. On the Kvasir-SEG dataset, we recorded a mean Dice of 0.9247, mIOU of 0.8615, Precision of 0.9181, and Recall of 0.9463. On the CVC-ClinicDB dataset, the results were a mean Dice of 0.9441, mIOU of 0.8922, Precision of 0.9437, and Recall of 0.9458. Conclusion: These findings indicate that MixFormer outperforms most mainstream segmentation networks, including CNN-based and other Transformer-based architectures.
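
For readers unfamiliar with the overlap metrics reported above (Dice, IoU, Precision, Recall), the following is a minimal sketch of their standard definitions for a single pair of binary masks. It is not the authors' evaluation code: the function name `overlap_metrics`, the flattened 0/1 mask representation, and the convention of returning 1.0 when a denominator is zero are all assumptions made for illustration. The Hausdorff Distance is not covered here.

```python
from typing import Dict, Sequence


def overlap_metrics(pred: Sequence[int], target: Sequence[int]) -> Dict[str, float]:
    """Overlap metrics for two flattened binary masks (1 = foreground, 0 = background).

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of pixels")

    # Count true positives, false positives, and false negatives pixel by pixel.
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)

    def ratio(numerator: int, denominator: int) -> float:
        # Assumed convention: an empty denominator (no foreground at all) scores 1.0.
        return numerator / denominator if denominator else 1.0

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),  # 2|A∩B| / (|A| + |B|)
        "iou": ratio(tp, tp + fp + fn),           # |A∩B| / |A∪B|
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


if __name__ == "__main__":
    predicted = [1, 1, 1, 0, 0, 0]
    reference = [1, 1, 0, 1, 0, 0]
    print(overlap_metrics(predicted, reference))
```

Dataset-level figures such as a mean Dice or mIOU are typically obtained by averaging per-image or per-class values, so they need not satisfy the single-mask identities between these metrics exactly.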

MixFuse: an Iterative Mix-Attention Transformer for Multi-Modal Image Fusion

MixFormer: a Mixed CNN-Transformer Backbone for Medical Image Segmentation

CMFuse: Cross-Modal Features Mixing Via Convolution and MLP for Infrared and Visible Image Fusion

TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning

Multi-Modal Image Fusion Via Deep Laplacian Pyramid Hybrid Network

Trans2Fuse: Empowering image fusion through self-supervised learning and multi-modal transformations via transformer networks

MAMFuse: Multi-modality Image Fusion with Multiscale Attention Mechanism

Image Fusion Transformer

MACTFusion: Lightweight Cross Transformer for Adaptive Multimodal Medical Image Fusion

ITFuse: an Interactive Transformer for Infrared and Visible Image Fusion

TMFIF: Transformer-based Multi-Focus Image Fusion

FuseFormer: A Transformer for Visual and Thermal Image Fusion

Multimodal Image Fusion based on Hybrid CNN-Transformer and Non-local Cross-modal Attention

Multimodal Token Fusion for Vision Transformers

Rethinking Cross-Attention for Infrared and Visible Image Fusion

THFuse: An Infrared and Visible Image Fusion Network using Transformer and Hybrid Feature Extractor

A Cross-scale Iterative Attentional Adversarial Fusion Network for Infrared and Visible Images

TUFusion: A Transformer-based Universal Fusion Algorithm for Multimodal Images

TransMix: Attend to Mix for Vision Transformers

SMMix: Self-Motivated Image Mixing for Vision Transformers

CrossFuse: A Novel Cross Attention Mechanism based Infrared and Visible Image Fusion Approach