Abstract: Background and Objectives: Transformers using self-attention mechanisms have recently advanced medical imaging by modeling long-range semantic dependencies, though they lack CNNs’ ability to capture local spatial details. This study introduced a novel segmentation network derived from a mixed CNN-Transformer (MixFormer) feature extraction backbone to enhance medical image segmentation. Method: The MixFormer network integrates global information from the Transformer and local information from the CNN during the downsampling process. To capture inter-scale relationships, we introduced a Multi-scale Spatial-aware Fusion (MSAF) module, enabling effective interaction between coarse and fine feature representations. Additionally, we proposed a Mixed Multi-branch Dilated Attention (MMDA) module to bridge the semantic gap between the encoding and decoding stages while emphasizing specific regions. Lastly, we implemented a CNN-based upsampling approach to recover low-level features, improving segmentation accuracy. Results: Experimental validation on widely used medical image datasets demonstrated the strong performance of MixFormer. On the Synapse dataset, our approach achieved a mean Dice Similarity Coefficient (DSC) of 82.64% and a mean Hausdorff Distance (HD) of 12.67 mm. On the ACDC dataset, the DSC was 91.01%. On the ISIC 2018 dataset, the model achieved a mean Intersection over Union (mIoU) of 0.841, Accuracy of 0.958, Precision of 0.910, Recall of 0.934, and an F1 score of 0.913. On the Kvasir-SEG dataset, we recorded a mean Dice of 0.9247, mIoU of 0.8615, Precision of 0.9181, and Recall of 0.9463. On the CVC-ClinicDB dataset, the results were a mean Dice of 0.9441, mIoU of 0.8922, Precision of 0.9437, and Recall of 0.9458. Conclusion: These findings indicate that MixFormer outperforms most mainstream segmentation networks, including CNN-based and other Transformer-based architectures.
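For readers unfamiliar with the overlap metrics reported above, the following is a minimal sketch of how the Dice coefficient and Intersection over Union are commonly computed for a single pair of binary masks. It is not the authors' evaluation code: the function names, the nested-list mask representation, and the `eps` smoothing term are assumptions made for illustration, and the reported scores are means over classes or images that this sketch does not reproduce.

```python
from typing import Sequence, Tuple

# A binary mask as rows of 0/1 values (illustrative representation only).
Mask = Sequence[Sequence[int]]


def confusion_counts(pred: Mask, target: Mask) -> Tuple[int, int, int]:
    """Count true positives, false positives, and false negatives."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def dice(pred: Mask, target: Mask, eps: float = 1e-7) -> float:
    """Dice = 2*TP / (2*TP + FP + FN); eps avoids division by zero."""
    tp, fp, fn = confusion_counts(pred, target)
    return (2 * tp + eps) / (2 * tp + fp + fn + eps)


def iou(pred: Mask, target: Mask, eps: float = 1e-7) -> float:
    """IoU = TP / (TP + FP + FN); eps avoids division by zero."""
    tp, fp, fn = confusion_counts(pred, target)
    return (tp + eps) / (tp + fp + fn + eps)


if __name__ == "__main__":
    pred = [[1, 1, 0],
            [0, 1, 0]]
    target = [[1, 0, 0],
              [0, 1, 1]]
    # Here TP = 2, FP = 1, FN = 1.
    print(f"Dice: {dice(pred, target):.4f}")
    print(f"IoU:  {iou(pred, target):.4f}")
```

For the same prediction, Dice is never lower than IoU, which is consistent with the abstract reporting a higher mean Dice than mIoU on Kvasir-SEG and CVC-ClinicDB.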

CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

MixFormer: a Mixed CNN-Transformer Backbone for Medical Image Segmentation

CRmix: A Regularization by Clipping Images and Replacing Mixed Samples for Imbalanced Classification

PointCutMix: Regularization strategy for point cloud classification

Enhanced Long-Tailed Recognition with Contrastive CutMix Augmentation

LMix: Regularization Strategy for Convolutional Neural Networks

PatchMix: patch-level mixup for data augmentation in convolutional neural networks

LGCOAMix: Local and Global Context-and-Object-Part-Aware Superpixel-Based Data Augmentation for Deep Visual Recognition

ResizeMix: Mixing Data with Preserved Object Information and True Labels

RecursiveMix: Mixed Learning with History

Catch-Up Mix: Catch-Up Class for Struggling Filters in CNN

Provable Benefit of Cutout and CutMix for Feature Learning

SpliceMix: A Cross-scale and Semantic Blending Augmentation Strategy for Multi-label Image Classification

RC-Mixup: A Data Augmentation Strategy against Noisy Data for Regression Tasks

ITMix: Image-Text Mix Augmentation for Transferring CLIP to Image Classification

FMixCutMatch for Semi-Supervised Deep Learning

Saliency detection-guided for image data augmentation

SMMix: Self-Motivated Image Mixing for Vision Transformers

ColMix -- A Simple Data Augmentation Framework to Improve Object Detector Performance and Robustness in Aerial Images

RobustMixGen: Data augmentation for enhancing robustness of visual-language models in the presence of distribution shift