Abstract: Automated medical image segmentation can help doctors diagnose faster and more accurately. Deep-learning-based models for medical image segmentation have made great progress in recent years. However, existing models fail to leverage Transformers and MLPs effectively to improve the U-shaped architecture efficiently. In addition, the multi-scale features of the MLP have not been fully extracted in the bottleneck of the U-shaped architecture. In this paper, we propose an efficient U-shaped architecture based on Swin Transformer and multi-scale MLP, namely STM-UNet. Specifically, a Swin Transformer block is added to the skip connection of STM-UNet in the form of a residual connection, which enhances the modeling of global features and long-range dependencies. Meanwhile, a novel PCAS-MLP with a parallel convolution module is designed and placed in the bottleneck of our architecture to improve segmentation performance. Experimental results on ISIC 2016 and ISIC 2018 demonstrate the effectiveness of the proposed method. Our method also outperforms several state-of-the-art methods in terms of IoU and Dice, and it achieves a better trade-off between high segmentation accuracy and low model complexity.
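The residual placement described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `residual_skip` and `global_mean_block` are hypothetical names, features are a flat list of floats rather than an image tensor, and a global-mean mixer stands in for the Swin Transformer block only to show where a global operation would sit on the skip path.

```python
from typing import Callable, List

Vector = List[float]


def residual_skip(features: Vector, block: Callable[[Vector], Vector]) -> Vector:
    """Return features + block(features), elementwise.

    The encoder features pass through unchanged (identity path) and the
    block's output is added on top, which is what "in the form of a
    residual connection" means structurally.
    """
    refined = block(features)
    if len(refined) != len(features):
        raise ValueError("block must preserve the feature shape")
    return [f + r for f, r in zip(features, refined)]


def global_mean_block(features: Vector) -> Vector:
    """Hypothetical stand-in for a Swin Transformer block.

    Every position receives the mean over all positions, so each output
    depends on the whole input. A real Swin block uses windowed
    self-attention instead; this is only a placeholder.
    """
    mean = sum(features) / len(features)
    return [mean for _ in features]


encoder_features = [1.0, 2.0, 3.0]
skip_output = residual_skip(encoder_features, global_mean_block)
print(skip_output)
```

Because the identity path is kept, a block that outputs zeros leaves the encoder features untouched, so the added block can only refine what the plain skip connection already carried.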

What problem does this paper attempt to address?

The main problem this paper attempts to address is that existing medical image segmentation models do not effectively combine the advantages of Convolutional Neural Networks (CNNs), Transformers, and multi-scale multi-layer perceptrons (MLPs), especially in U-shaped architectures, where the characteristics of these modules are not fully used to improve segmentation performance. Specifically:

1. **Existing models fail to effectively integrate global and local features**: Current medical image segmentation models do not effectively combine the strengths of Transformers and MLPs, particularly in U-shaped architectures, where the Transformer's ability to model global features and long-range dependencies is not fully leveraged.
2. **Insufficient multi-scale feature extraction**: In the bottleneck of the U-shaped architecture, existing models fail to fully extract the multi-scale features of MLPs, which leaves their classification capability underused.
3. **High model complexity**: Most models based on CNNs and Transformers have high complexity, making them unsuitable for training or inference on mobile devices. In certain tasks (such as skin lesion segmentation), increasing model complexity does not further improve segmentation accuracy.

To address these issues, the paper proposes a new U-shaped architecture, STM-UNet, which improves on existing models in the following ways:

- **Adding Swin Transformer blocks in skip connections**: Swin Transformer blocks are added to the skip connections of the U-shaped architecture in the form of residual connections, enhancing the ability to model global features and long-range dependencies.
- **Designing the PCAS-MLP module**: A new module, PCAS-MLP, is introduced in the bottleneck of the U-shaped architecture. It extracts multi-scale features through parallel convolution modules, thereby improving pixel classification.
- **Balancing segmentation accuracy and model complexity**: STM-UNet not only outperforms several state-of-the-art methods in segmentation accuracy but also maintains lower model complexity, making it suitable for deployment on mobile devices.

With these improvements, experimental results on the ISIC 2016 and ISIC 2018 datasets show that STM-UNet outperforms several existing methods in terms of Intersection over Union (IoU) and Dice coefficient, demonstrating its effectiveness.
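For readers unfamiliar with the two reported metrics, the sketch below computes IoU and the Dice coefficient for binary masks using their standard definitions, IoU = TP / (TP + FP + FN) and Dice = 2TP / (2TP + FP + FN). It is not the paper's evaluation code: the function names, the flat 0/1 list representation of a mask, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def confusion_counts(pred: Sequence[int], target: Sequence[int]) -> Tuple[int, int, int]:
    """Count true positives, false positives and false negatives for 0/1 masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    return tp, fp, fn


def iou(pred: Sequence[int], target: Sequence[int]) -> float:
    """Intersection over Union: TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, target)
    union = tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return tp / union if union else 1.0


def dice(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient: 2TP / (2TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    # Same assumed convention for two empty masks.
    return 2 * tp / denom if denom else 1.0


# Flattened toy masks: 2 pixels agree on the lesion, 1 false positive, 1 false negative.
predicted = [1, 1, 1, 0, 0, 0]
reference = [1, 1, 0, 1, 0, 0]

print(f"IoU  = {iou(predicted, reference):.4f}")
print(f"Dice = {dice(predicted, reference):.4f}")
```

For a single pair of masks the two metrics are tied by Dice = 2·IoU / (1 + IoU), so they always rank one prediction against another the same way. Averages over a dataset can still differ in ranking, which is one reason papers report both.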

STM-UNet: An Efficient U-shaped Architecture Based on Swin Transformer and Multi-scale MLP for Medical Image Segmentation

Mixed Transformer U-Net for Medical Image Segmentation

TF-Unet: An Automatic Cardiac MRI Image Segmentation Method

ST-Unet: Swin Transformer boosted U-Net with Cross-Layer Feature Enhancement for medical image segmentation

DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation

SAttisUNet: UNet-like Swin Transformer with Attentive Skip Connections for Enhanced Medical Image Segmentation

Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation

Swin-TransUper: Swin Transformer-based UperNet for medical image segmentation

MS-UNet: Multi-Scale Nested UNet for Medical Image Segmentation with Few Training Data Based on an ELoss and Adaptive Denoising Method

MM-UNet: A Mixed MLP Architecture for Improved Ophthalmic Image Segmentation

FTUNet: A Feature-Enhanced Network for Medical Image Segmentation Based on the Combination of U-Shaped Network and Vision Transformer

MSCT-UNET: multi-scale contrastive transformer within U-shaped network for medical image segmentation

xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart

SSTrans-Net: Smart Swin Transformer Network for medical image segmentation

MSR-UNet: enhancing multi-scale and long-range dependencies in medical image segmentation

MS-UNet-v2: Adaptive Denoising Method and Training Strategy for Medical Image Segmentation with Small Training Data

Sfe-Transunet: A Transformer-Based U-Net With Skipped Features Enhancer For Medical Image Segmentation

SC-UneXt: Nested UNeXt Architecture based on Medical Image Segmentation

SeUNet-Trans: A Simple yet Effective UNet-Transformer Model for Medical Image Segmentation

TransU²-Net: An Effective Medical Image Segmentation Framework Based on Transformer and U²-Net