STM-UNet: An Efficient U-shaped Architecture Based on Swin Transformer and Multi-scale MLP for Medical Image Segmentation

Lei Shi, Tianyu Gao, Zheng Zhang, Junxing Zhang
2023-04-25
Abstract: Automated medical image segmentation can help doctors diagnose faster and more accurately. Deep-learning-based models for medical image segmentation have made great progress in recent years. However, existing models fail to make effective use of Transformers and MLPs to improve the U-shaped architecture efficiently. In addition, the multi-scale features of the MLP have not been fully extracted in the bottleneck of the U-shaped architecture. In this paper, we propose an efficient U-shaped architecture based on Swin Transformer and multi-scale MLP, namely STM-UNet. Specifically, the Swin Transformer block is added to the skip connections of STM-UNet in the form of a residual connection, which enhances the modeling of global features and long-range dependencies. Meanwhile, a novel PCAS-MLP with a parallel convolution module is designed and placed in the bottleneck of our architecture to improve segmentation performance. The experimental results on ISIC 2016 and ISIC 2018 demonstrate the effectiveness of our proposed method. Our method also outperforms several state-of-the-art methods in terms of IoU and Dice, and achieves a better trade-off between high segmentation accuracy and low model complexity.
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper attempts to address is the failure of existing medical image segmentation models to effectively combine the advantages of Convolutional Neural Networks (CNN), Transformers, and multi-scale multi-layer perceptrons (MLP), especially in U-shaped architectures where these modules' characteristics are not fully utilized to enhance segmentation performance. Specifically:

1. **Existing models fail to effectively integrate global and local features**: Current medical image segmentation models do not effectively combine the strengths of Transformers and MLPs, particularly in U-shaped architectures where the Transformer’s ability to model global features and long-range dependencies is not fully leveraged.
2. **Insufficient multi-scale feature extraction**: In the bottleneck part of the U-shaped architecture, existing models fail to fully extract the multi-scale features of MLPs, resulting in underutilized classification capabilities.
3. **High model complexity**: Most models based on CNNs and Transformers have high complexity, making them unsuitable for deployment on mobile devices for training or inference. In certain specific tasks (such as skin lesion segmentation), increasing model complexity does not further improve segmentation accuracy.

To address the above issues, the paper proposes a new U-shaped architecture, STM-UNet, which improves existing models in the following ways:

- **Adding Swin Transformer blocks in skip connections**: Swin Transformer blocks are added in the skip connections of the U-shaped architecture in the form of residual connections, enhancing the ability to model global features and long-range dependencies.
- **Designing the PCAS-MLP module**: A new module, PCAS-MLP, is introduced in the bottleneck part of the U-shaped architecture, which extracts multi-scale features through parallel convolution modules, thereby improving pixel classification capabilities.
- **Balancing segmentation accuracy and model complexity**: STM-UNet not only outperforms several state-of-the-art methods in segmentation accuracy but also maintains lower model complexity, making it suitable for deployment on mobile devices.

With these improvements, experimental results on the ISIC 2016 and ISIC 2018 datasets show that STM-UNet outperforms various existing methods in terms of Intersection over Union (IoU) and Dice coefficient, demonstrating its effectiveness and advancement.
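
For readers unfamiliar with the two reported metrics, the sketch below shows how IoU and Dice are commonly computed for a pair of binary segmentation masks. It is an illustration using the standard definitions, not the paper's evaluation code: the function name `iou_and_dice`, the nested-list mask representation, and the convention of scoring two empty masks as a perfect match are all assumptions made here.

```python
from typing import Sequence, Tuple


def iou_and_dice(
    pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]
) -> Tuple[float, float]:
    """Return (IoU, Dice) for two binary masks of identical shape.

    Hypothetical helper for illustration; masks are nested sequences
    where any truthy value marks a foreground (lesion) pixel.
    """
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    intersection = pred_area = target_area = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p, t = bool(p), bool(t)
            intersection += p and t  # pixel is foreground in both masks
            pred_area += p
            target_area += t

    union = pred_area + target_area - intersection
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0

    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice


if __name__ == "__main__":
    predicted = [[1, 1, 0], [0, 1, 0]]
    reference = [[1, 0, 0], [0, 1, 1]]
    iou, dice = iou_and_dice(predicted, reference)
    print(f"IoU={iou:.3f} Dice={dice:.3f}")  # prints "IoU=0.500 Dice=0.667"
```

For a single pair of masks the two scores are monotonically related (Dice = 2·IoU / (1 + IoU)), so either one ranks individual predictions the same way; their averages over a dataset can still rank methods differently, which may be why results are often reported with both.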