MaS-TransUNet: A Multi-Attention Swin Transformer U-Net for Medical Image Segmentation
Ashwini Kumar Upadhyay,Ashish Kumar Bhandari
DOI: https://doi.org/10.1109/trpms.2024.3477528
2024-01-01
IEEE Transactions on Radiation and Plasma Medical Sciences
Abstract: U-shaped encoder-decoder models have excelled in automatic medical image segmentation due to their hierarchical feature learning capabilities, robustness, and upgradability. Purely CNN-based models are excellent at extracting local details but struggle with long-range dependencies, whereas transformer-based models excel in global context modeling but have higher data and computational requirements. Self-attention-based transformers and other attention mechanisms have been shown to enhance segmentation accuracy in the encoder-decoder framework. Drawing on these challenges and opportunities, we propose a novel Multi-attention Swin Transformer U-Net (MaS-TransUNet) model that incorporates self-attention, edge attention, channel attention, and feedback attention. MaS-TransUNet leverages the strengths of both CNNs and transformers within a U-shaped encoder-decoder framework. For self-attention, we developed modules using Swin Transformer blocks, which offer hierarchical feature representations. We designed specialized modules, including an Edge Attention Module (EAM) to guide the network with edge information, a Feedback Attention Module (FAM) to utilize segmentation masks from the previous epoch for refining subsequent predictions, and a Channel Attention Module (CAM) to focus on relevant feature channels. We also introduced advanced data augmentation, regularization, and an optimal training scheme for enhanced training. Comprehensive experiments across five diverse medical image segmentation datasets demonstrate that MaS-TransUNet significantly outperforms existing state-of-the-art methods while maintaining computational efficiency. It achieves the highest Dice scores of 0.903, 0.841, 0.908, 0.906, and 0.906 on the TCGA-LGG Brain MRI, COVID-19 Lung CT, DSB-2018, Kvasir-SEG, and ISIC-2018 datasets, respectively. These results highlight the model’s robustness and versatility, consistently delivering exceptional performance without modality-specific adaptations.
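
The Dice scores reported above measure the overlap between a predicted mask P and a ground-truth mask T, Dice = 2|P ∩ T| / (|P| + |T|). The following is a minimal sketch of that standard definition for binary masks, not the authors' evaluation code: the function name `dice_score`, the nested-list mask format, and the smoothing term `eps` are assumptions made for illustration, and the abstract does not state how predictions were thresholded or how scores were averaged across images.

```python
from typing import Sequence


def dice_score(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    eps: float = 1e-7,
) -> float:
    """Dice coefficient for two binary masks given as nested lists of 0/1.

    eps is a hypothetical smoothing term that avoids division by zero
    when both masks are empty (the score is then 1.0).
    """
    intersection = 0
    pred_sum = 0
    target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += p * t  # counts pixels set in both masks
            pred_sum += p          # counts pixels set in the prediction
            target_sum += t        # counts pixels set in the ground truth
    return (2.0 * intersection + eps) / (pred_sum + target_sum + eps)


# Toy 2x3 masks: 2 overlapping pixels, 3 predicted, 3 in the ground truth.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
score = dice_score(pred, target)
print(f"Dice: {score:.3f}")
```

On the toy masks the overlap is 2 pixels out of 3 + 3, so the score is about 2/3. A dataset-level figure such as those in the abstract would typically aggregate per-image scores, but the aggregation used in the paper is not specified here.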