STA-Unet: Rethink the semantic redundant for Medical Imaging Segmentation

Vamsi Krishna Vasa, Wenhui Zhu, Xiwen Chen, Peijie Qiu, Xuanzhao Dong, Yalin Wang
2024-10-13
Abstract: In recent years, significant progress has been made in the medical image analysis domain using convolutional neural networks (CNNs). In particular, deep neural networks based on a U-shaped architecture (UNet) with skip connections have been adopted for several medical imaging tasks, including organ segmentation. Despite their great success, CNNs are not good at learning global or semantic features, especially those that require human-like reasoning to understand the context. Many UNet architectures have attempted to compensate by introducing Transformer-based self-attention mechanisms, and notable performance gains have been reported. However, transformers tend to learn redundant features at shallow layers, so much of the attention computation is spent on nearby pixels that offer limited information. The recently introduced Super Token Attention (STA) mechanism adapts the concept of superpixels from pixel space to token space, using super tokens as compact visual representations. This approach tackles the redundancy by learning efficient global representations in vision transformers, especially in the shallow layers. In this work, we introduce the STA module into the UNet architecture (STA-UNet) to limit redundancy without losing rich information. Experimental results on four publicly available datasets demonstrate the superiority of STA-UNet over existing state-of-the-art architectures in terms of Dice score and IOU for organ segmentation tasks. The code is available at https://github.com/Retinal-Research/STA-UNet.
Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address the redundancy issues present in Convolutional Neural Network (CNN) and Transformer-based UNet architectures for medical image segmentation. Specifically:

1. **Limitations of CNNs**: Although CNNs have made significant progress in medical image analysis, especially in organ segmentation tasks, they perform poorly at learning global or semantic features. CNNs are particularly inadequate for tasks that require human-like reasoning to understand context.
2. **Limitations of Transformers**: While adding Transformer self-attention has improved the performance of CNN-based UNet architectures, Transformers learn redundant features in shallow layers. This increases the cost of computing attention over neighboring pixels that provide limited information.
3. **Redundancy Issue**: Existing research has not sufficiently explored or addressed this redundancy issue. The authors' preliminary analysis reveals a high similarity between different blocks in the shallow modules of Transformer-based UNet architectures. This indicates that the model exhibits a lazy learning pattern in these layers, failing to effectively capture and encode complex contextual information.

### Solution

To address the above issues, the authors propose **STA-UNet**, which integrates the **Super Token Attention (STA)** module into the UNet architecture. The STA module reduces redundancy while retaining rich semantic information in the following ways:

1. **Super Token Attention (STA) Module**: The STA module extends the concept of superpixels from pixel space to token space, using super tokens as compact visual representations. This approach reduces redundancy by learning efficient global representations, especially in shallow layers.
2. **Multi-Head Setting**: The STA module employs a multi-head setting, increasing the number of attention heads to capture more regional information and weigh its importance for the final prediction.
3. **Parameter Optimization**: The authors conducted ablation studies on the impact of different token sizes and numbers of attention heads on model performance, and selected the best-performing configuration.

### Experimental Results

The authors conducted experiments on four public datasets: Synapse multi-organ segmentation, the Automated Cardiac Diagnosis Challenge (ACDC), nuclei segmentation (MoNuSeg), and gland segmentation in colon tissue slices (GlaS). The results show that STA-UNet outperforms existing state-of-the-art methods on metrics such as the Dice coefficient and Intersection over Union (IOU), particularly in organ segmentation tasks.

### Main Contributions

1. **Revealing the Redundancy Issue**: The authors highlight the redundancy issue in the shallow modules of Transformer-based UNet architectures, promoting further research in this area.
2. **Integrating the STA Module**: The authors integrate the STA module into the UNet architecture, reducing the redundancy observed in other Transformer-based UNet models while retaining rich semantic information.
3. **Experimental Validation**: Through comprehensive evaluations on multiple datasets, the authors demonstrate the superiority of the proposed method in organ segmentation tasks.
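
### Illustration: Dice and IOU

The two metrics named in the experimental results are standard overlap measures between a predicted mask and a ground-truth mask. The following is a minimal sketch for binary masks, not the authors' evaluation code. The function names `dice_score` and `iou_score`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
import numpy as np


def dice_score(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(2.0 * intersection / total)


def iou_score(pred, target):
    """Intersection over Union |A∩B| / |A∪B| for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)


# Toy 4x4 masks (hypothetical data): each covers 4 pixels, 2 of them shared.
pred = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])
target = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 0, 0, 0],
                   [0, 0, 0, 0]])

dice = dice_score(pred, target)  # 2*2 / (4+4) = 0.5
iou = iou_score(pred, target)    # 2 / 6, about 0.333
print(f"Dice={dice:.3f}  IOU={iou:.3f}")
```

For a single pair of masks the two metrics are tied by IOU = Dice / (2 - Dice), so they always rank one prediction against another the same way. Averages over a dataset can still diverge, which is one reason both are commonly reported. For multi-organ datasets, a per-class score is typically computed on each class's binary mask and then averaged.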