Abstract: In recent years, significant progress has been made in the medical image analysis domain using convolutional neural networks (CNNs). In particular, deep neural networks based on a U-shaped architecture (UNet) with skip connections have been adopted for several medical imaging tasks, including organ segmentation. Despite their great success, CNNs are not good at learning global or semantic features, especially those that require human-like reasoning to understand the context. Many UNet architectures have attempted to compensate by introducing Transformer-based self-attention mechanisms, with notable gains in performance. However, transformers inherently suffer from redundancy at shallow layers, where much of the attention computation is spent on nearby pixels that offer limited information. The recently introduced Super Token Attention (STA) mechanism adapts the concept of superpixels from pixel space to token space, using super tokens as compact visual representations. This approach tackles the redundancy by learning efficient global representations in vision transformers, especially for the shallow layers. In this work, we introduce the STA module in the UNet architecture (STA-UNet) to limit redundancy without losing rich information. Experimental results on four publicly available datasets demonstrate the superiority of STA-UNet over existing state-of-the-art architectures in terms of Dice score and IOU for organ segmentation tasks. The code is available at https://github.com/Retinal-Research/STA-UNet.
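
The abstract reports results in terms of Dice score and IOU. As a reminder of what these two overlap metrics measure, here is a minimal Python sketch for a single pair of binary masks. The function name `dice_and_iou`, the flat 0/1 list representation, and the convention for two empty masks are assumptions made for illustration; this is not the authors' evaluation code.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks.

    Masks are assumed to be equal-length flat sequences of 0/1 values
    (an illustrative representation, not the paper's data format).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked as foreground in both masks
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection

    if union == 0:
        # Assumption: two empty masks count as perfect agreement
        return 1.0, 1.0

    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy example: 3 predicted pixels, 3 ground-truth pixels, 2 shared
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(f"Dice = {dice:.3f}, IoU = {iou:.3f}")
```

In the toy example the masks share 2 of 4 distinct foreground pixels, giving a Dice of 2/3 and an IoU of 0.5. Dice is never lower than IoU for the same pair of masks, which is worth keeping in mind when comparing the two columns of a results table.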

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the redundancy issues present in Convolutional Neural Networks (CNNs) and Transformer-based UNet architectures in medical image segmentation. Specifically:

1. **Limitations of CNNs**: Although CNNs have made significant progress in medical image analysis, especially in organ segmentation tasks, they perform poorly in learning global or semantic features. CNNs are particularly inadequate for tasks that require human-like reasoning to understand context.
2. **Limitations of Transformers**: While adding Transformer self-attention mechanisms to UNet architectures has improved on CNN-only performance, Transformers suffer from redundancy in their shallow layers: computing attention over neighboring pixels adds computational load while providing limited information.
3. **Redundancy Issue**: Existing research has not sufficiently explored or addressed this redundancy issue. The authors' preliminary analysis reveals that in the shallow modules of Transformer-based UNet architectures, there is a high similarity between different blocks. This indicates that the model exhibits a lazy learning pattern in these layers, failing to effectively capture and encode complex contextual information.

### Solution

To address the above issues, the authors propose **STA-UNet**, which integrates the **Super Token Attention (STA)** module into the UNet architecture. The STA module reduces redundancy and retains rich semantic information through the following methods:

1. **Super Token Attention (STA) Module**: The STA module extends the concept of superpixels from pixel space to token space, using super tokens as compact visual representations. This approach reduces redundancy by learning efficient global representations, especially in shallow layers.
2. **Multi-Head Setting**: The STA module employs a multi-head setting, increasing the number of attention heads to capture more regional information and to determine its importance for decision-making.
3. **Parameter Optimization**: The authors conducted ablation studies to explore the impact of different token sizes and the number of attention heads on model performance, ultimately selecting the optimal parameter configuration.

### Experimental Results

The authors conducted experiments on four public datasets: Synapse multi-organ segmentation, Automated Cardiac Diagnosis Challenge (ACDC), nuclei segmentation (MoNuSeg), and gland segmentation in colon tissue slices (GlaS). The experimental results show that STA-UNet outperforms existing state-of-the-art methods in metrics such as Dice coefficient and Intersection over Union (IOU), particularly excelling in organ segmentation tasks.

### Main Contributions

1. **Revealing the Redundancy Issue**: The authors highlight the redundancy issue in the shallow modules of Transformer-based UNet architectures, promoting further research in this area.
2. **Integrating the STA Module**: The authors integrate the STA module into the UNet architecture, reducing the redundancy observed in other Transformer-based UNet models while retaining rich semantic information.
3. **Experimental Validation**: Through comprehensive evaluations on multiple datasets, the authors demonstrate the superiority of the proposed method in organ segmentation tasks.
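
To make the super token idea from the Solution section more concrete, the following is a simplified NumPy sketch of the general three-step pattern: group tokens into a smaller set of super tokens, run self-attention among the super tokens only, then map the result back to the original tokens. It is an illustration under stated assumptions, not the STA-UNet implementation: it uses a 1D token sequence, a single head, no learned projections, and dense token-to-super-token association. The names `super_token_attention`, `num_super`, and `n_iter` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def super_token_attention(tokens, num_super, n_iter=1):
    """Simplified super-token attention over a (N, d) token array.

    Assumptions for illustration: N is divisible by num_super, a single
    attention head, and identity query/key/value projections.
    Returns the refined tokens (N, d) and the association map (N, num_super).
    """
    n, d = tokens.shape
    if n % num_super != 0:
        raise ValueError("num_super must divide the number of tokens")
    if n_iter < 1:
        raise ValueError("n_iter must be at least 1")
    scale = 1.0 / np.sqrt(d)

    # Step 1a: initialise super tokens by averaging contiguous token groups
    supers = tokens.reshape(num_super, n // num_super, d).mean(axis=1)

    for _ in range(n_iter):
        # Step 1b: soft association of each token with each super token
        assoc = softmax(tokens @ supers.T * scale, axis=1)  # (N, M)
        # Step 1c: update super tokens as association-weighted token means
        weights = assoc / (assoc.sum(axis=0, keepdims=True) + 1e-12)
        supers = weights.T @ tokens  # (M, d)

    # Step 2: self-attention among the M super tokens (M x M, not N x N)
    attn = softmax(supers @ supers.T * scale, axis=1)
    supers_out = attn @ supers

    # Step 3: broadcast the refined super tokens back to the N tokens
    out = assoc @ supers_out  # (N, d)
    return out, assoc


rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))  # 16 tokens with 8 channels each
out, assoc = super_token_attention(tokens, num_super=4)
print(out.shape, assoc.shape)
```

With 16 tokens and 4 super tokens, the attention matrix in step 2 is 4 x 4 rather than 16 x 16, which is where the reduction in redundant local attention comes from. A multi-head variant, as described in the Solution section, would split the channel dimension and run this procedure per head.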

STA-Unet: Rethink the semantic redundant for Medical Imaging Segmentation
