Abstract: Despite numerous advances in Convolutional Neural Networks (CNNs) and Transformers, particularly in medical image segmentation, two fundamental issues remain. First, segmentation models often struggle to model global context across multiple scales, which accurate segmentation requires. Second, processing high-resolution medical images and producing fine-grained predictions demands significant computational resources. UNet-like encoder-decoder architectures, still the most widely used in many state-of-the-art applications, struggle to address both problems. UNet's naive skip connections help recover spatial information, but they combine features from different layers without accounting for their differences, so they fall short in capturing hierarchical relationships across scales and the overall context of the image, which leads to less accurate segmentation. We propose an enhanced UNet-like Transformer-based framework with attentive skip connections to tackle these problems. First, instead of simply integrating encoder features with the decoder, we add a Transformer-based skip connection module. Second, we optimize the computation within this module by employing a merging cross-covariance attention mechanism rather than conventional self-attention; this not only bridges the gaps between multiple levels of semantics and captures more complex dependencies, but also processes high-resolution images more efficiently owing to its linear complexity in the number of tokens. While retaining the U-shaped encoder-decoder structure, we also replace UNet's CNN layers with hierarchically equivalent Swin Transformer blocks, capturing both global interactions and local dependencies.
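
To illustrate why cross-covariance attention scales linearly with the number of tokens, here is a minimal single-head NumPy sketch. It follows the general cross-covariance attention formulation from XCiT, not necessarily the paper's exact module. The abstract does not say how the "merging" of multi-level features is done, so that step is left out. The function name, the weight matrices `w_q`, `w_k`, `w_v`, and the fixed temperature `tau` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_covariance_attention(x, w_q, w_k, w_v, tau=1.0, eps=1e-6):
    """Single-head cross-covariance attention (illustrative sketch).

    x            : (N, d) array of N tokens with d channels
    w_q/w_k/w_v  : (d, d) projection matrices (hypothetical names)
    tau          : temperature (assumed fixed here; it could be learned)
    Returns (out, attn) with out of shape (N, d) and attn of shape (d, d).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v  # each (N, d)

    # L2-normalise every channel over the token axis so that the
    # channel-to-channel similarities below stay bounded.
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=0, keepdims=True) + eps)

    # Attention is computed between channels, not tokens: the map is
    # (d, d) regardless of how many tokens N there are.
    attn = softmax((q.T @ k) / tau, axis=-1)  # (d, d)

    # Each output channel is a weighted mix of the value channels.
    out = v @ attn.T  # (N, d)
    return out, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 32
    w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    for n_tokens in (196, 3136):  # e.g. 14x14 and 56x56 feature maps
        tokens = rng.standard_normal((n_tokens, d))
        out, attn = cross_covariance_attention(tokens, w_q, w_k, w_v)
        print(n_tokens, out.shape, attn.shape)
```

Because the attention map is d × d, the dominant cost is the two products involving the (N, d) matrices, roughly O(N·d²). Token-wise self-attention builds an N × N map instead, which costs roughly O(N²·d).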

Enhancing medical image segmentation with a multi-transformer U-Net

Mixed Transformer U-Net for Medical Image Segmentation

TF-Unet: An Automatic Cardiac MRI Image Segmentation Method

Swin-TransUper: Swin Transformer-based UperNet for medical image segmentation

SwinUNETR-V2: Stronger Swin Transformers with Stagewise Convolutions for 3D Medical Image Segmentation

Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation

SSTrans-Net: Smart Swin Transformer Network for medical image segmentation

DSTUNet: UNet with Efficient Dense SWIN Transformer Pathway for Medical Image Segmentation

DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation

High-Resolution Swin Transformer for Automatic Medical Image Segmentation

CT-Net: Asymmetric compound branch Transformer for medical image segmentation

SAttisUNet: UNet-like Swin Transformer with Attentive Skip Connections for Enhanced Medical Image Segmentation

MaS-TransUNet: A Multi-Attention Swin Transformer U-Net for Medical Image Segmentation

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

Swin-Net: A Swin-Transformer-Based Network Combing with Multi-Scale Features for Segmentation of Breast Tumor Ultrasound Images

Sfe-Transunet: A Transformer-Based U-Net With Skipped Features Enhancer For Medical Image Segmentation

SWTRU: Star-shaped Window Transformer Reinforced U-Net for medical image segmentation

Going Beyond U-Net: Assessing Vision Transformers for Semantic Segmentation in Microscopy Image Analysis

A Combined Deformable Model and Medical Transformer Algorithm for Medical Image Segmentation

Multiscale Transunet++: Dense Hybrid U-Net with Transformer for Medical Image Segmentation