Abstract: Transformers have demonstrated significant promise for computer vision tasks. Particularly noteworthy is SwinUNETR, a vision-transformer-based model that has substantially advanced medical image segmentation. Nevertheless, training SwinUNETR is slow, a limitation primarily attributable to the attention mechanism in its architecture. In this article, to address this limitation, we introduce a novel framework called MetaSwin. Drawing inspiration from the MetaFormer concept, which allows other token-mixing operations in place of attention, we replace the attention-based components of SwinUNETR with a simple yet effective spatial pooling operation. Additionally, we incorporate Squeeze-and-Excitation (SE) blocks after each MetaSwin block of the encoder and into the decoder, with the aim of improving segmentation performance. We evaluate MetaSwin on two medical datasets, BraTS 2023 and MICCAI 2015 BTCV, and compare it with two baselines, SwinUNETR and SwinUNETR+SE. Our results show that MetaSwin, using only a simple pooling operation and efficient SE blocks, is competitive with the baselines. Its consistently superior performance over SwinUNETR on the BTCV dataset is particularly significant. For instance, with a model size of 24, MetaSwin achieves a Dice score of 79.12%, compared with 76.58% for SwinUNETR, while using fewer parameters (15,407,384 vs. 15,703,304) and substantially less training time (300 vs. 467 min).
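To put the BTCV figures quoted above in relative terms, the short sketch below computes the parameter and training-time savings and the Dice gain directly from the reported numbers. It also includes a minimal Dice function for flat binary masks as a reminder of what the metric measures. The helper names `relative_reduction` and `dice_score` are illustrative choices, not part of the authors' code, and the Dice helper is the textbook definition rather than the exact evaluation protocol used in the paper.

```python
# Figures reported in the abstract for the BTCV comparison (model size 24).
swin_params, meta_params = 15_703_304, 15_407_384
swin_minutes, meta_minutes = 467, 300
swin_dice, meta_dice = 76.58, 79.12


def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline` (hypothetical helper)."""
    return (baseline - value) / baseline


def dice_score(pred, target):
    """Textbook Dice, 2|P ∩ T| / (|P| + |T|), for two flat binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks are treated as a perfect match (an assumed convention).
    return 1.0 if total == 0 else 2.0 * intersection / total


param_saving = relative_reduction(swin_params, meta_params)
time_saving = relative_reduction(swin_minutes, meta_minutes)
dice_gain = meta_dice - swin_dice

print(f"parameters: {swin_params - meta_params:,} fewer ({param_saving:.1%})")
print(f"training time: {swin_minutes - meta_minutes} min shorter ({time_saving:.1%})")
print(f"Dice: +{dice_gain:.2f} percentage points")
```

On these numbers the parameter saving is small (roughly 2%), while the training-time saving is large (roughly 36%), which is consistent with the abstract's emphasis on training duration rather than model size.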
This research highlights the value of a simplified transformer framework built from basic elements such as pooling and SE blocks, and their potential to guide the development of medical segmentation models without relying on complex attention-based mechanisms.
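The two building blocks named in the abstract, a pooling token mixer and an SE block, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the MetaSwin implementation: it works on 2D feature maps rather than 3D medical volumes, it follows the PoolFormer convention of subtracting the input from the pooled output (the abstract does not say whether MetaSwin does this), it omits biases and normalization, and the function names `avg_pool_mixer` and `se_block` are hypothetical.

```python
import math


def avg_pool_mixer(x, k=3):
    """Pooling token mixer on one channel (H x W list of lists).

    Each position becomes the mean of its k x k neighbourhood (border
    windows are clipped to the valid region). The input is then subtracted,
    as in PoolFormer, because the surrounding residual connection adds it back.
    """
    h, w, r = len(x), len(x[0]), k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            window = [x[a][b]
                      for a in range(max(0, i - r), min(h, i + r + 1))
                      for b in range(max(0, j - r), min(w, j + r + 1))]
            out[i][j] = sum(window) / len(window) - x[i][j]
    return out


def se_block(feats, w1, w2):
    """Squeeze-and-Excitation over a list of C channel maps.

    w1 has shape (C_reduced, C) and w2 has shape (C, C_reduced).
    """
    # Squeeze: global average pool per channel.
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feats]
    # Excitation: FC -> ReLU -> FC -> sigmoid.
    hidden = [max(0.0, sum(wi * zi for wi, zi in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(wi * hi for wi, hi in zip(row, hidden))))
             for row in w2]
    # Scale: reweight each channel by its gate.
    return [[[g * v for v in row] for row in ch] for ch, g in zip(feats, gates)]


# Usage: two 2 x 2 channels, token mixing with a residual add, then SE.
feats = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
mixed = [[[m + v for m, v in zip(m_row, x_row)]
          for m_row, x_row in zip(avg_pool_mixer(ch), ch)]
         for ch in feats]
w1 = [[0.5, 0.5]]        # illustrative weights: 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]     # 1 hidden unit -> 2 channel gates
recalibrated = se_block(mixed, w1, w2)
```

Unlike windowed self-attention, the mixer has no learnable parameters and its cost grows linearly with the number of tokens, which is the kind of saving the abstract attributes to replacing attention.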

SparseSwin: Swin Transformer with Sparse Transformer Block

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

SPT-Swin: A Shifted Patch Tokenization Swin Transformer for Image Classification

MetaSwin: a unified meta vision transformer model for medical image segmentation

Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window

S-Swin Transformer: simplified Swin Transformer model for offline handwritten Chinese character recognition

SwinIR: Image Restoration Using Swin Transformer

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations

P-Swin: Parallel Swin transformer multi-scale semantic segmentation network for land cover classification

SwinFace: A Multi-task Transformer for Face Recognition, Expression Recognition, Age Estimation and Attribute Estimation

Resolution enhancement processing on low quality images using swin transformer based on interval dense connection strategy

Swin Transformer V2: Scaling Up Capacity and Resolution

Sparse then Prune: Toward Efficient Vision Transformers

SFRSwin: A Shallow Significant Feature Retention Swin Transformer for Fine-Grained Image Classification of Wildlife Species

SwiFT: Swin 4D fMRI Transformer

An Efficient FPGA-Based Accelerator for Swin Transformer

Swin Transformer for Fast MRI