Abstract: The accurate segmentation of medical images is crucial for diagnosing and treating diseases. Recent studies demonstrate that vision transformer-based methods have significantly improved performance in medical image segmentation, primarily due to their superior ability to establish global relationships among features and adaptability to various inputs. However, these methods struggle with the low signal-to-noise ratio inherent to medical images. Additionally, the effective utilization of channel and spatial information, which are essential for medical image segmentation, is limited by the representation capacity of self-attention. To address these challenges, we propose a multi-dimension transformer with attention-based filtering (MDT-AF), which redesigns the patch embedding and self-attention mechanism for medical image segmentation. MDT-AF incorporates an attention-based feature filtering mechanism into the patch embedding blocks and employs a coarse-to-fine process to mitigate the impact of low signal-to-noise ratio. To better capture complex structures in medical images, MDT-AF extends the self-attention mechanism to incorporate spatial and channel dimensions, enriching feature representation. Moreover, we introduce an interaction mechanism to improve the feature aggregation between spatial and channel dimensions. Experimental results on three public medical image segmentation benchmarks show that MDT-AF achieves state-of-the-art (SOTA) performance.

What problem does this paper attempt to address?

This paper addresses several key problems in medical image segmentation:

1. **Low signal-to-noise ratio (SNR)**: Medical images usually have a low signal-to-noise ratio, which makes feature learning and discrimination difficult. Existing Transformer-based methods handle this poorly, which degrades segmentation performance.
2. **Effective utilization of spatial and channel information**: In medical image segmentation, spatial and channel information is crucial for accurately capturing complex structures. However, existing self-attention mechanisms are limited in how well they represent and aggregate this multi-dimensional information, especially for medical images of different shapes and scenes.
3. **Global relationship modeling**: Although Transformer-based methods can establish global relationships between features and adapt to various inputs, they still face challenges with long-distance dependencies, especially in medical images where the shapes and sizes of target regions may vary greatly.

To address these problems, the authors propose a multi-dimension Transformer model called MDT-AF (Multi-dimension Transformer with Attention-based Filtering). Its main improvements are:

- **Patch embedding with attention-based filtering**: MDT-AF redesigns the patch embedding module and adds a parallel attention-based filtering branch that refines coarse features and reduces noise. The branch generates attention weights and uses them to filter the initial features, which improves feature quality and lets the model focus on relevant signals.
- **Self-attention extended to spatial and channel dimensions**: MDT-AF extends the self-attention mechanism to the spatial and channel dimensions to better capture complex structures. Feature interaction and aggregation within the block enrich the feature representation and improve modeling of medical images of different shapes and scenes.
- **Feature interaction mechanism**: An interaction mechanism dynamically re-weights the features from the spatial or channel dimension so that the features of the two branches are fused more effectively.

With these improvements, MDT-AF achieves state-of-the-art performance on three publicly available medical image segmentation benchmark datasets, demonstrating its advantages in dealing with low signal-to-noise ratios and complex structures.

### Formula summary

- **Attention-based filtering**: \[ X_{out}(i) = A \odot F_1 \] where \(A\) is the attention weight, \(F_1\) is the coarse feature obtained from overlapping patch embedding, and \(\odot\) denotes element-wise multiplication.
- **Multi-scale feature extraction**: \[ V_{mpmv} = \text{Concat}(V_{l1}, V_{l2}, V_{l3}) \] where \(V_{l1}\), \(V_{l2}\), and \(V_{l3}\) are the feature maps obtained by convolutions with different dilation rates.
- **Efficient self-attention (ESA)**: \[ Y_E = \text{ESA}(X_{in}) + X_{in} \]
- **Spatial self-attention (SSA)**: \[ Y_S = (I_c(Y_{Sp}, Y_{local}) + I_s(Y_{local}, Y_{Sp})) W_{merge} + X_{in} \]
- **Channel self-attention (CSA)**: \[ Y_C=(I_s(Y_{Ch}, Y_{local})+I_c(Y_{local}, Y_{Ch}))W_
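
The first two formulas in the summary, the filtering step \(X_{out} = A \odot F_1\) and the multi-scale concatenation \(V_{mpmv} = \text{Concat}(V_{l1}, V_{l2}, V_{l3})\), can be illustrated with a small pure-Python sketch. This is not the authors' implementation. It assumes that the attention weights are produced by applying a sigmoid to some logits (the summary does not say how \(A\) is computed), and it uses 1D signals, a kernel of size 3, and dilation rates 1, 2, 3 purely for illustration. The names `attention_filter`, `dilated_conv1d`, and `multi_scale_features` are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def attention_filter(coarse, logits):
    """Element-wise gating X_out = A * F_1.

    `coarse` plays the role of F_1 (a 2D list of features).
    `logits` is a same-shaped 2D list; A = sigmoid(logits) is an
    ASSUMPTION about how the attention weights are obtained.
    """
    if len(coarse) != len(logits):
        raise ValueError("coarse and logits must have the same shape")
    out = []
    for f_row, l_row in zip(coarse, logits):
        if len(f_row) != len(l_row):
            raise ValueError("coarse and logits must have the same shape")
        out.append([sigmoid(l) * f for f, l in zip(f_row, l_row)])
    return out


def dilated_conv1d(signal, kernel, dilation):
    """1D dilated convolution (cross-correlation) with zero 'same' padding."""
    center = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - center) * dilation
            if 0 <= idx < n:  # positions outside the signal count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def multi_scale_features(signal, kernel=(1.0, 1.0, 1.0), dilations=(1, 2, 3)):
    """Concat(V_l1, V_l2, V_l3): one output channel per dilation rate.

    The kernel and the dilation rates are illustrative defaults,
    not values taken from the paper.
    """
    return [dilated_conv1d(signal, kernel, d) for d in dilations]
```

With logits of zero every weight is 0.5, so `attention_filter([[2.0, 4.0]], [[0.0, 0.0]])` halves each feature; strongly negative logits suppress a feature almost entirely, which is the sense in which the branch filters noise. `multi_scale_features` returns one channel per dilation rate, each the same length as the input, so larger dilations widen the receptive field without changing resolution.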

Multi-dimension Transformer with Attention-based Filtering for Medical Image Segmentation

Mixed Transformer U-Net for Medical Image Segmentation

MixFormer: a Mixed CNN-Transformer Backbone for Medical Image Segmentation

HMDA: A Hybrid Model with Multi-scale Deformable Attention for Medical Image Segmentation

VSmTrans: A Hybrid Paradigm Integrating Self-attention and Convolution for 3D Medical Image Segmentation

MSGAT: Multi-scale gated axial reverse attention transformer network for medical image segmentation

Slimmable transformer with hybrid axial-attention for medical image segmentation

MultiTrans: Multi-branch transformer network for medical image segmentation

Parameter-Efficient Transformer with Hybrid Axial-Attention for Medical Image Segmentation

Multi-Aperture Fusion of Transformer-Convolutional Network (MFTC-Net) for 3D Medical Image Segmentation and Visualization

Medical Transformer: Gated Axial-Attention for Medical Image Segmentation

SegStitch: Multidimensional Transformer for Robust and Efficient Medical Imaging Segmentation

TransFusion: Multi-view Divergent Fusion for Medical Image Segmentation with Transformers

H2Former: An Efficient Hierarchical Hybrid Transformer for Medical Image Segmentation

A Hybrid Enhanced Attention Transformer Network for Medical Ultrasound Image Segmentation

SMAFormer: Synergistic Multi-Attention Transformer for Medical Image Segmentation

DAMAF: dual attention network with multi-level adaptive complementary fusion for medical image segmentation

STA-Former: enhancing medical image segmentation with Shrinkage Triplet Attention in a hybrid CNN-Transformer model

Dual-attention transformer-based hybrid network for multi-modal medical image segmentation

TPAFNet: Transformer-Driven Pyramid Attention Fusion Network for 3D Medical Image Segmentation

DuAT: Dual-Aggregation Transformer Network for Medical Image Segmentation