Multi-dimension Transformer with Attention-based Filtering for Medical Image Segmentation

Wentao Wang, Xi Xiao, Mingjie Liu, Qing Tian, Xuanyao Huang, Qizhen Lan, Swalpa Kumar Roy, Tianyang Wang
2024-05-21
Abstract: The accurate segmentation of medical images is crucial for diagnosing and treating diseases. Recent studies demonstrate that vision transformer-based methods have significantly improved performance in medical image segmentation, primarily due to their superior ability to establish global relationships among features and adaptability to various inputs. However, these methods struggle with the low signal-to-noise ratio inherent to medical images. Additionally, the effective utilization of channel and spatial information, which are essential for medical image segmentation, is limited by the representation capacity of self-attention. To address these challenges, we propose a multi-dimension transformer with attention-based filtering (MDT-AF), which redesigns the patch embedding and self-attention mechanism for medical image segmentation. MDT-AF incorporates an attention-based feature filtering mechanism into the patch embedding blocks and employs a coarse-to-fine process to mitigate the impact of low signal-to-noise ratio. To better capture complex structures in medical images, MDT-AF extends the self-attention mechanism to incorporate spatial and channel dimensions, enriching feature representation. Moreover, we introduce an interaction mechanism to improve the feature aggregation between spatial and channel dimensions. Experimental results on three public medical image segmentation benchmarks show that MDT-AF achieves state-of-the-art (SOTA) performance.
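To make the attention-based feature filtering idea concrete, here is a minimal sketch in plain Python. It assumes the filtering reduces to an element-wise product of a coarse feature map with attention weights of the same shape. The sigmoid gate, the function names `sigmoid` and `attention_filter`, and the toy values are illustrative assumptions, not the authors' implementation.

```python
import math


def sigmoid(x):
    """Squash a real-valued score into (0, 1) so it can act as a soft gate."""
    return 1.0 / (1.0 + math.exp(-x))


def attention_filter(coarse, scores):
    """Return A * F_1 element-wise, with A = sigmoid(scores).

    `coarse` stands in for the coarse features from patch embedding and
    `scores` for raw outputs of a parallel attention branch. The sigmoid
    gate is an assumption made for this sketch.
    """
    same_rows = len(coarse) == len(scores)
    if not same_rows or any(len(f) != len(s) for f, s in zip(coarse, scores)):
        raise ValueError("coarse features and scores must have the same shape")
    return [
        [sigmoid(s) * f for f, s in zip(f_row, s_row)]
        for f_row, s_row in zip(coarse, scores)
    ]


# Toy 2x3 coarse feature map (hypothetical values).
F1 = [[1.0, -2.0, 0.5],
      [4.0, 0.0, 3.0]]

# Hypothetical attention scores: large positive values keep a feature,
# large negative values suppress it, and zero halves it.
scores = [[10.0, -10.0, 0.0],
          [0.0, 5.0, -10.0]]

X_out = attention_filter(F1, scores)
```

In this toy setup, features paired with strongly negative scores are driven toward zero while the others pass through mostly unchanged, which is the intuition behind using attention weights to suppress noisy responses.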
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses several key problems in medical image segmentation:

1. **Low signal-to-noise ratio (SNR)**: Medical images usually have a low signal-to-noise ratio, which makes feature learning and discrimination difficult. Existing Transformer-based methods handle this low SNR poorly, which degrades segmentation performance.
2. **Effective utilization of spatial and channel information**: In medical image segmentation, spatial and channel information is crucial for accurately capturing complex structures. However, existing self-attention mechanisms are limited in how well they represent and aggregate this multi-dimensional information, especially for medical images of different shapes and scenes.
3. **Global relationship modeling**: Although Transformer-based methods establish global relationships between features well and adapt to various inputs, they still face challenges with long-distance dependencies, especially in medical images where the shapes and sizes of target regions may vary greatly.

To address these problems, the authors propose a multi-dimension Transformer model called MDT-AF (Multi-dimension Transformer with Attention-based Filtering). The main improvements are:

- **Patch embedding with attention-based filtering**: MDT-AF redesigns the patch embedding module and introduces a parallel attention-based filtering branch to refine coarse features and reduce noise. This mechanism filters the initial features by generating attention weights, improving feature quality and letting the model focus more on relevant signals.
- **Self-attention extended to spatial and channel dimensions**: MDT-AF extends the self-attention mechanism to the spatial and channel dimensions to better capture complex structures. Performing feature interaction and aggregation within the block enriches the feature representation and improves the modeling of medical images of different shapes and scenes.
- **Feature interaction mechanism**: An interaction mechanism dynamically re-weights the features from the spatial or channel dimension so that the features of the two branches are fused more effectively.

Through these improvements, MDT-AF achieves state-of-the-art performance on three publicly available medical image segmentation benchmark datasets, demonstrating its advantages in dealing with low signal-to-noise ratios and complex structures.

### Formula summary

- **Attention-based filtering**:
  \[ X_{out}(i)=A\odot F_1 \]
  where \(A\) is the attention weight, \(F_1\) is the coarse feature obtained from overlapping patch embedding, and \(\odot\) denotes element-wise multiplication.
- **Multi-scale feature extraction**:
  \[ V_{mpmv}=\text{Concat}(V_{l1}, V_{l2}, V_{l3}) \]
  where \(V_{l1}\), \(V_{l2}\), \(V_{l3}\) are the feature maps obtained by convolutions with different dilation rates.
- **Efficient self-attention (ESA)**:
  \[ Y_E = \text{ESA}(X_{in})+X_{in} \]
- **Spatial self-attention (SSA)**:
  \[ Y_S=(I_c(Y_{Sp}, Y_{local})+I_s(Y_{local}, Y_{Sp}))W_{merge}+X_{in} \]
- **Channel self-attention (CSA)**:
  \[ Y_C=(I_s(Y_{Ch}, Y_{local})+I_c(Y_{local}, Y_{Ch}))W_