Abstract: Convolutional neural networks (CNNs) are widely used for medical image segmentation, but they are limited in capturing long-range semantic interactions. Conversely, Transformers excel at modeling long-range dependencies but struggle to preserve local semantic details. To address this challenge, we propose STA-Former, a hybrid CNN-Transformer model for medical image segmentation. Our approach rests on three components: (1) We propose the Shrinkage Triplet Attention (STA) module to enhance feature fusion within the decoder. It models spatial and channel interactions in the feature map, computes thresholds across dimensions, and suppresses irrelevant information through soft-thresholding. (2) We present a redesigned hierarchical hybrid CNN-Transformer encoder that connects CNN and Transformer blocks at multiple scales, enabling it to capture both long-range and short-range dependencies in feature maps of different scales. (3) Unlike traditional decoders that apply the attention mechanism exclusively to low-level features, our approach uses a multiscale attention hierarchical decoder that leverages feature map correlations at different scales for effective feature fusion. Our method outperforms state-of-the-art methods on three datasets: Synapse multiorgan CT, ACDC cardiac MRI scans, and a breast ultrasound image dataset.
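
The soft-thresholding operation named in component (1) can be illustrated with a minimal sketch. Soft-thresholding maps a value x to sign(x) · max(|x| − τ, 0), so small-magnitude responses are set to zero and larger ones are shrunk toward zero. The abstract does not state how STA computes its thresholds; the rule used below (a fraction of the channel's mean absolute value) and the names `soft_threshold`, `shrink_channel`, and `scale` are assumptions for illustration only, not the authors' implementation.

```python
def soft_threshold(x: float, tau: float) -> float:
    """Shrink x toward zero by tau; values with |x| <= tau become 0."""
    if tau < 0:
        raise ValueError("tau must be non-negative")
    magnitude = abs(x) - tau
    if magnitude <= 0:
        return 0.0
    return magnitude if x > 0 else -magnitude


def shrink_channel(values: list[float], scale: float = 0.5) -> list[float]:
    """Soft-threshold one channel of activations.

    Assumption (not from the paper): the threshold is `scale` times the
    mean absolute value of the channel.
    """
    if not values:
        raise ValueError("values must be non-empty")
    tau = scale * sum(abs(v) for v in values) / len(values)
    return [soft_threshold(v, tau) for v in values]


if __name__ == "__main__":
    # Mean |value| is 2.0, so tau = 1.0 with the default scale of 0.5.
    print(shrink_channel([4.0, -2.0, 0.5, -1.5]))
```

In this toy channel the weak response 0.5 is removed entirely while the stronger responses keep their sign and lose a fixed amount of magnitude, which is the sense in which soft-thresholding suppresses low-magnitude, likely irrelevant, information.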

STA-Former: enhancing medical image segmentation with Shrinkage Triplet Attention in a hybrid CNN-Transformer model