Abstract: Because the Transformer captures long-range dependencies through self-attention, it has shown great potential in medical image segmentation. However, it is weak at modeling local relationships between pixels, so many previous approaches embed the Transformer into a CNN encoder. These methods often fall short in modeling the relationships between multi-scale features, specifically the spatial correspondence between features at different scales. This limitation can lead to ineffective capture of the scale differences of each object and to the loss of features for small targets. Furthermore, the high complexity of the Transformer makes it challenging to integrate local and global information effectively within the same scale. To address these limitations, we propose a novel backbone network called CasUNeXt, which features three design elements: (1) We use a cascade design to restructure how the CNN and the Transformer are combined, improving the modeling of the interrelationships among multi-scale information. (2) We design a Cascaded Scale-wise Transformer Module capable of cross-scale interaction; it both strengthens feature extraction within a single scale and models interactions between different scales. (3) We redesign the multi-head Channel Attention mechanism so that it models context information in feature maps from multiple perspectives within the channel dimension. Together, these design features enable CasUNeXt to better integrate local and global information and to capture relationships between multi-scale features, thereby improving medical image segmentation performance. In experimental comparisons on several benchmark datasets, CasUNeXt surpasses current state-of-the-art methods in medical image segmentation.
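To make the idea of multi-head attention over the channel dimension more concrete, the following is a minimal NumPy sketch. It is not the CasUNeXt implementation: the abstract does not specify the module's internals, so the function name `multi_head_channel_attention`, the linear projections `wq`, `wk`, `wv`, the L2 normalisation, and the tensor layout are all assumptions made for illustration. The sketch splits the channels into heads and computes a channel-by-channel attention map inside each head, so each head can weigh channel context differently.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_channel_attention(feat, wq, wk, wv, num_heads):
    """Hypothetical channel attention sketch (not the paper's module).

    feat: feature map of shape (C, H, W)
    wq, wk, wv: (C, C) projection matrices, standing in for 1x1 convolutions
    num_heads: number of channel groups; must divide C
    """
    c, h, w = feat.shape
    assert c % num_heads == 0, "channels must be divisible by num_heads"
    d = c // num_heads
    x = feat.reshape(c, h * w)                      # (C, N), N = H*W

    # Project, then split the channel axis into heads: (heads, d, N).
    q = (wq @ x).reshape(num_heads, d, h * w)
    k = (wk @ x).reshape(num_heads, d, h * w)
    v = (wv @ x).reshape(num_heads, d, h * w)

    # Assumption: normalise along the spatial axis so logits are cosine
    # similarities between channels.
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + 1e-6)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + 1e-6)

    # Attention is computed between channels, not spatial positions,
    # so each head yields a (d, d) map regardless of image size.
    attn = softmax(q @ k.transpose(0, 2, 1), axis=-1)   # (heads, d, d)
    out = attn @ v                                      # (heads, d, N)
    return out.reshape(c, h, w), attn


rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
wq, wk, wv = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))
out, attn = multi_head_channel_attention(feat, wq, wk, wv, num_heads=2)
print(out.shape, attn.shape)
```

Because the attention map in this sketch has size d × d per head rather than N × N, its cost grows linearly with the number of pixels, which is one common motivation for attending over channels instead of spatial positions.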

ScaleNet: Rethinking Feature Interaction from a Scale-Wise Perspective for Medical Image Segmentation.

ScaleFormer: Revisiting the Transformer-based Backbones from a Scale-wise Perspective for Medical Image Segmentation.

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation.

GCFormer: Multi-scale Feature Plays a Crucial Role in Medical Images Segmentation

MS-TCNet: An effective Transformer–CNN combined network using multi-scale feature learning for 3D medical image segmentation

Sub-pixel multi-scale fusion network for medical image segmentation

Scale-wise Discriminative Region Learning for Medical Image Segmentation

MSTCNet: Parallel Multi-Scale Network For Medical Image Segmentation.

A Lightweight Multi-Scale Multi-Angle Dynamic Interactive Transformer-CNN Fusion Model for 3D Medical Image Segmentation

CasUNeXt: A Cascaded Transformer With Intra‐ and Inter‐Scale Information for Medical Image Segmentation

[Multi-scale medical image segmentation based on pixel encoding and spatial attention mechanism]

Transformer Scale Gate for Semantic Segmentation

A Dynamic Cross-Scale Transformer with Dual-Compound Representation for 3D Medical Image Segmentation

HmsU-Net: A hybrid multi-scale U-net based on a CNN and transformer for medical image segmentation

BMCS-Net: A Bi-directional multi-scale cascaded segmentation network based on transformer-guided feature Aggregation for medical images

ECSFF: Exploring Efficient Cross-Scale Feature Fusion for Medical Image Segmentation.

Feature ensemble network for medical image segmentation with multi‐scale atrous transformer

Multi-scale Orthogonal Model CNN-Transformer For Medical Image Segmentation

FI‐Net: Rethinking Feature Interactions for Medical Image Segmentation

A Hybrid Cross-Scale Transformer Architecture for Robust Medical Image Segmentation.

CTC-Net: A Novel Coupled Feature-Enhanced Transformer and Inverted Convolution Network for Medical Image Segmentation