Abstract: The combination of U-Net-based deep learning models and the Transformer is a new trend in medical image segmentation. U-Net can extract detailed local semantic and texture information, and the Transformer can learn long-range dependencies among pixels in the input image. However, directly adapting the Transformer for segmentation suffers from the "token-flatten" problem (local patches are flattened into 1D tokens, which loses the interaction among pixels within each patch) and the "scale-sensitivity" problem (a fixed scale is used to split the input image into local patches). Rather than directly combining U-Net and the Transformer, we propose a new global-local combination of the two, named U-Netmer, to solve both problems. The proposed U-Netmer splits an input image into local patches. The global context among local patches is learnt by the self-attention mechanism of the Transformer, and U-Net segments each local patch instead of flattening it into tokens, which solves the "token-flatten" problem. U-Netmer can segment the input image at different patch sizes with an identical structure and the same parameters. It can therefore be trained with different patch sizes to solve the "scale-sensitivity" problem. We conduct extensive experiments on 7 public datasets covering 7 organs (brain, heart, breast, lung, polyp, pancreas and prostate) and 4 imaging modalities (MRI, CT, ultrasound, and endoscopy) to show that the proposed U-Netmer can be generally applied to improve the accuracy of medical image segmentation. The experimental results show that U-Netmer provides state-of-the-art performance compared to baselines and other models. In addition, the discrepancy among the outputs of U-Netmer at different scales is linearly correlated with segmentation accuracy, so it can serve as a confidence score to rank test images by difficulty without ground truth.
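The difference between keeping a patch as a 2D block and flattening it into a token can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `split_into_patches` and `merge_patches` are hypothetical, the image is a toy array, and no U-Net or Transformer is involved. It only shows that the same image can be split at several patch sizes, and that the 2D patches keep the pixel neighbourhood that a flattened token discards.

```python
import numpy as np


def split_into_patches(image, patch_size):
    """Split an (H, W) image into non-overlapping 2D patches.

    Returns an array of shape (num_patches, patch_size, patch_size),
    ordered row by row. Hypothetical helper for illustration only.
    """
    h, w = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (H, W) -> (rows, p, cols, p) -> (rows, cols, p, p)
    blocks = image.reshape(rows, patch_size, cols, patch_size).swapaxes(1, 2)
    return blocks.reshape(rows * cols, patch_size, patch_size)


def merge_patches(patches, image_shape):
    """Inverse of split_into_patches: reassemble patches into an (H, W) image."""
    h, w = image_shape
    p = patches.shape[1]
    rows, cols = h // p, w // p
    return patches.reshape(rows, cols, p, p).swapaxes(1, 2).reshape(h, w)


image = np.arange(64, dtype=np.float32).reshape(8, 8)  # toy 8x8 "image"

for patch_size in (2, 4):  # the same image split at two scales
    patches = split_into_patches(image, patch_size)
    # ViT-style flattening: each patch becomes a 1D token and the
    # 2D layout of its pixels is no longer explicit.
    tokens = patches.reshape(len(patches), -1)
    restored = merge_patches(patches, image.shape)
    print(patch_size, patches.shape, tokens.shape, np.array_equal(restored, image))
```

In a model of the kind the abstract describes, a segmentation network would consume the 2D `patches` rather than the flattened `tokens`; how the patches are embedded for self-attention is not specified here and is left out of the sketch.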

What problem does this paper attempt to address?

This paper addresses two problems that arise when U-Net and the Transformer are combined for medical image segmentation: the "token-flatten" problem and the "scale-sensitivity" problem.

1. **"Token-flatten" problem**: When a Vision Transformer processes local patches, it flattens them into one-dimensional tokens, so the interaction between pixels inside each patch is lost. Flattening helps capture global dependencies, but it does not preserve local detail.
2. **"Scale-sensitivity" problem**: A Vision Transformer usually divides the input image into local patches at a fixed scale, which makes segmentation performance very sensitive to that scale. Different patch sizes can lead to different segmentation results, and existing methods are often optimized for only a single scale.

To solve these problems, the paper proposes a new model, U-Netmer, which works as follows:

- **Solving the "token-flatten" problem**: U-Netmer uses a standard segmentation network (such as U-Net) to segment local patches instead of flattening them into one-dimensional tokens. The interaction between local pixels is retained, while the Transformer learns global context information that improves the segmentation of each local patch.
- **Solving the "scale-sensitivity" problem**: U-Netmer can be trained on different patch sizes with the same network structure and parameters. Through multi-scale training, it learns context information at several scales, which improves the robustness and accuracy of segmentation.

The paper verifies the effectiveness of U-Netmer through extensive experiments on 7 public datasets and reports superior performance across multiple organs and imaging modalities. In addition, U-Netmer can output segmentation maps at different scales. The discrepancy between these outputs is linearly related to segmentation accuracy and can be used as a confidence score to estimate the difficulty of test images.
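The idea of using disagreement between scales as a confidence score can be sketched as follows. The summary does not say how the discrepancy is measured, so this example assumes one plausible choice: the mean of `1 - Dice` over all pairs of binary masks predicted at different patch sizes. The function names `dice` and `scale_discrepancy` and the toy masks are hypothetical and are not taken from the paper.

```python
import numpy as np


def dice(a, b, eps=1e-7):
    """Dice overlap between two binary masks (1.0 means identical)."""
    a = a.astype(bool)
    b = b.astype(bool)
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)


def scale_discrepancy(masks):
    """Mean (1 - Dice) over all pairs of masks from different scales.

    Assumed measure for illustration; higher means the scales disagree
    more, which is read here as a harder (lower-confidence) image.
    """
    pairs = [(i, j) for i in range(len(masks)) for j in range(i + 1, len(masks))]
    if not pairs:
        raise ValueError("need at least two masks to measure discrepancy")
    return float(np.mean([1.0 - dice(masks[i], masks[j]) for i, j in pairs]))


def square_mask(r0, r1, c0, c1, size=8):
    """Toy binary mask with a filled rectangle (stand-in for a model output)."""
    m = np.zeros((size, size), dtype=bool)
    m[r0:r1, c0:c1] = True
    return m


# Stand-ins for per-scale predictions on two test images (no ground truth used).
predictions = {
    "easy_image": [square_mask(2, 6, 2, 6)] * 3,           # all scales agree
    "hard_image": [square_mask(2, 6, 2, 6),
                   square_mask(2, 6, 2, 4),
                   square_mask(3, 7, 3, 7)],                # scales disagree
}

scores = {name: scale_discrepancy(masks) for name, masks in predictions.items()}
# Rank test images from most to least confident (lowest discrepancy first).
for name in sorted(scores, key=scores.get):
    print(name, round(scores[name], 3))
```

Because the score uses only the model's own outputs, it can be computed at test time when no annotation is available; whether it tracks accuracy as closely as the paper reports depends on the model and on the discrepancy measure actually used.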

U-Netmer: U-Net meets Transformer for medical image segmentation

Mixed Transformer U-Net for Medical Image Segmentation

TF-Unet: An Automatic Cardiac MRI Image Segmentation Method

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

3D TransUNet: Advancing Medical Image Segmentation through Vision Transformers

HmsU-Net: A hybrid multi-scale U-net based on a CNN and transformer for medical image segmentation

Multi-scale Neighborhood Attention Transformer on U-Net for Medical Image Segmentation.

U-Net Transformer: Self and Cross Attention for Medical Image Segmentation

TransAttUnet: Multi-level Attention-guided U-Net with Transformer for Medical Image Segmentation.

UNETR: Transformers for 3D Medical Image Segmentation

MSCT-UNET: multi-scale contrastive transformer within U-shaped network for medical image segmentation

Sfe-Transunet: A Transformer-Based U-Net With Skipped Features Enhancer For Medical Image Segmentation

A novel full-convolution UNet-transformer for medical image segmentation

TSCA-Net: Transformer based spatial-channel attention segmentation network for medical images

TransUNet: Rethinking the U-Net architecture design for medical image segmentation through the lens of transformers

FCTrans UNet: A Hybrid CNN and Transformer Model for Medical Image Segmentations

TransU²-Net: An Effective Medical Image Segmentation Framework Based on Transformer and U²-Net

Multiscale Transunet++: Dense Hybrid U-Net with Transformer for Medical Image Segmentation

DA-TransUNet: Integrating Spatial and Channel Dual Attention with Transformer U-Net for Medical Image Segmentation

FTUNet: A Feature-Enhanced Network for Medical Image Segmentation Based on the Combination of U-Shaped Network and Vision Transformer

UCTNet: Uncertainty-guided CNN-Transformer hybrid networks for medical image segmentation