U-Netmer: U-Net meets Transformer for medical image segmentation

Sheng He, Rina Bao, P. Ellen Grant, Yangming Ou
2023-04-04
Abstract: The combination of U-Net based deep learning models and Transformer is a new trend in medical image segmentation. U-Net can extract detailed local semantic and texture information, and Transformer can learn the long-range dependencies among pixels in the input image. However, directly adapting the Transformer for segmentation suffers from the "token-flatten" problem (it flattens the local patches into 1D tokens, which loses the interaction among pixels within local patches) and the "scale-sensitivity" problem (it uses a fixed scale to split the input image into local patches). Rather than directly combining U-Net and Transformer, we propose a new global-local combination of U-Net and Transformer, named U-Netmer, to solve the two problems. The proposed U-Netmer splits an input image into local patches. The global-context information among local patches is learnt by the self-attention mechanism in Transformer, and U-Net segments each local patch instead of flattening it into tokens, which solves the "token-flatten" problem. The U-Netmer can segment the input image with different patch sizes using an identical structure and the same parameters. Thus, the U-Netmer can be trained with different patch sizes to solve the "scale-sensitivity" problem. We conduct extensive experiments on 7 public datasets covering 7 organs (brain, heart, breast, lung, polyp, pancreas and prostate) and 4 imaging modalities (MRI, CT, ultrasound, and endoscopy) to show that the proposed U-Netmer can be generally applied to improve the accuracy of medical image segmentation. These experimental results show that U-Netmer provides state-of-the-art performance compared to baselines and other models. In addition, the discrepancy among the outputs of U-Netmer at different scales is linearly correlated with the segmentation accuracy, so it can be used as a confidence score to rank test images by difficulty without ground truth.
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two problems that arise when combining U-Net and Transformer for medical image segmentation: the "token-flatten" problem and the "scale-sensitivity" problem. Specifically:

1. **"Token-flatten" problem**: When a Vision Transformer processes local patches, it flattens them into one-dimensional tokens, which loses the interaction between pixels inside each patch. This helps capture global dependencies but does not preserve local detail.
2. **"Scale-sensitivity" problem**: A Vision Transformer usually divides the input image into local patches at a single fixed scale, which makes segmentation performance very sensitive to the chosen patch size. Different patch sizes can give different segmentation results, and existing methods are often optimized for only one of them.

To solve these problems, the paper proposes a new model, U-Netmer, which addresses them as follows:

- **Solving the "token-flatten" problem**: U-Netmer uses a standard segmentation network (such as U-Net) to segment each local patch instead of flattening it into a one-dimensional token. The interaction between local pixels is retained, while the Transformer learns global context information among patches to improve the segmentation of each local patch.
- **Solving the "scale-sensitivity" problem**: U-Netmer can be trained on different patch sizes with the same network structure and parameters. Through multi-scale training, it learns context information at several scales, which improves the robustness and accuracy of segmentation.
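
The patch-level processing described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, `segment_patch` is a placeholder threshold standing in for the per-patch U-Net, the self-attention across patches is omitted entirely, and the image size is assumed to be divisible by the patch size. What it does show is that the same pipeline can be run unchanged at several patch sizes.

```python
import numpy as np


def split_into_patches(image, patch_size):
    """Split an (H, W) image into non-overlapping square patches.

    Returns an array of shape (rows, cols, patch_size, patch_size).
    Assumption of this sketch: H and W are divisible by patch_size.
    """
    h, w = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (H, W) -> (rows, p, cols, p) -> (rows, cols, p, p)
    return image.reshape(rows, patch_size, cols, patch_size).swapaxes(1, 2)


def merge_patches(patches):
    """Inverse of split_into_patches: reassemble the full (H, W) array."""
    rows, cols, p, _ = patches.shape
    return patches.swapaxes(1, 2).reshape(rows * p, cols * p)


def segment_patch(patch):
    """Hypothetical stand-in for the per-patch segmentation network.

    A real model would be a U-Net that also receives global context;
    here each patch is simply thresholded at its own mean.
    """
    return (patch > patch.mean()).astype(np.uint8)


def segment_at_scale(image, patch_size):
    """Segment every 2D patch independently, then stitch the results."""
    patches = split_into_patches(image, patch_size)
    out = np.empty(patches.shape, dtype=np.uint8)
    for r in range(patches.shape[0]):
        for c in range(patches.shape[1]):
            out[r, c] = segment_patch(patches[r, c])
    return merge_patches(out)


if __name__ == "__main__":
    image = np.random.default_rng(0).random((64, 64))
    # The same code path handles every patch size; only the split changes.
    masks = {p: segment_at_scale(image, p) for p in (8, 16, 32)}
    for p, mask in masks.items():
        print(p, mask.shape)
```

Because each patch stays a 2D array from split to merge, nothing is flattened into a 1D token, which is the property the per-patch segmentation relies on.
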
The paper verifies the effectiveness of U-Netmer through extensive experiments on 7 public datasets and reports state-of-the-art performance across multiple organs and imaging modalities. In addition, U-Netmer can output segmentation maps at different scales. The discrepancy among these outputs is linearly correlated with segmentation accuracy and can be used as a confidence score to rank test images by difficulty without ground truth.
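
One way to turn multi-scale outputs into a confidence score is sketched below. The text here does not specify how the discrepancy is measured, so this example assumes a simple choice: one minus the mean pairwise Dice overlap between the binary masks produced at different patch sizes. The function names are hypothetical and the measure may differ from the one used in the paper.

```python
from itertools import combinations

import numpy as np


def dice(a, b):
    """Dice overlap between two binary masks (1.0 when both are empty)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / total


def multiscale_discrepancy(masks):
    """Assumed discrepancy measure: 1 - mean pairwise Dice across scales.

    0.0 means all scales agree exactly; larger values mean more disagreement.
    """
    pairs = list(combinations(masks, 2))
    if not pairs:
        raise ValueError("need at least two masks")
    return 1.0 - float(np.mean([dice(a, b) for a, b in pairs]))


def rank_by_difficulty(per_image_masks):
    """Return image indices sorted from most to least disagreement."""
    scores = [multiscale_discrepancy(m) for m in per_image_masks]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    top = np.zeros((4, 4), dtype=np.uint8)
    top[:2] = 1
    bottom = np.zeros((4, 4), dtype=np.uint8)
    bottom[2:] = 1
    # Image 0: both scales agree. Image 1: the scales disagree completely.
    print(rank_by_difficulty([[top, top.copy()], [top, bottom]]))
```

No ground-truth mask appears anywhere in this computation, which is what allows such a score to be evaluated on unlabeled test images.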