UNesT: Local Spatial Representation Learning with Hierarchical Transformer for Efficient Medical Segmentation

Xin Yu, Qi Yang, Yinchi Zhou, Leon Y. Cai, Riqiang Gao, Ho Hin Lee, Thomas Li, Shunxing Bao, Zhoubing Xu, Thomas A. Lasko, Richard G. Abramson, Zizhao Zhang, Yuankai Huo, Bennett A. Landman, Yucheng Tang

2023-09-08

Abstract: Transformer-based models, capable of learning better global dependencies, have recently demonstrated exceptional representation learning capabilities in computer vision and medical image analysis. A transformer reformats the image into separate patches and realizes global communication via the self-attention mechanism. However, positional information between patches is hard to preserve in such 1D sequences, and losing it can lead to sub-optimal performance when dealing with large numbers of heterogeneous tissues of various sizes in 3D medical image segmentation. Additionally, current methods are not robust or efficient enough for heavy-duty medical segmentation tasks such as predicting a large number of tissue classes or modeling globally inter-connected tissue structures. To address these challenges, and inspired by the nested hierarchical structures in vision transformers, we propose a novel 3D medical image segmentation method (UNesT) that employs a simplified and faster-converging transformer encoder design, achieving local communication among spatially adjacent patch sequences by aggregating them hierarchically. We extensively validate our method on multiple challenging datasets covering multiple modalities, anatomies, and a wide range of tissue classes, including 133 structures in the brain, 14 organs in the abdomen, 4 hierarchical components in the kidneys, and inter-connected kidney tumors and brain tumors. We show that UNesT consistently achieves state-of-the-art performance, and we evaluate its generalizability and data efficiency. In particular, the model performs the whole brain segmentation task, covering the complete ROI with 133 tissue classes, in a single network, outperforming the prior state-of-the-art method SLANT27, an ensemble of 27 networks.

Image and Video Processing, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses how to segment a large number of heterogeneous tissue structures of different sizes in 3D medical images, especially in heavy-duty segmentation tasks that require predicting many tissue classes or modeling globally inter-connected tissue structures. The authors identify three shortcomings of existing methods:

1. **Loss of Positional Information**: Existing Transformer-based methods achieve global communication by converting images into 1D patch sequences, which makes it difficult to retain positional information between patches and degrades performance in 3D medical image segmentation.
2. **Lack of Robustness and Efficiency**: Current methods are not robust or efficient enough when predicting a large number of tissue classes or modeling globally inter-connected tissue structures.
3. **Low Data Efficiency**: Lacking a local inductive bias, existing Transformer methods usually require large amounts of training data, which is expensive and difficult to obtain in the medical field.

To overcome these challenges, the authors propose a new 3D medical image segmentation method, UNesT. Its simplified, faster-converging Transformer encoder achieves local communication among spatially adjacent patch sequences and aggregates them hierarchically, which improves both model performance and data efficiency. UNesT is validated on multiple challenging datasets, including segmentation of brain structures, abdominal organs, kidney substructures, and kidney and brain tumors, where it demonstrates superior performance and generalization ability.
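The idea of local communication among spatially adjacent patches, followed by hierarchical aggregation, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`blockify`, `unblockify`, `local_self_attention`, `downsample`, `hierarchical_encode`) are hypothetical, the attention has no learned projections, and the aggregation step is assumed to be plain 2x2x2 average pooling purely for illustration.

```python
import numpy as np

def blockify(grid, block):
    """Split a (D, H, W, C) grid of patch embeddings into non-overlapping
    cubic blocks of spatially adjacent patches: (num_blocks, block**3, C)."""
    d, h, w, c = grid.shape
    assert d % block == 0 and h % block == 0 and w % block == 0
    x = grid.reshape(d // block, block, h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # group the three block indices first
    return x.reshape(-1, block ** 3, c)

def unblockify(blocks, grid_shape, block):
    """Inverse of blockify: restore the (D, H, W, C) patch grid."""
    d, h, w, c = grid_shape
    x = blocks.reshape(d // block, h // block, w // block, block, block, block, c)
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)
    return x.reshape(d, h, w, c)

def local_self_attention(blocks):
    """Softmax self-attention applied independently inside each block.
    Assumption: no learned query/key/value projections, for brevity."""
    c = blocks.shape[-1]
    scores = blocks @ blocks.transpose(0, 2, 1) / np.sqrt(c)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ blocks

def downsample(grid):
    """Assumed aggregation step: 2x2x2 average pooling over the patch grid."""
    d, h, w, c = grid.shape
    return grid.reshape(d // 2, 2, h // 2, 2, w // 2, 2, c).mean(axis=(1, 3, 5))

def hierarchical_encode(grid, block=2, levels=3):
    """Alternate block-local attention and aggregation; return the final
    grid and the number of blocks seen at each level."""
    counts = []
    for level in range(levels):
        blocks = blockify(grid, block)
        counts.append(len(blocks))
        grid = unblockify(local_self_attention(blocks), grid.shape, block)
        if level < levels - 1:
            grid = downsample(grid)
    return grid, counts

rng = np.random.default_rng(0)
patch_grid = rng.normal(size=(8, 8, 8, 16))  # toy 8x8x8 grid of 16-dim patches
final_grid, block_counts = hierarchical_encode(patch_grid)
```

In this toy setting each level keeps the block size fixed while the grid shrinks, so attention stays local at the bottom of the hierarchy and covers the whole (downsampled) volume only at the top.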

Mixed Transformer U-Net for Medical Image Segmentation

3D TransUNet: Advancing Medical Image Segmentation through Vision Transformers

UNETR: Transformers for 3D Medical Image Segmentation

ETUNet: Exploring efficient transformer enhanced UNet for 3D brain tumor segmentation

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

H2MaT-Unet: Hierarchical hybrid multi-axis transformer based Unet for medical image segmentation

Multi-scale Neighborhood Attention Transformer on U-Net for Medical Image Segmentation

3D Medical image segmentation using parallel transformers

A novel full-convolution UNet-transformer for medical image segmentation

TransAttUnet: Multi-level Attention-guided U-Net with Transformer for Medical Image Segmentation

nnFormer: Volumetric Medical Image Segmentation via a 3D Transformer

MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet

H2Former: An Efficient Hierarchical Hybrid Transformer for Medical Image Segmentation

A Novel Deep Learning Model for Medical Image Segmentation with Convolutional Neural Network and Transformer

Enhanced Transformer Encoder and Hybrid Cascaded Upsampler for Medical Image Segmentation

TransUNet: Rethinking the U-Net architecture design for medical image segmentation through the lens of transformers

UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation

DA-TransUNet: Integrating Spatial and Channel Dual Attention with Transformer U-Net for Medical Image Segmentation

HCT-Unet: multi-target medical image segmentation via a hybrid CNN-transformer Unet incorporating multi-axis gated multi-layer perceptron