UNesT: Local Spatial Representation Learning with Hierarchical Transformer for Efficient Medical Segmentation

Xin Yu, Qi Yang, Yinchi Zhou, Leon Y. Cai, Riqiang Gao, Ho Hin Lee, Thomas Li, Shunxing Bao, Zhoubing Xu, Thomas A. Lasko, Richard G. Abramson, Zizhao Zhang, Yuankai Huo, Bennett A. Landman, Yucheng Tang
2023-09-08
Abstract: Transformer-based models, capable of learning better global dependencies, have recently demonstrated exceptional representation learning capabilities in computer vision and medical image analysis. A transformer reformats the image into separate patches and realizes global communication via the self-attention mechanism. However, positional information between patches is hard to preserve in such 1D sequences, and losing it can lead to sub-optimal performance when dealing with large numbers of heterogeneous tissues of various sizes in 3D medical image segmentation. Additionally, current methods are neither robust nor efficient for heavy-duty medical segmentation tasks such as predicting a large number of tissue classes or modeling globally inter-connected tissue structures. To address these challenges, and inspired by the nested hierarchical structures in vision transformers, we propose a novel 3D medical image segmentation method (UNesT) that employs a simplified and faster-converging transformer encoder design, achieving local communication among spatially adjacent patch sequences by aggregating them hierarchically. We extensively validate our method on multiple challenging datasets covering multiple modalities, anatomies, and a wide range of tissue classes, including 133 structures in the brain, 14 organs in the abdomen, 4 hierarchical components in the kidneys, inter-connected kidney tumors, and brain tumors. We show that UNesT consistently achieves state-of-the-art performance, and we evaluate its generalizability and data efficiency. In particular, the model performs the whole brain segmentation task over the complete ROI with 133 tissue classes in a single network, outperforming the prior state-of-the-art method SLANT27, an ensemble of 27 networks.
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to segment a large number of heterogeneous tissue structures of different sizes in 3D medical images, especially in heavy-duty segmentation tasks that require predicting many tissue classes or modeling globally interconnected tissue structures. Specifically, the authors point out three shortcomings of existing methods:

1. **Loss of Positional Information**: Existing Transformer-based methods achieve global communication by converting images into 1D sequences of patches, which makes it difficult to retain positional information between patches and degrades performance in 3D medical image segmentation.
2. **Lack of Robustness and Efficiency**: Current methods are not robust or efficient enough when predicting a large number of tissue classes or modeling globally interconnected tissue structures.
3. **Low Data Efficiency**: Because they lack a local inductive bias, existing Transformer methods usually require a large amount of training data, which is expensive and difficult to obtain in the medical field.

To overcome these challenges, the authors propose a new 3D medical image segmentation method (UNesT). It uses a simplified, faster-converging Transformer encoder that achieves local communication among spatially adjacent patch sequences and aggregates them hierarchically, which improves both model performance and data efficiency. UNesT has been validated on multiple challenging datasets, including segmentation of the brain, abdominal organs, and kidney substructures, demonstrating superior performance and generalization ability.
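The idea of local communication followed by hierarchical aggregation can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`blockify`, `unblockify`, `local_self_attention`, `aggregate`, `encode`), the block size, the parameter-free attention, and the use of 2x2x2 mean pooling as the aggregation step are all assumptions chosen for brevity. The sketch groups a 3D grid of patch tokens into non-overlapping cubic blocks, lets tokens attend only within their own block, and then merges neighboring tokens so that the next level sees a coarser grid.

```python
import numpy as np

def blockify(tokens, block):
    """Split a (D, H, W, C) grid of patch tokens into non-overlapping cubes.

    Returns an array of shape (num_blocks, block**3, C), so each row group
    holds only spatially adjacent tokens.
    """
    d, h, w, c = tokens.shape
    assert d % block == 0 and h % block == 0 and w % block == 0
    x = tokens.reshape(d // block, block, h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # block indices first, then in-block offsets
    return x.reshape(-1, block ** 3, c)

def unblockify(blocks, grid_shape, block):
    """Inverse of blockify: restore the (D, H, W, C) token grid."""
    d, h, w = grid_shape
    c = blocks.shape[-1]
    x = blocks.reshape(d // block, h // block, w // block, block, block, block, c)
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)
    return x.reshape(d, h, w, c)

def local_self_attention(blocks):
    """Single-head attention restricted to each block.

    Assumption: queries, keys, and values are the tokens themselves
    (no learned projections), which keeps the sketch parameter-free.
    """
    scale = 1.0 / np.sqrt(blocks.shape[-1])
    scores = blocks @ blocks.transpose(0, 2, 1) * scale  # (num_blocks, n, n)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ blocks

def aggregate(tokens):
    """Merge each 2x2x2 neighborhood by mean pooling (a stand-in aggregation)."""
    d, h, w, c = tokens.shape
    x = tokens.reshape(d // 2, 2, h // 2, 2, w // 2, 2, c)
    return x.mean(axis=(1, 3, 5))

def encode(tokens, block=2, levels=3):
    """Run local attention per level, then aggregate to a coarser grid."""
    features = []
    for level in range(levels):
        grid_shape = tokens.shape[:3]
        blocks = local_self_attention(blockify(tokens, block))
        tokens = unblockify(blocks, grid_shape, block)
        features.append(tokens)
        if level < levels - 1:
            tokens = aggregate(tokens)
    return features

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch_tokens = rng.normal(size=(8, 8, 8, 4))  # hypothetical 8x8x8 grid, 4 channels
    for level, feat in enumerate(encode(patch_tokens)):
        print(level, feat.shape)
```

Because the 3D grid is restored after every attention step, spatial adjacency is preserved between levels rather than being flattened into a single 1D sequence. The multi-scale feature maps returned by `encode` are the kind of output a U-shaped decoder would consume, although the decoder itself is outside the scope of this sketch.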