Abstract: Most recent 3D medical image segmentation methods adopt convolutional neural networks (CNNs), which rely on deep feature representations and achieve adequate performance. However, because convolutional architectures have limited receptive fields, they cannot explicitly model long-range dependencies in medical images. Transformers, by contrast, capture global dependencies through self-attention and learn highly expressive representations. Some methods have been built on Transformers, but existing Transformers incur extreme computational and memory costs and cannot take full advantage of their powerful feature representations in 3D medical image segmentation. In this paper, we connect streams of different resolutions in parallel and propose a novel network, the Transformer-based High Resolution Network (TransHRNet), with an Effective Transformer (EffTrans) block that retains sufficient feature representation even at high feature resolutions. Given a 3D image, the encoder first uses a CNN to extract feature representations that capture local information. The resulting feature maps are then reshaped into tokens and fed into the Transformer streams in parallel to learn global information, with information repeatedly exchanged across streams. Because a framework built on the standard Transformer requires a huge amount of computation, we introduce a deep and effective Transformer that delivers better performance with fewer parameters. The proposed TransHRNet is evaluated on the Multi-Atlas Labeling Beyond the Cranial Vault (BCV) dataset, which covers 11 major human organs, and on the Medical Segmentation Decathlon (MSD) dataset for the brain tumor and spleen segmentation tasks. Experimental results show that it outperforms convolutional and other related Transformer-based methods on the 3D multi-organ segmentation tasks.
Code is available at https://github.com/duweidai/TransHRNet.
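
The tokenization step described in the abstract, where CNN feature maps at different resolutions are reshaped into tokens for parallel Transformer streams, can be illustrated with a minimal sketch. This is not the authors' EffTrans block or their released code: the function names, the toy feature-map sizes, and the use of plain single-head attention with identity projections are all assumptions made for illustration. The sketch only shows why attention cost grows quickly with resolution, since the attention matrix has one entry per pair of tokens.

```python
import math


def feature_map_to_tokens(fmap):
    """Flatten a nested-list feature map of shape (C, D, H, W) into
    D*H*W tokens, each a list of C channel values."""
    channels = len(fmap)
    depth, height, width = len(fmap[0]), len(fmap[0][0]), len(fmap[0][0][0])
    return [
        [fmap[c][z][y][x] for c in range(channels)]
        for z in range(depth)
        for y in range(height)
        for x in range(width)
    ]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.
    Assumption: identity query/key/value projections (no learned weights)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, tokens))
             for j in range(dim)]
        )
    return outputs


def make_map(c, d, h, w):
    """Deterministic toy feature map of shape (C, D, H, W); hypothetical data."""
    return [
        [[[0.01 * (ci + z + y + x) for x in range(w)] for y in range(h)]
         for z in range(d)]
        for ci in range(c)
    ]


# Two hypothetical streams: a high-resolution one with few channels and a
# low-resolution one with more channels.
high = feature_map_to_tokens(make_map(4, 4, 4, 4))
low = feature_map_to_tokens(make_map(8, 2, 2, 2))
high_out = self_attention(high)
low_out = self_attention(low)

for name, toks in (("high", high), ("low", low)):
    n = len(toks)
    print(name, "tokens:", n, "dim:", len(toks[0]), "attention entries:", n * n)
```

In this toy setting the high-resolution stream has 8 times as many tokens as the low-resolution one, so its attention matrix has 64 times as many entries, which is the kind of cost the abstract cites as motivation for a more efficient Transformer block.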

MAXFormer: Enhanced Transformer for Medical Image Segmentation with Multi-Attention and Multi-Scale Features Fusion

ScaleFormer: Revisiting the Transformer-based Backbones from a Scale-wise Perspective for Medical Image Segmentation.

MixFormer: a Mixed CNN-Transformer Backbone for Medical Image Segmentation

ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation

ConvFormer: Combining CNN and Transformer for Medical Image Segmentation

STA-Former: enhancing medical image segmentation with Shrinkage Triplet Attention in a hybrid CNN-Transformer model

H2Former: An Efficient Hierarchical Hybrid Transformer for Medical Image Segmentation

MISSFormer: An Effective Medical Image Segmentation Transformer

HD-Former: A hierarchical dependency Transformer for medical image segmentation

Cross Attention Multi Scale CNN-Transformer Hybrid Encoder is General Medical Image Learner.

A Lightweight Multi-Scale Multi-Angle Dynamic Interactive Transformer-CNN Fusion Model for 3D Medical Image Segmentation

SCA-Former: transformer-like network based on stream-cross attention for medical image segmentation

MISSFormer: an Effective Transformer for 2D Medical Image Segmentation

MaxViT-UNet: Multi-Axis Attention for Medical Image Segmentation

3D Medical image segmentation using parallel transformers

MS-TCNet: An effective Transformer–CNN combined network using multi-scale feature learning for 3D medical image segmentation

A Multi-Scale Cross-Fusion Medical Image Segmentation Network Based on Dual-Attention Mechanism Transformer

TSCA-Net: Transformer based spatial-channel attention segmentation network for medical images

SCANeXt: Enhancing 3D Medical Image Segmentation with Dual Attention Network and Depth-Wise Convolution

Hybrid-scale Contextual Fusion Network for Medical Image Segmentation

MSCT-UNET: multi-scale contrastive transformer within U-shaped network for medical image segmentation