Abstract: Purpose: Deep learning-based networks have become increasingly popular in the field of medical image segmentation. The purpose of this research was to develop and optimize a new architecture for automatic segmentation of the prostate gland and normal organs in the pelvic, thoracic, and upper gastro-intestinal (GI) regions. Methods: We developed an architecture which combines a shifted-window (Swin) transformer with a convolutional U-Net. The network includes a parallel encoder, a cross-fusion block, and a CNN-based decoder to extract local and global information and merge related features on the same scale. A skip connection is applied between the cross-fusion block and the decoder to integrate low-level semantic features. Attention gates (AGs) are integrated within the CNN to suppress features in image background regions. Our network is termed "SwinAttUNet." We optimized the architecture for automatic image segmentation. Training datasets consisted of planning-CT scans from 300 prostate cancer patients in an institutional database and 100 CT scans from a publicly available dataset (CT-ORG). Images were linearly interpolated and resampled to a spatial resolution of 1.0 × 1.0 × 1.5 mm³. A 192 × 192 × 96 volume patch was used for training and inference, and the dataset was split into training (75%), validation (10%), and test (15%) cohorts. Data augmentation consisted of random flips, rotations, and intensity scaling. The loss function was an equally weighted sum of Dice and cross-entropy losses. We evaluated Dice coefficients (DSC), 95th percentile Hausdorff Distances (HD95), and Average Surface Distances (ASD) between the results of our network and ground truth data. Results: For SwinAttUNet, DSC values were 86.54 ± 1.21, 94.15 ± 1.17, and 87.15 ± 1.68% and HD95 values were 5.06 ± 1.42, 3.16 ± 0.93, and 5.54 ± 1.63 mm for the prostate, bladder, and rectum, respectively. Respective ASD values were 1.45 ± 0.57, 0.82 ± 0.12, and 1.42 ± 0.38 mm. For the lung, liver, kidneys, and pelvic bones, respective DSC values were 97.90 ± 0.80, 96.16 ± 0.76, 93.74 ± 2.25, and 89.31 ± 3.87%. Respective HD95 values were 5.13 ± 4.11, 2.73 ± 1.19, 2.29 ± 1.47, and 5.31 ± 1.25 mm. Respective ASD values were 1.88 ± 1.45, 1.78 ± 1.21, 0.71 ± 0.43, and 1.21 ± 1.11 mm. Our network outperformed several existing deep learning approaches that use only attention-based convolutional or Transformer-based feature strategies, as detailed in the results section. Conclusions: We have demonstrated that our new architecture combining Transformer- and convolution-based features is able to better learn the local and global context for automatic segmentation of multi-organ, CT-based anatomy.
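
The abstract describes the loss as Dice and cross-entropy, equally weighted and summed, and reports the Dice coefficient (DSC) as an evaluation metric. The following is a minimal NumPy sketch of one common way to compute both, not the authors' implementation. The soft-Dice formulation, the smoothing constant `eps`, averaging Dice over classes, the channel-first array layout, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def softmax(logits, axis=0):
    # Numerically stable softmax over the class axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def one_hot(labels, num_classes):
    # Integer label volume (D, H, W) -> one-hot volume (C, D, H, W).
    return np.stack([labels == c for c in range(num_classes)]).astype(np.float64)


def soft_dice_loss(probs, target, eps=1e-6):
    # Soft Dice per class, averaged over classes (assumed reduction).
    axes = tuple(range(1, probs.ndim))
    intersection = (probs * target).sum(axis=axes)
    denom = probs.sum(axis=axes) + target.sum(axis=axes)
    dice = (2.0 * intersection + eps) / (denom + eps)
    return float(1.0 - dice.mean())


def cross_entropy_loss(probs, target, eps=1e-12):
    # Mean voxel-wise cross-entropy; probabilities are clipped to avoid log(0).
    log_probs = np.log(np.clip(probs, eps, 1.0))
    return float(-(target * log_probs).sum(axis=0).mean())


def dice_ce_loss(logits, labels):
    # Equally weighted sum of the two terms, as stated in the abstract.
    probs = softmax(logits, axis=0)
    target = one_hot(labels, logits.shape[0])
    return soft_dice_loss(probs, target) + cross_entropy_loss(probs, target)


def dice_coefficient(pred_mask, true_mask):
    # Hard DSC between two binary masks; two empty masks are treated
    # as a perfect match (a convention assumed here).
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    total = pred.sum() + true.sum()
    if total == 0:
        return 1.0
    return float(2.0 * np.logical_and(pred, true).sum() / total)
```

In this sketch `dice_ce_loss` takes raw network outputs of shape (C, D, H, W) and an integer label volume of shape (D, H, W). `dice_coefficient` returns a fraction in [0, 1], whereas the abstract reports DSC as a percentage.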

Automatic size and pose homogenization with spatial transformer network to improve and accelerate pediatric segmentation

HCT-Unet: multi-target medical image segmentation via a hybrid CNN-transformer Unet incorporating multi-axis gated multi-layer perceptron

CiT-Net: Convolutional Neural Networks Hand in Hand with Vision Transformers for Medical Image Segmentation

Hybrid CNN-transformer Network for Interactive Learning of Challenging Musculoskeletal Images

MSCT-UNET: multi-scale contrastive transformer within U-shaped network for medical image segmentation

HCT-net: hybrid CNN-transformer model based on a neural architecture search network for medical image segmentation

TSCA-Net: Transformer based spatial-channel attention segmentation network for medical images

A new architecture combining convolutional and transformer-based networks for automatic 3D multi-organ segmentation on CT images

Mixup Augmentation for Kidney and Kidney Tumor Segmentation

TEC-Net: Vision Transformer Embrace Convolutional Neural Networks for Medical Image Segmentation

DeSTNet: Densely Fused Spatial Transformer Networks

S3TU-Net: Structured Convolution and Superpixel Transformer for Lung Nodule Segmentation

D-TrAttUnet: Toward Hybrid CNN-Transformer Architecture for Generic and Subtle Segmentation in Medical Images

STC-UNet: renal tumor segmentation based on enhanced feature extraction at different network levels

Automatic Segmentation of Kidney Tumor Based on Cascaded Multiscale Convolutional Neural Networks

Multi-Scale Supervised 3D U-Net for Kidneys and Kidney Tumor Segmentation

STA-Unet: Rethink the semantic redundant for Medical Imaging Segmentation

TTT-Unet: Enhancing U-Net with Test-Time Training Layers for Biomedical Image Segmentation

Vision Transformers increase efficiency of 3D cardiac CT multi-label segmentation

Kid-Net: Convolution Networks for Kidney Vessels Segmentation from CT-Volumes

ITUnet: Integration Of Transformers And Unet For Organs-At-Risk Segmentation