3D-EffiViTCaps: 3D Efficient Vision Transformer with Capsule for Medical Image Segmentation

Dongwei Gan, Ming Chang, Juan Chen

2024-03-25

Abstract: Medical image segmentation (MIS) aims to finely segment various organs. It requires grasping global information from both parts and the entire image for better segmentation, and clinical use often imposes requirements on segmentation efficiency. Convolutional neural networks (CNNs) have made considerable achievements in MIS. However, they struggle to fully collect global context information, and their pooling layers may cause information loss. Capsule networks, which combine the benefits of CNNs while taking into account additional information such as relative location that CNNs do not, have lately demonstrated some advantages in MIS. Vision Transformer (ViT) employs transformers in visual tasks. Transformers, based on the attention mechanism, have excellent global inductive modeling capabilities and are expected to capture long-range information. Moreover, there have been recent studies on making ViT more lightweight to minimize model complexity and increase efficiency. In this paper, we propose a U-shaped 3D encoder-decoder network named 3D-EffiViTCaps, which combines 3D capsule blocks with 3D EfficientViT blocks for MIS. Our encoder uses capsule blocks and EfficientViT blocks to jointly capture local and global semantic information more effectively and efficiently with less information loss, while the decoder employs CNN blocks and EfficientViT blocks to capture finer details for segmentation. We conduct experiments on various datasets, including iSeg-2017, Hippocampus and Cardiac, to verify the performance and efficiency of 3D-EffiViTCaps, which performs better than previous 3D CNN-based, 3D Capsule-based and 3D Transformer-based models. We further implement a series of ablation experiments on the main blocks. Our code is available at:

Image and Video Processing, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses two problems in Medical Image Segmentation (MIS): capturing global information effectively and reducing information loss. Convolutional Neural Networks (CNNs) struggle to capture global contextual information, and their pooling layers may discard information. Capsule networks retain the relative positional information that CNNs ignore and have recently shown some advantages in MIS. Vision Transformers (ViT), built on attention mechanisms, excel at modeling long-range dependencies but can be costly, which has motivated recent work on lightweight variants.

The paper proposes a U-shaped 3D encoder-decoder network called 3D-EffiViTCaps, which combines 3D capsule blocks and 3D EfficientViT blocks. The encoder uses capsule blocks and EfficientViT blocks to capture local and global semantic information more effectively, while the decoder uses CNN blocks and EfficientViT blocks to capture finer details for segmentation. In experiments on the iSeg-2017, Hippocampus, and Cardiac datasets, 3D-EffiViTCaps performs better than previous 3D CNN-based, 3D capsule-based, and 3D Transformer-based models. Ablation experiments on the main blocks examine their individual contributions, including the role of the 3D EfficientViT blocks in balancing performance and efficiency. The paper's contribution is to improve MIS segmentation performance by modeling part-whole relationships with 3D capsules and extracting global semantic information with 3D EfficientViT, while keeping model efficiency reasonable.

SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation

MMViT-Seg: A Lightweight Transformer and CNN Fusion Network for COVID-19 Segmentation

Seg-CapNet: A Capsule-Based Neural Network for the Segmentation of Left Ventricle from Cardiac Magnetic Resonance Imaging

Multi-Aperture Fusion of Transformer-Convolutional Network (MFTC-Net) for 3D Medical Image Segmentation and Visualization

VSmTrans: A Hybrid Paradigm Integrating Self-attention and Convolution for 3D Medical Image Segmentation

EPT-Net: Edge Perception Transformer for 3D Medical Image Segmentation

Hybrid CNN-Transformer model for medical image segmentation with pyramid convolution and multi-layer perceptron

D-former: a U-shaped Dilated Transformer for 3D medical image segmentation

EViT-Unet: U-Net Like Efficient Vision Transformer for Medical Image Segmentation on Mobile and Edge Devices

UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation

FDR-TransUNet: A novel encoder-decoder architecture with vision transformer for improved medical image segmentation

MS-TCNet: An effective Transformer–CNN combined network using multi-scale feature learning for 3D medical image segmentation

ViT-UperNet: a hybrid vision transformer with unified-perceptual-parsing network for medical image segmentation

CascadeMedSeg: integrating pyramid vision transformer with multi-scale fusion for precise medical image segmentation

ViTBIS: Vision Transformer for Biomedical Image Segmentation

CiT-Net: Convolutional Neural Networks Hand in Hand with Vision Transformers for Medical Image Segmentation

TranSegNet: Hybrid CNN-Vision Transformers Encoder for Retina Segmentation of Optical Coherence Tomography

ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation

LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation