SegTransVAE: Hybrid CNN -- Transformer with Regularization for medical image segmentation

Quan-Dung Pham, Hai Nguyen-Truong, Nam Nguyen Phuong, Khoa N. A. Nguyen
DOI: https://doi.org/10.1109/ISBI52829.2022.9761417
2023-09-30
Abstract: Current research on deep learning for medical image segmentation exposes their limitations in learning either global semantic information or local contextual information. To tackle these issues, a novel network named SegTransVAE is proposed in this paper. SegTransVAE is built upon an encoder-decoder architecture, exploiting a transformer and adding a variational autoencoder (VAE) branch to the network to reconstruct the input images jointly with segmentation. To the best of our knowledge, this is the first method combining the success of CNN, transformer, and VAE. Evaluation on various recently introduced datasets shows that SegTransVAE outperforms previous methods in Dice Score and 95% Hausdorff Distance while having comparable inference time to a simple CNN-based architecture network. The source code is available at: https://github.com/itruonghai/SegTransVAE
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper proposes a new network architecture, SegTransVAE, to address a limitation of existing deep learning methods for medical image segmentation: they tend to learn either global semantic information or local contextual information well, but not both. SegTransVAE combines the strengths of Convolutional Neural Networks (CNNs), Transformers, and Variational Autoencoders (VAEs).

Specifically, SegTransVAE adopts an encoder-decoder architecture and adds a VAE branch that regularizes the network by reconstructing the input image jointly with the segmentation task. This helps the network avoid overfitting, while the CNN extracts local 3D contextual information and the Transformer models global features, improving the learning of long-range dependencies. The network also uses positional embeddings to retain spatial information and a feature mapping module to convert the Transformer's output back into a standard feature map.

The experiments evaluate SegTransVAE on two datasets, BraTS 2021 (brain tumor MRI) and KiTS19 (kidney tumor CT), and report better Dice score and 95% Hausdorff distance than several existing methods (such as 3D U-Net, UNETR, and SegResNetVAE). SegTransVAE also compares favorably in network complexity (number of parameters and average inference time), especially against UNETR: it has fewer parameters and outperforms UNETR on all evaluation metrics. Overall, SegTransVAE improves segmentation accuracy while keeping complexity low, and it generalizes better, particularly when training data is limited.
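
To make the two evaluation metrics mentioned above concrete, here is a minimal sketch in plain Python. It is not taken from the paper or its repository. It assumes masks are given as collections of foreground voxel coordinates, it measures distances over all foreground voxels rather than only surface voxels, and it uses a nearest-rank percentile and the maximum of the two directed 95th-percentile distances. Published toolkits differ on these conventions, so the function names and choices here (`dice_score`, `hausdorff_95`) are illustrative only.

```python
import math


def dice_score(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for two sets of foreground voxel coordinates."""
    pred, target = set(pred), set(target)
    if not pred and not target:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(pred & target) / (len(pred) + len(target))


def _directed_distances(src, dst):
    """For every point in src, the Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def _percentile(values, q):
    """Nearest-rank percentile (an assumption; some tools interpolate instead)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def hausdorff_95(pred, target):
    """95% Hausdorff distance: the larger of the two directed 95th-percentile distances."""
    pred, target = list(set(pred)), list(set(target))
    if not pred or not target:
        return math.inf  # assumed convention when one mask is empty
    forward = _percentile(_directed_distances(pred, target), 95)
    backward = _percentile(_directed_distances(target, pred), 95)
    return max(forward, backward)


# Toy 3D example: two three-voxel masks that share two voxels.
pred_voxels = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
target_voxels = {(0, 0, 0), (0, 0, 1), (1, 0, 0)}
print("Dice:", dice_score(pred_voxels, target_voxels))
print("HD95:", hausdorff_95(pred_voxels, target_voxels))
```

A higher Dice score means more overlap between prediction and ground truth, while a lower 95% Hausdorff distance means the predicted region stays closer to the true one. Taking the 95th percentile instead of the maximum makes the distance less sensitive to a few outlier voxels. The brute-force nearest-neighbor search is quadratic in the number of voxels, so this sketch is only suitable for small examples.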