H-DenseFormer: An Efficient Hybrid Densely Connected Transformer for Multimodal Tumor Segmentation

Jun Shi, Hongyu Kan, Shulan Ruan, Ziqi Zhu, Minfan Zhao, Liang Qiao, Zhaohui Wang, Hong An, Xudong Xue
2023-07-04
Abstract: Recently, deep learning methods have been widely used for tumor segmentation of multimodal medical images with promising results. However, most existing methods are limited by insufficient representational ability, specific modality number and high computational complexity. In this paper, we propose a hybrid densely connected network for tumor segmentation, named H-DenseFormer, which combines the representational power of the Convolutional Neural Network (CNN) and the Transformer structures. Specifically, H-DenseFormer integrates a Transformer-based Multi-path Parallel Embedding (MPE) module that can take an arbitrary number of modalities as input to extract the fusion features from different modalities. Then, the multimodal fusion features are delivered to different levels of the encoder to enhance multimodal learning representation. Besides, we design a lightweight Densely Connected Transformer (DCT) block to replace the standard Transformer block, thus significantly reducing computational complexity. We conduct extensive experiments on two public multimodal datasets, HECKTOR21 and PI-CAI22. The experimental results show that our proposed method outperforms the existing state-of-the-art methods while having lower computational complexity. The source code is available at https://github.com/shijun18/H-DenseFormer.
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses three limitations of existing methods for tumor segmentation in multimodal medical images:

1. **Insufficient representational ability**: Most existing methods struggle to fully extract and fuse features from different modalities.
2. **Fixed number of modalities**: Many existing methods handle only a specific number of modalities and cannot accept an arbitrary number of modality inputs.
3. **High computational complexity**: Some methods are computationally expensive because of their large number of parameters, which limits their efficiency in practical applications.

To overcome these limitations, the authors propose H-DenseFormer, an efficient hybrid densely connected network that combines the strengths of Convolutional Neural Networks (CNNs) and Transformers to improve both the performance and the computational efficiency of multimodal tumor segmentation. Its main components are:

1. **Multi-path Parallel Embedding (MPE) module**: This module accepts an arbitrary number of modalities as input and extracts and fuses multimodal features, enhancing the model's representational ability.
2. **Lightweight Densely Connected Transformer (DCT) block**: This block replaces the standard Transformer block, significantly reducing computational complexity while maintaining high performance.
3. **U-shaped encoder-decoder structure**: A U-shaped structure serves as the backbone of the segmentation network; multi-scale outputs and a deep supervision loss improve the model's convergence speed and performance.

Experimental results show that H-DenseFormer outperforms existing methods on two public multimodal datasets (HECKTOR21 and PI-CAI22) while having lower computational complexity. These results support the effectiveness of H-DenseFormer for multimodal tumor segmentation.
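
To make the idea of multi-scale outputs with a deep supervision loss more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The soft Dice loss, the 1-D masks, the max-pool downsampling of the target, and the per-level weights are all assumptions chosen for illustration, and the function names `dice_loss`, `downsample`, and `deep_supervision_loss` are hypothetical.

```python
from typing import List, Sequence


def dice_loss(pred: Sequence[float], target: Sequence[int], eps: float = 1e-6) -> float:
    """Soft Dice loss for one flattened binary mask (0.0 means perfect overlap).

    Assumption: Dice is used here only because it is a common segmentation
    loss; the summary above does not state which loss the paper uses.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def downsample(mask: Sequence[int], factor: int) -> List[int]:
    """Max-pool a 1-D mask by `factor` so it matches a coarser output."""
    return [max(mask[i:i + factor]) for i in range(0, len(mask), factor)]


def deep_supervision_loss(
    outputs: Sequence[Sequence[float]],
    target: Sequence[int],
    weights: Sequence[float],
) -> float:
    """Weighted sum of per-scale losses.

    outputs[0] is the full-resolution prediction; each later output is
    assumed to have half the resolution of the previous one.
    """
    total = 0.0
    for level, (pred, weight) in enumerate(zip(outputs, weights)):
        scaled_target = downsample(target, 2 ** level)
        total += weight * dice_loss(pred, scaled_target)
    return total


if __name__ == "__main__":
    target = [0, 0, 1, 1, 1, 1, 0, 0]
    # Hypothetical decoder outputs at full, 1/2, and 1/4 resolution.
    outputs = [
        [0.1, 0.2, 0.9, 0.8, 0.9, 0.7, 0.2, 0.1],
        [0.2, 0.9, 0.8, 0.1],
        [0.7, 0.6],
    ]
    weights = [1.0, 0.5, 0.25]  # assumed weighting, coarser levels count less
    print(deep_supervision_loss(outputs, target, weights))
```

Because every decoder level contributes its own loss term, the coarser levels receive a direct training signal instead of relying only on gradients passed down from the final output. This is the usual motivation for deep supervision and fits the convergence benefit described above.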