CiT-Net: Convolutional Neural Networks Hand in Hand with Vision Transformers for Medical Image Segmentation

Tao Lei, Rui Sun, Xuan Wang, Yingbo Wang, Xi He, Asoke Nandi
DOI: https://doi.org/10.24963/ijcai.2023/113
2023-12-20
Abstract: The hybrid architecture of convolutional neural networks (CNNs) and Transformers is very popular for medical image segmentation. However, it suffers from two challenges. First, although a CNN branch can capture local image features using vanilla convolution, it cannot achieve adaptive feature learning. Second, although a Transformer branch can capture global features, it ignores channel and cross-dimensional self-attention, resulting in low segmentation accuracy on complex-content images. To address these challenges, we propose a novel hybrid architecture of convolutional neural networks hand in hand with vision Transformers (CiT-Net) for medical image segmentation. Our network has two advantages. First, we design a dynamic deformable convolution and apply it to the CNN branch, which overcomes the weak feature extraction ability caused by fixed-size convolution kernels and the rigid design of sharing kernel parameters among different inputs. Second, we design a shifted-window adaptive complementary attention module and a compact convolutional projection. We apply them to the Transformer branch to learn the cross-dimensional long-term dependency for medical images. Experimental results show that our CiT-Net provides better medical image segmentation results than popular SOTA methods. In addition, our CiT-Net requires fewer parameters and lower computational cost and does not rely on pre-training. The code is publicly available at https://github.com/SR0920/CiT-Net.
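As a rough illustration of what avoiding shared kernel parameters among different inputs can mean, the sketch below mixes a small bank of candidate kernels using coefficients computed from the input itself. It is a simplified 1-D toy in plain Python, not the authors' dynamic deformable convolution: the function names, the gating rule (a softmax over a scaled global average), and the omission of deformation offsets are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_kernel(signal, candidate_kernels, gate_weights):
    """Blend candidate kernels with input-dependent coefficients.

    Hypothetical gating rule: one scalar gate weight per candidate kernel,
    applied to the global average of the input, followed by a softmax.
    """
    pooled = sum(signal) / len(signal)  # global average pooling
    coeffs = softmax([w * pooled for w in gate_weights])
    size = len(candidate_kernels[0])
    return [
        sum(c * kernel[i] for c, kernel in zip(coeffs, candidate_kernels))
        for i in range(size)
    ]


def conv1d_valid(signal, kernel):
    """Sliding dot product without padding (cross-correlation, as in most DL libraries)."""
    steps = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(steps)
    ]


def dynamic_conv1d(signal, candidate_kernels, gate_weights):
    """Build a kernel for this particular input, then apply it."""
    kernel = dynamic_kernel(signal, candidate_kernels, gate_weights)
    return conv1d_valid(signal, kernel)


if __name__ == "__main__":
    kernels = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]  # toy edge and identity kernels
    gates = [1.0, -1.0]                            # illustrative values only
    print(dynamic_kernel([0.0, 0.0, 0.0, 0.0], kernels, gates))
    print(dynamic_kernel([10.0, 10.0, 10.0, 10.0], kernels, gates))
```

With these toy values, an all-zero input blends the two candidate kernels equally, while a bright input pushes the blend almost entirely toward the first kernel, so two inputs are filtered with different effective kernels. A vanilla convolution would apply one fixed kernel to both.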
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper introduces a novel hybrid architecture called CiT-Net (Convolutional Neural Networks hand in hand with Vision Transformers) for medical image segmentation. The authors aim to address two primary challenges faced by existing hybrid architectures of CNNs and Transformers:

1. **Weak Local Feature Extraction Ability**: While CNNs can capture local image features using vanilla convolutions, they lack adaptive feature learning capabilities. This limitation affects their ability to accurately represent deformed organs and irregular lesions.
2. **Inadequate Global Feature Expression**: Transformers can capture global features but ignore channel and cross-dimensional self-attention, leading to lower segmentation accuracy on complex medical images with dense noise and low contrast.

To tackle these issues, the CiT-Net architecture incorporates the following innovations:

### Main Contributions

1. **Dynamic Deformable Convolution (DDConv)**: This module enables adaptive learning of the convolution kernel's weight coefficients and deformation offsets. DDConv addresses the fixed receptive field issue of vanilla convolutions and enhances the network's ability to perceive small and irregularly shaped targets in medical images.
2. **Shifted-Window Adaptive Complementary Attention Module (SW-ACAM)**: This module captures the cross-dimensional long-range dependency in medical images through four parallel branches of weight coefficient adaptive learning. SW-ACAM improves the separ