CNN-Transformer Rectified Collaborative Learning for Medical Image Segmentation

Lanhu Wu, Miao Zhang, Yongri Piao, Zhenyan Yao, Weibing Sun, Feng Tian, Huchuan Lu
2024-08-28
Abstract: Automatic and precise medical image segmentation (MIS) is of vital importance for clinical diagnosis and analysis. Current MIS methods mainly rely on convolutional neural networks (CNNs) or the self-attention mechanism (Transformers) for feature modeling. However, CNN-based methods suffer from inaccurate localization owing to their limited global dependency, while Transformer-based methods typically produce coarse boundaries for lack of local emphasis. Although some CNN-Transformer hybrid methods are designed to synthesize complementary local and global information for better performance, combining a CNN and a Transformer introduces numerous parameters and increases the computation cost. To this end, this paper proposes a CNN-Transformer rectified collaborative learning (CTRCL) framework to learn stronger CNN-based and Transformer-based models for MIS tasks via bi-directional knowledge transfer between them. Specifically, we propose a rectified logit-wise collaborative learning (RLCL) strategy, which introduces the ground truth to adaptively select and rectify the wrong regions in student soft labels for accurate knowledge transfer in the logit space. We also propose a class-aware feature-wise collaborative learning (CFCL) strategy to achieve effective knowledge transfer between CNN-based and Transformer-based models in the feature space by granting their intermediate features a similar capability of category perception. Extensive experiments on three popular MIS benchmarks demonstrate that our CTRCL outperforms most state-of-the-art collaborative learning methods under different evaluation metrics.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses key limitations of current medical image segmentation (MIS) methods, which mainly rely on convolutional neural networks (CNNs) or self-attention mechanisms (Transformers) for feature modeling. CNN-based methods struggle with accurate localization because they capture limited global dependencies, while Transformer-based methods produce coarse boundaries because they lack local emphasis. Some CNN-Transformer hybrid methods combine the complementary local and global information to improve performance, but they introduce a large number of parameters and increase the computational cost.

To overcome these issues, the paper proposes a CNN-Transformer Rectified Collaborative Learning (CTRCL) framework that trains stronger CNN-based and Transformer-based models for MIS tasks through bidirectional knowledge transfer between them. Specifically, the CTRCL framework includes two strategies:

1. **Rectified Logit-wise Collaborative Learning (RLCL)**: Achieves accurate knowledge transfer in the logit space by using the ground truth to adaptively select and rectify the wrong regions in student soft labels.
2. **Class-aware Feature-wise Collaborative Learning (CFCL)**: Achieves effective knowledge transfer in the feature space by giving the intermediate features of the CNN-based and Transformer-based models a similar capability of category perception.

### Main Contributions

- **First Attempt**: The CTRCL framework is the first to adopt a collaborative learning mechanism to learn stronger CNN-based and Transformer-based models through bidirectional knowledge transfer in both the logit space and the feature space.
- **RLCL Strategy**: Proposes a rectified logit-wise collaborative learning strategy that keeps student soft labels high-quality by using the ground truth to select and rectify their wrong regions.
- **CFCL Strategy**: Proposes a class-aware feature-wise collaborative learning strategy that achieves effective knowledge transfer in the feature space between heterogeneous networks by giving their intermediate features a similar capability of category perception.
- **Experimental Validation**: Experiments on three popular MIS benchmark datasets demonstrate that the CTRCL framework outperforms most state-of-the-art collaborative learning methods across multiple evaluation metrics. Specifically, on the Kvasir-SEG dataset, CTRCL reduces the MAE of ResNet-50 and MiT-B2 by 42.93% and 31.23%, respectively.

Through these contributions, the CTRCL framework improves the segmentation accuracy of both the CNN-based and the Transformer-based model.
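
The logit-space idea summarized above (use the ground truth to select the wrongly predicted regions of a soft label, rectify them, and let the other network learn from the result) can be illustrated with a small NumPy sketch for binary segmentation. This is not the authors' implementation: the function names, the 0.5 threshold, the flip rule `p -> 1 - p` used for rectification, and the soft binary cross-entropy loss are all assumptions chosen for illustration, since the summary does not specify them.

```python
import numpy as np


def rectify_soft_label(prob, gt, threshold=0.5):
    """Rectify the wrongly predicted pixels of a soft label (hypothetical rule).

    prob: foreground probabilities in [0, 1], shape (H, W)
    gt:   binary ground-truth mask, shape (H, W)
    Returns the rectified soft label and the boolean mask of selected pixels.
    """
    prob = np.asarray(prob, dtype=np.float64)
    gt = np.asarray(gt).astype(bool)
    # Select: pixels whose thresholded prediction disagrees with the ground truth.
    wrong = (prob > threshold) != gt
    # Rectify (assumed rule): flip the probability so it lands on the correct
    # side of the threshold while keeping the original confidence level.
    rectified = np.where(wrong, 1.0 - prob, prob)
    return rectified, wrong


def soft_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a prediction and a soft target."""
    pred = np.clip(pred, eps, 1.0 - eps)
    loss = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return float(np.mean(loss))


def collaborative_losses(prob_cnn, prob_trans, gt):
    """Bidirectional transfer: each model learns from the other's rectified label."""
    target_for_trans, _ = rectify_soft_label(prob_cnn, gt)
    target_for_cnn, _ = rectify_soft_label(prob_trans, gt)
    loss_cnn = soft_bce(np.asarray(prob_cnn, dtype=np.float64), target_for_cnn)
    loss_trans = soft_bce(np.asarray(prob_trans, dtype=np.float64), target_for_trans)
    return loss_cnn, loss_trans


if __name__ == "__main__":
    gt = np.array([[1, 1], [0, 0]])
    prob_cnn = np.array([[0.9, 0.2], [0.3, 0.6]])
    prob_trans = np.array([[0.7, 0.8], [0.4, 0.1]])
    rectified, wrong = rectify_soft_label(prob_cnn, gt)
    print(rectified)
    print(wrong)
    print(collaborative_losses(prob_cnn, prob_trans, gt))
```

In a real training loop the two losses would be added to each model's supervised segmentation loss and computed on differentiable tensors. Flipping the probability is only one possible rectification; replacing the selected pixels with the ground-truth value would be another.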