Cross Attention Multi Scale CNN-Transformer Hybrid Encoder is General Medical Image Learner.

Rongzhou Zhou, Junfeng Yao, Qingqi Hong, Xingxin Li, Xianpeng Cao
DOI: https://doi.org/10.1007/978-981-99-8558-6_8
2024-01-01
Abstract: Medical image segmentation plays a crucial role in medical artificial intelligence. Recent advances in computer vision have introduced the multiscale ViT (Vision Transformer), which has shown robustness and strong feature extraction capabilities. However, because ViT processes image patches independently, it often pays insufficient attention to fine details. In medical image segmentation tasks such as organ and tumor segmentation, precise boundary delineation is of utmost importance. To address this challenge, this study proposes two novel CNN-Transformer feature fusion modules: SFM (Shallow Fusion Module) and DFM (Deep Fusion Module). These modules integrate high-level and low-level semantic information from the feature pyramid while maintaining network efficiency. To speed up network convergence, the Deep Supervise method is introduced during the training phase. Extensive ablation experiments and comparative studies are conducted on two well-known public datasets, Synapse and ACDC, to evaluate the proposed approach. The experimental results demonstrate the efficacy of the proposed modules and training method, and show that the architecture outperforms previous methods. The code and trained models will be available soon.
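The abstract names deep supervision as the training aid but does not say which loss, how many scales, or which weights are used. As a rough illustration only, the general idea (attach a loss to predictions at several decoder resolutions and sum them with weights) can be sketched in plain Python. The soft Dice loss, the factor-of-two scale spacing, the nearest-neighbour downsampling, and all function names below are assumptions made for this sketch, not details taken from the paper.

```python
from typing import List, Sequence

Mask = List[List[float]]  # 2-D map of per-pixel foreground values in [0, 1]


def dice_loss(pred: Mask, target: Mask, eps: float = 1e-6) -> float:
    """Soft Dice loss for one binary mask: 1 - 2*|P*T| / (|P| + |T|)."""
    inter = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 - (2.0 * inter + eps) / (total + eps)


def downsample(mask: Mask, factor: int) -> Mask:
    """Nearest-neighbour downsampling by an integer factor (assumed scheme)."""
    return [row[::factor] for row in mask[::factor]]


def deep_supervision_loss(
    preds: Sequence[Mask], target: Mask, weights: Sequence[float]
) -> float:
    """Weighted sum of Dice losses over predictions at several resolutions.

    preds[0] is assumed to be full resolution and preds[i] to be 2**i times
    smaller along each axis; the target is downsampled to match each one.
    """
    if len(preds) != len(weights):
        raise ValueError("need exactly one weight per prediction scale")
    total = 0.0
    for i, (pred, weight) in enumerate(zip(preds, weights)):
        total += weight * dice_loss(pred, downsample(target, 2 ** i))
    return total


# Toy example: a 4x4 ground-truth mask with a 2x2 foreground block.
target = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
full_res_pred = [row[:] for row in target]   # perfect full-resolution output
half_res_pred = [[0.0, 0.0], [0.0, 0.0]]     # auxiliary output that misses the organ
loss = deep_supervision_loss([full_res_pred, half_res_pred], target, [1.0, 0.5])
```

In this toy case the full-resolution term contributes nothing and the auxiliary term contributes about half of its weight-scaled maximum, so the coarse head still receives a training signal even though the final output is already correct. That extra signal at intermediate scales is the usual motivation for deep supervision.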