SAM-Swin: SAM-Driven Dual-Swin Transformers with Adaptive Lesion Enhancement for Laryngo-Pharyngeal Tumor Detection

Jia Wei, Yun Li, Xiaomao Fan, Wenjun Ma, Meiyu Qiu, Hongyu Chen, Wenbin Lei
2024-10-29
Abstract: Laryngo-pharyngeal cancer (LPC) is a highly lethal malignancy in the head and neck region. Recent advancements in tumor detection, particularly through dual-branch network architectures, have significantly improved diagnostic accuracy by integrating global and local feature extraction. However, challenges remain in accurately localizing lesions and fully capitalizing on the complementary nature of features within these branches. To address these issues, we propose SAM-Swin, an innovative SAM-driven Dual-Swin Transformer for laryngo-pharyngeal tumor detection. This model leverages the robust segmentation capabilities of the Segment Anything Model 2 (SAM2) to achieve precise lesion segmentation. Meanwhile, we present a multi-scale lesion-aware enhancement module (MS-LAEM) designed to adaptively enhance the learning of nuanced complementary features across various scales, improving the quality of feature extraction and representation. Furthermore, we implement a multi-scale class-aware guidance (CAG) loss that delivers multi-scale targeted supervision, thereby enhancing the model's capacity to extract class-specific features. To validate our approach, we compiled three LPC datasets from the First Affiliated Hospital (FAHSYSU), the Sixth Affiliated Hospital (SAHSYSU) of Sun Yat-sen University, and Nanfang Hospital of Southern Medical University (NHSMU). The FAHSYSU dataset is utilized for internal training, while the SAHSYSU and NHSMU datasets serve for external evaluation. Extensive experiments demonstrate that SAM-Swin outperforms state-of-the-art methods, showcasing its potential for advancing LPC detection and improving patient outcomes. The source code of SAM-Swin is available at https://github.com/VVJia/SAM-Swin.
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper aims to solve several key challenges in laryngo-pharyngeal cancer (LPC) detection:

1. **Accurate lesion localization**: In endoscopic images, the lesion area often resembles the background, which makes it difficult for a model to distinguish and localize the lesion. A poorly localized lesion passes irrelevant or misleading information to later network stages and degrades classification performance.
2. **Fully exploiting the complementarity of global and local features**: Existing dual-branch networks usually fuse global and local features by simple concatenation. This may not fully utilize the complementarity of the two feature sets, which limits the effectiveness of feature fusion.
3. **Improving the quality of feature extraction and representation**: Detecting and classifying LPC more accurately requires stronger learning of subtle complementary features at different scales.

To solve these problems, the authors propose a framework named SAM-Swin, which comprises the following components:

- **SAM2-Guided Lesion Location Module (SAM2-GLLM)**: Uses the segmentation ability of the Segment Anything Model 2 (SAM2) to segment the lesion area precisely.
- **Multi-scale Lesion-aware Enhancement Module (MS-LAEM)**: Adaptively enhances the learning of subtle complementary features at different scales, improving the quality of feature extraction and representation.
- **Multi-scale Class-aware Guidance Loss (CAG Loss)**: Applies targeted supervision at multiple scales to improve the model's ability to extract class-specific features, and thereby its ability to discriminate among tumor classes.
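As an illustration of how a lesion mask could feed a local branch, the following minimal sketch crops an image to the bounding box of a binary mask. It is an assumption made for illustration only: the summary does not say how SAM-Swin turns the SAM2 output into the local-branch input, and the names `mask_bounding_box` and `crop_lesion` are hypothetical. Images and masks are plain nested lists so the example needs no external libraries.

```python
def mask_bounding_box(mask):
    """Return the inclusive (top, left, bottom, right) box around the
    nonzero cells of a 2D binary mask, or None if the mask is empty."""
    if not mask or not mask[0]:
        return None
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


def crop_lesion(image, mask):
    """Crop `image` to the mask's bounding box.

    Hypothetical fallback: when no lesion pixel is found, the whole image
    is returned unchanged.
    """
    box = mask_bounding_box(mask)
    if box is None:
        return image
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Toy 4x4 "image" whose pixel values are just their flat indices.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
lesion_patch = crop_lesion(image, mask)
```

In a dual-branch setup of this kind, the full `image` would go to the global branch and `lesion_patch` to the local branch.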
According to the authors' experiments, SAM-Swin outperforms state-of-the-art methods on the LPC detection task, which could make diagnosis more accurate and reliable and, in turn, help improve patient outcomes.

### Formula summary

- **Overall objective loss function**:

  \[ L_{\text{total}} = L_{\text{cls}} + L_w + L_l \]

  where:
  - \(L_{\text{cls}}\) is the final classification loss.
  - \(L_w\) and \(L_l\) are the global and local CAG losses, respectively.

- **Multi-scale CAG loss**:

  \[ L_w = \sum_{i=1}^{4} \left(2^{i-1}\alpha\right) L_{s_i}^{w}\left(\hat{y}_{s_i}^{w}, y\right) \]

  \[ L_l = \sum_{i=1}^{4} \left(2^{i-1}\alpha\right) L_{s_i}^{l}\left(\hat{y}_{s_i}^{l}, y\right) \]

  where:
  - \(L_{s_i}^{w}(\cdot)\) and \(L_{s_i}^{l}(\cdot)\) are the cross-entropy losses at the \(i\)-th stage of the global branch (WIB) and the local branch (LRB), respectively.
  - \(\alpha\) is a trade-off hyperparameter.

- **Classification loss**:

  \[ L_{\text{cls}} = \text{CrossEntropyLoss}(\hat{y}_{\text{cls}}, y) \]

Because the weight \(2^{i-1}\alpha\) doubles at each stage (\(\alpha, 2\alpha, 4\alpha, 8\alpha\)), deeper stages receive stronger supervision, and \(\alpha\) controls how much the stage-wise CAG terms contribute relative to \(L_{\text{cls}}\).
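The formulas above can be checked numerically with a small sketch. This is a plain-Python illustration of the arithmetic for a single sample, not the authors' implementation. The function names, the three-class logits, and the value `alpha = 0.1` are hypothetical choices made for the example.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy for one sample: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def cag_loss(stage_logits, label, alpha):
    """Sum over stages i = 1..4 of (2^(i-1) * alpha) * CE(stage i logits, label)."""
    return sum(
        (2 ** (i - 1)) * alpha * cross_entropy(logits, label)
        for i, logits in enumerate(stage_logits, start=1)
    )


def total_loss(final_logits, wib_stage_logits, lrb_stage_logits, label, alpha):
    """L_total = L_cls + L_w + L_l."""
    l_cls = cross_entropy(final_logits, label)
    l_w = cag_loss(wib_stage_logits, label, alpha)  # global (WIB) branch
    l_l = cag_loss(lrb_stage_logits, label, alpha)  # local (LRB) branch
    return l_cls + l_w + l_l


# Toy input: uniform logits over three hypothetical classes at every stage,
# so each cross-entropy term equals ln(3).
uniform = [0.0, 0.0, 0.0]
stages = [uniform] * 4
loss = total_loss(uniform, stages, stages, label=1, alpha=0.1)
```

With `alpha = 0.1` the stage weights are 0.1, 0.2, 0.4 and 0.8, so each branch contributes 1.5 times the per-stage cross-entropy, and the toy total is (1 + 1.5 + 1.5) · ln 3 = 4 ln 3.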