Multimodal self-supervised learning for lesion localization

Hao Yang, Hong-Yu Zhou, Cheng Li, Weijian Huang, Jiarun Liu, Yong Liang, Guangming Shi, Hairong Zheng, Qiegen Liu, Shanshan Wang
2024-08-20
Abstract: Multimodal deep learning utilizing imaging and diagnostic reports has made impressive progress in the field of medical imaging diagnostics, demonstrating a particularly strong capability for auxiliary diagnosis in cases where sufficient annotation information is lacking. Nonetheless, localizing diseases accurately without detailed positional annotations remains a challenge. Although existing methods have attempted to utilize local information to achieve fine-grained semantic alignment, their capability in extracting the fine-grained semantics of the comprehensive context within reports is limited. To address this problem, a new method is introduced that takes full sentences from textual reports as the basic units for local semantic alignment. This approach combines chest X-ray images with their corresponding textual reports, performing contrastive learning at both global and local levels. The leading results obtained by this method on multiple datasets confirm its efficacy in the task of lesion localization.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to address the problem of how to achieve precise localization of lesion areas in medical image diagnosis through a multimodal self-supervised learning method without detailed location annotations. Specifically, although existing methods can utilize local information for fine-grained semantic alignment, they are limited in their ability to extract comprehensive contextual fine-grained semantics from reports. Therefore, this paper proposes a new method that performs local semantic alignment by using complete sentences from text reports as basic units, combining chest X-ray images and corresponding text reports, and conducting contrastive learning at both global and local levels to improve the accuracy of lesion localization.

### Main Issues:

1. **Lack of detailed location annotations**: Existing medical imaging data often lack detailed lesion location annotations, making precise lesion localization difficult.
2. **Insufficient fine-grained semantic alignment**: Existing methods are limited in their ability to extract comprehensive contextual fine-grained semantics from reports, leading to low accuracy in lesion localization.
3. **Handling unseen diseases**: Existing methods perform poorly when dealing with unseen diseases, requiring more adjustments or fine-tuning.

### Solutions:

- **Multimodal self-supervised learning**: Combining chest X-ray images and corresponding text reports, using global and local contrastive learning to achieve precise localization of lesion areas.
- **Sentence-level semantic alignment**: Using complete sentences from text reports as basic units for local semantic alignment, rather than traditional word-level alignment, to better capture the complete meaning of lesion descriptions.
- **Joint optimization of global and local features**: Learning shared latent semantic representations through joint optimization of global contrastive loss and local contrastive loss to achieve fine-grained semantic alignment.
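
The joint global and sentence-level objective can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an InfoNCE-style symmetric loss over cosine similarities, attention pooling of image patch embeddings for each sentence, and other sentences from the same report as local negatives. The names `info_nce`, `attend`, `local_loss`, `joint_loss`, and the `local_weight` and `temperature` parameters are hypothetical and chosen only for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # L2-normalize so that dot products become cosine similarities.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def info_nce(queries, keys, temperature=0.1):
    """Symmetric InfoNCE (assumed form): queries[i] and keys[i] are positives."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    n = len(q)
    sim = [[dot(q[i], k[j]) / temperature for j in range(n)] for i in range(n)]
    loss = 0.0
    for i in range(n):
        row = sim[i]                         # query i against all keys
        col = [sim[j][i] for j in range(n)]  # key i against all queries
        loss -= math.log(softmax(row)[i])
        loss -= math.log(softmax(col)[i])
    return loss / (2 * n)

def attend(sentence, patches, temperature=0.1):
    """Pool patch embeddings with attention weights from one sentence embedding."""
    s = normalize(sentence)
    p = [normalize(v) for v in patches]
    weights = softmax([dot(s, v) / temperature for v in p])
    return [sum(w * v[d] for w, v in zip(weights, p)) for d in range(len(s))]

def local_loss(sentences, patches, temperature=0.1):
    # Each sentence is paired with its own attended image feature
    # (assumption: other sentences of the same report act as negatives).
    attended = [attend(s, patches, temperature) for s in sentences]
    return info_nce(sentences, attended, temperature)

def joint_loss(image_globals, report_globals, sentence_sets, patch_sets,
               local_weight=1.0, temperature=0.1):
    """Global image-report loss plus the mean sentence-level local loss."""
    global_term = info_nce(image_globals, report_globals, temperature)
    local_terms = [local_loss(s, p, temperature)
                   for s, p in zip(sentence_sets, patch_sets)]
    return global_term + local_weight * sum(local_terms) / len(local_terms)
```

In this sketch the attention weights computed in `attend` are also what one would inspect as a coarse localization map for a sentence, since they indicate which patches a sentence aligns with most strongly. How the paper derives its localization output is not stated in this summary.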
### Experimental Results:

- Experimental results on multiple datasets show that the proposed method performs excellently in lesion localization tasks, significantly outperforming existing methods, especially when dealing with unseen diseases.
- Specific metrics such as Dice coefficient and IoU (Intersection over Union) have improved, validating the effectiveness and robustness of the proposed method.

In summary, this paper addresses the challenge of lesion localization in medical imaging by proposing a new multimodal self-supervised learning method, improving the accuracy and robustness of lesion localization, particularly excelling in handling unseen diseases.
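
For reference, the two metrics named above compare a predicted binary mask A with a ground-truth mask B: Dice = 2|A ∩ B| / (|A| + |B|) and IoU = |A ∩ B| / |A ∪ B|. The sketch below computes both, assuming masks are given as nested lists of 0/1 values with the same shape. The function name `dice_and_iou` is hypothetical, and scoring two empty masks as 1.0 is a convention assumed here, not something taken from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as nested lists of 0/1."""
    intersection = pred_area = target_area = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            pred_area += 1 if p else 0
            target_area += 1 if t else 0
    union = pred_area + target_area - intersection
    # Assumed convention: two empty masks count as a perfect match.
    dice = 2 * intersection / (pred_area + target_area) if union else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou

# Usage: the prediction covers two pixels, one of which is a true lesion pixel.
dice, iou = dice_and_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])
print(f"Dice={dice:.3f}, IoU={iou:.3f}")  # prints "Dice=0.667, IoU=0.500"
```

Dice is never lower than IoU for the same pair of masks, so values reported under the two metrics are not directly comparable.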