ASIMSA: Advanced Semantic Information Guided Multi-Scale Alignment Framework for Medical Vision-Language Pretraining

Shuai Xiao, Yangyang Zhang, Liming Jiang, Zhengxia Wang
DOI: https://doi.org/10.1109/iccia62557.2024.10719240
2024-01-01
Abstract: Medical Visual Language Pretraining (MVLP) uses textual reports as weak supervision to improve the learning of medical visual representations, and it has shown promise in a range of medical image analysis tasks. Previous studies have focused on image-report correspondences within the same instance, but they have overlooked high-level semantic associations across different instances, such as disease-level semantic correspondences. To address this issue, we propose an Advanced Semantic Information Guided Multi-Scale Alignment Framework (ASIMSA) with two main modules: 1) an Image-Report Alignment (IRA) module for learning shared representations between the image and report of the same instance, and 2) a Disease-Entity Alignment (DEA) module for learning disease entity correspondences across different instances. Extensive experiments on zero-shot classification, fine-tuned classification, and semantic segmentation tasks validate the effectiveness of the proposed method, demonstrating its stable and superior performance.
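The abstract does not state the training objectives, so the following is only an illustrative sketch of the distinction it draws between same-instance and cross-instance alignment. It assumes a CLIP-style symmetric contrastive (InfoNCE) loss for the instance level and a multi-positive variant for the disease level; the function names, the `disease_ids` input, and the temperature value are hypothetical and are not taken from the paper.

```python
import numpy as np


def _normalize(x):
    # L2-normalize each row so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def _log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def instance_alignment_loss(img, txt, temperature=0.1):
    """Assumed instance-level objective: image i is positive only with report i."""
    logits = _normalize(img) @ _normalize(txt).T / temperature
    i2t = -np.diag(_log_softmax(logits)).mean()    # image -> report
    t2i = -np.diag(_log_softmax(logits.T)).mean()  # report -> image
    return 0.5 * (i2t + t2i)


def disease_alignment_loss(img, txt, disease_ids, temperature=0.1):
    """Assumed disease-level objective: pairs sharing a disease id are positives."""
    logits = _normalize(img) @ _normalize(txt).T / temperature
    ids = np.asarray(disease_ids)
    # positives[i, j] is 1 when instance i and instance j carry the same id.
    positives = (ids[:, None] == ids[None, :]).astype(float)

    def one_direction(lg, mask):
        logp = _log_softmax(lg)
        # Average log-probability over all positives of each row.
        return -((logp * mask).sum(axis=1) / mask.sum(axis=1)).mean()

    return 0.5 * (one_direction(logits, positives)
                  + one_direction(logits.T, positives.T))
```

Under these assumptions, when every instance in a batch has a distinct disease id the second loss reduces to the first, which is one way to read the abstract's claim that disease-level alignment extends, rather than replaces, instance-level alignment.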