Medical Vision-Language Representation Learning with Cross-Modal Multi-Teacher Contrastive Distillation

Bingzhi Chen, Jiawei Zhu, Yishu Liu, Biqing Zeng, Jiahui Pan, Meirong Ding
DOI: https://doi.org/10.1109/icassp48485.2024.10447344
2024-01-01
Abstract: Medical vision-language representation learning has garnered considerable attention owing to its applicability to extracting generic representations from the image and text modalities. However, it remains challenging to acquire a comprehensive understanding of intra- and inter-modal semantic knowledge. In this paper, we propose a Cross-Modal Multi-Teacher Contrastive Distillation (CMCD) architecture, which aims to learn medical vision-language representations in a unified multi-teacher framework. Specifically, a cross-modal knowledge distillation (CKD) module is designed to refine reconstructed semantics under an additional supervision signal generated by momentum teachers from the other modality, achieving more robust semantic interaction across modalities. To alleviate the heterogeneity and semantic gaps between modalities, a multi-level contrastive learning (MCL) module is designed to align both intra- and inter-modal features via contrastive learning at multiple levels. Extensive experiments on two medical downstream tasks, i.e., Med-VQA and Med-ITC, demonstrate that CMCD consistently outperforms state-of-the-art methods.
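The abstract names two generic building blocks: momentum teachers and cross-modal contrastive alignment. The sketch below illustrates both in plain NumPy. It is not the CMCD implementation. It assumes the momentum teachers are updated as an exponential moving average (EMA) of student weights and that the inter-modal contrastive term is a symmetric InfoNCE loss; neither detail is stated in the abstract. The function names, the momentum value, and the temperature are all hypothetical.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.99):
    """Assumed momentum-teacher rule: teacher <- m * teacher + (1 - m) * student."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k] for k in teacher}

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])             # positives lie on the diagonal
    loss_i2t = -log_softmax(logits)[idx, idx].mean()    # image -> text direction
    loss_t2i = -log_softmax(logits.T)[idx, idx].mean()  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```

In this sketch, matched image and text embeddings give a lower loss than mismatched ones, and the teacher weights drift slowly toward the student at a rate set by `momentum`.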