MITER: Medical Image–TExt joint adaptive pretRaining with multi-level contrastive learning
Chang Shu, Yi Zhu, Xiaochu Tang, Jing Xiao, Youxin Chen, Xiu Li, Qian Zhang, Zheng Lu
DOI: https://doi.org/10.1016/j.eswa.2023.121526
IF: 8.5
2023-11-02
Expert Systems with Applications
Abstract: Multimodal medical pretraining models have recently come to play a significant role in automatic medical image and text analysis, which has wide social and economic impact in healthcare. Although they transfer quickly to downstream tasks, these models are severely limited because they can only be pretrained on professional medical image–text datasets, which usually contain very few samples. In this work we propose MITER (Medical Image–Text Joint adaptive Pretraining), a joint adaptive pretraining framework based on multi-level contrastive learning. It overcomes this limitation by pretraining image and text models for the medical domain while exploiting existing models pretrained on generic data, which contains an enormous number of samples. MITER features two types of objectives. The first type is uni-modal objectives, which pretrain the models on medical images and text separately on uni-modal tasks. The other type is a cross-modal objective, which pretrains the models jointly, allowing them to influence each other on cross-modal tasks. We also introduce a strategy that dynamically selects hard negative samples during training for better performance. Experimental results on four medical tasks (image-report retrieval, multi-label image classification, visual question answering, and report generation) show that MITER overcomes this limitation, greatly outperforming existing benchmark models on all four tasks. The source code of our framework is available online.
computer science, artificial intelligence; engineering, electrical & electronic; operations research & management science
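
Illustrative sketch (not the authors' code): the abstract mentions a cross-modal contrastive objective and dynamic selection of hard negative samples but gives no formulas. The snippet below shows one common way such an objective could look: a symmetric InfoNCE-style loss over paired image and text embeddings, in which each anchor keeps only its k most similar non-matching samples as negatives. The function names, the top-k selection rule, the temperature value, and the use of NumPy are all assumptions made for illustration; the actual MITER loss and selection strategy may differ.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def _one_direction_loss(sim, k):
    """InfoNCE-style loss for one retrieval direction.

    sim[i, j] is the scaled similarity between anchor i and candidate j;
    the diagonal holds the matched (positive) pairs.
    """
    pos = np.diag(sim)                       # (N,) positive-pair scores
    neg = sim.copy()
    np.fill_diagonal(neg, -np.inf)           # exclude positives from the negatives
    hard = np.sort(neg, axis=1)[:, -k:]      # k highest-scoring (hardest) negatives
    logits = np.concatenate([pos[:, None], hard], axis=1)
    # Numerically stable log-sum-exp over the positive and its hard negatives.
    m = logits.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(np.mean(log_denom - pos))


def hard_negative_contrastive_loss(img_emb, txt_emb, k=2, temperature=0.1):
    """Symmetric image-to-text and text-to-image loss with top-k hard negatives.

    Hypothetical helper: assumes row i of img_emb and row i of txt_emb
    come from the same image-report pair.
    """
    n = img_emb.shape[0]
    if n < 2 or k < 1:
        raise ValueError("need at least two pairs and k >= 1")
    k = min(k, n - 1)
    sim = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    return 0.5 * (_one_direction_loss(sim, k) + _one_direction_loss(sim.T, k))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    reports = images + 0.1 * rng.normal(size=(8, 16))  # toy "matched" text embeddings
    print(hard_negative_contrastive_loss(images, reports, k=3))
```

Because the negatives are re-ranked from the current similarity matrix on every call, the selected set changes as the encoders change, which is one simple reading of "dynamic" selection. Restricting the denominator to the hardest negatives focuses the gradient on the most confusable image-report pairs.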