Uni-Mlip: Unified Self-supervision for Medical Vision Language Pre-training

Ameera Bawazir, Kebin Wu, Wenbin Li
2024-11-20
Abstract: Recent advancements in vision-language pre-training via contrastive learning have significantly improved performance across computer vision tasks. However, in the medical domain, obtaining multimodal data is often costly and challenging due to privacy, sensitivity, and annotation complexity. To mitigate data scarcity while boosting model performance, we introduce Uni-Mlip, a unified self-supervision framework specifically designed to enhance medical vision-language pre-training. Uni-Mlip seamlessly integrates cross-modality, uni-modality, and fused-modality self-supervision techniques at the data-level and the feature-level. Additionally, Uni-Mlip tailors uni-modal image self-supervision to accommodate the unique characteristics of medical images. Our experiments across datasets of varying scales demonstrate that Uni-Mlip significantly surpasses current state-of-the-art methods in three key downstream tasks: image-text retrieval, image classification, and visual question answering (VQA).
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
### What problem does this paper attempt to address?

This paper addresses the challenges of pre-training on multimodal data (images and text) in the medical field. The authors point out that obtaining such data is often expensive and difficult because of privacy, sensitivity, and annotation complexity. The resulting data scarcity limits how far model performance can be improved. In response, the authors propose **Uni-Mlip**, a unified self-supervised framework designed specifically to enhance medical vision-language pre-training.

Uni-Mlip pursues three main goals:

1. **Self-supervision across cross-modal, uni-modal, and fused-modal settings**: Uni-Mlip integrates these techniques at both the data level and the feature level, so that scarce medical data is mined more effectively and the model learns well-aligned features for downstream tasks.
2. **Adaptation to the unique characteristics of medical images**: The uni-modal image self-supervision is tailored to medical images so that subtle patterns and anomalies are preserved and handled correctly.
3. **Better downstream performance**: Experiments show that Uni-Mlip significantly outperforms existing state-of-the-art methods in image-text retrieval, image classification, and visual question answering.

### Core contributions of Uni-Mlip

1. **First systematic exploration of self-supervision at the feature level and data level**: Uni-Mlip is described as the first framework to systematically explore self-supervision at both levels, in uni-modal and multimodal settings, to consistently align the image and text modalities.
2. **Adaptation to the precision and detail sensitivity of medical images**: The uni-modal image self-supervision is tailored to the high precision and sensitivity to detail that medical images require.
3. **Comprehensive experimental verification**: Experiments on datasets of varying scales show Uni-Mlip outperforming state-of-the-art methods in image-text retrieval, image classification, and visual question answering.

Through these improvements, Uni-Mlip provides a general medical vision-language pre-training model that maintains strong performance across a variety of downstream tasks.
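
The cross-modal self-supervision mentioned in this summary builds on contrastive alignment of paired images and texts. The following is a minimal sketch of that general idea, assuming a CLIP-style symmetric InfoNCE objective; the summary does not state the paper's exact loss, so this should not be read as the authors' implementation. The function names, the `temperature` value, and the toy embeddings are hypothetical, and plain Python lists are used only to keep the example self-contained (a real pipeline would likely use a tensor library).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def log_softmax_at(row, target):
    """Log-probability of index `target` under a softmax over `row` (numerically stable)."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(v - m) for v in row))
    return row[target] - log_sum


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair i of images and texts is the positive match.

    `temperature` is an assumed hyperparameter, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view

    image_to_text = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    text_to_image = -sum(log_softmax_at(logits_t[i], i) for i in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


# Toy usage: correctly paired embeddings should score a lower loss than shuffled ones.
toy = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(toy, toy)
shuffled_loss = symmetric_contrastive_loss(toy, [toy[1], toy[2], toy[0]])
```

The loss is averaged over both directions (image-to-text and text-to-image) so that neither modality dominates the alignment. Uni-modal and fused-modal terms of the kind the summary describes would be added on top of an objective like this one.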