Abstract: Recent advancements in vision-language pre-training via contrastive learning have significantly improved performance across computer vision tasks. However, in the medical domain, obtaining multimodal data is often costly and challenging due to privacy, sensitivity, and annotation complexity. To mitigate data scarcity while boosting model performance, we introduce **Uni-Mlip**, a unified self-supervision framework specifically designed to enhance medical vision-language pre-training. Uni-Mlip seamlessly integrates cross-modality, uni-modality, and fused-modality self-supervision techniques at the data-level and the feature-level. Additionally, Uni-Mlip tailors uni-modal image self-supervision to accommodate the unique characteristics of medical images. Our experiments across datasets of varying scales demonstrate that Uni-Mlip significantly surpasses current state-of-the-art methods in three key downstream tasks: image-text retrieval, image classification, and visual question answering (VQA).


### What problem does this paper attempt to solve?

This paper addresses the challenges of pre-training on multimodal data (images and text) in the medical field. Specifically, the authors point out that obtaining multimodal medical data is often expensive and difficult because of data privacy, sensitivity, and annotation complexity. The resulting data scarcity limits how far model performance can be improved.

To address this, the authors propose a unified self-supervised framework named **Uni-Mlip**, designed specifically to enhance medical vision-language pre-training. Its main goals are as follows:

1. **Self-supervision across cross-modal, uni-modal, and fused-modal settings**: Uni-Mlip integrates these techniques at the data level and the feature level to mine medical data more effectively and learn well-aligned features for downstream tasks.
2. **Adaptation to the unique characteristics of medical images**: Uni-Mlip tailors its uni-modal image self-supervision to medical images so that it can handle their subtle patterns and anomalies.
3. **Improved downstream performance**: Experiments show that Uni-Mlip significantly outperforms existing state-of-the-art methods on three key downstream tasks: image-text retrieval, image classification, and visual question answering (VQA).

### Core contributions of Uni-Mlip

1. **First systematic exploration of self-supervision at the feature level and data level**: The authors present Uni-Mlip as the first framework to systematically explore self-supervision at both the feature level and the data level, in uni-modal and multimodal settings, to consistently align the image and text modalities.
2. **Adaptation to the precision and detail sensitivity of medical images**: Uni-Mlip accommodates the high-precision, detail-sensitive requirements specific to medical images.
3. **Comprehensive experimental verification**: The experimental results show that Uni-Mlip outperforms prior methods on image-text retrieval, image classification, and visual question answering.

Through these improvements, Uni-Mlip provides a general medical vision-language pre-training model that maintains strong performance across a variety of downstream tasks.
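The cross-modal part of the framework builds on contrastive vision-language pre-training, which the abstract names as its starting point. As a rough illustration only, the snippet below sketches a generic symmetric (CLIP-style) contrastive loss over paired image and text embeddings. It is not the authors' implementation: the function names, the toy embeddings, and the temperature of 0.07 are all assumptions made for this example, and the uni-modal and fused-modality objectives of Uni-Mlip are not shown.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Generic image-text contrastive loss (hypothetical helper, not from the paper).

    Row i of image_embs and row i of text_embs are assumed to be a matched pair;
    every other combination in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(im, tx)) / temperature for tx in txts]
        for im in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy 2-D embeddings (made up for illustration).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(image_embs, [[0.9, 0.1], [0.2, 0.8]])
swapped = symmetric_contrastive_loss(image_embs, [[0.2, 0.8], [0.9, 0.1]])
print(aligned < swapped)
```

In this toy batch the loss is lower when each image embedding sits closest to its own report embedding than when the pairs are swapped, which is the behaviour a cross-modal alignment objective is meant to reward.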

Uni-Mlip: Unified Self-supervision for Medical Vision Language Pre-training
