BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Matthew P. Lungren, Tristan Naumann, Sheng Wang, Hoifung Poon

2024-01-17

Abstract: Biomedical data is inherently multimodal, comprising physical measurements and natural language narratives. A generalist biomedical AI model needs to simultaneously process different modalities of data, including text and images. Therefore, training an effective generalist biomedical model requires high-quality multimodal data, such as parallel image-text pairs. Here, we present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets such as MIMIC-CXR, and spans a diverse range of biomedical image types. PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles. Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing. We conducted extensive experiments and ablation studies on standard biomedical imaging tasks from retrieval to classification to visual question-answering (VQA). BiomedCLIP achieved new state-of-the-art results in a wide range of standard datasets, substantially outperforming prior approaches. Intriguingly, by large-scale pretraining on diverse biomedical image types, BiomedCLIP even outperforms state-of-the-art radiology-specific models such as BioViL in radiology-specific tasks such as RSNA pneumonia detection. In summary, BiomedCLIP is a fully open-access foundation model that achieves state-of-the-art performance on various biomedical tasks, paving the way for transformative multimodal biomedical discovery and applications. We release our models at https://aka.ms/biomedclip to facilitate future research in multimodal biomedical AI.

Computer Vision and Pattern Recognition, Computation and Language

What problem does this paper attempt to address?

The paper attempts to address three problems:

1. **Processing of Multimodal Biomedical Data**: Biomedical data is inherently multimodal, including both physical measurements and natural language descriptions. Existing biomedical multimodal models are limited in data volume, diversity, and openness, which restricts their generalization ability and performance.
2. **Acquisition of High-Quality Multimodal Data**: Compared to general-domain vision-language pre-training models, existing biomedical vision-language models face three main issues with their pre-training data: (1) much of the data is private, making many biomedical foundation models inaccessible; (2) existing parallel image-text datasets are relatively small, ranging from a few thousand to hundreds of thousands of pairs; (3) existing datasets lack diversity, with most focusing on chest X-rays, which limits generalization to other types of biomedical images.
3. **Development of High-Performance Biomedical Vision-Language Foundation Models**: To overcome the above issues, the paper constructs a large-scale, high-quality parallel image-text dataset (PMC-15M) and pre-trains a biomedical vision-language foundation model (BiomedCLIP) on it. The model aims to achieve state-of-the-art performance in various downstream tasks, including cross-modal retrieval, zero-shot image classification, and medical visual question answering.

By addressing these issues, the paper aims to advance multimodal biomedical research and to provide tools for future biomedical applications.
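To make the zero-shot image classification task mentioned above more concrete, here is a minimal sketch of how a CLIP-style model is commonly used for it: each candidate label is written as a text prompt, both the image and the prompts are embedded, and the prompt with the highest cosine similarity to the image is taken as the prediction. The three-dimensional vectors, the prompt wording, the `temperature` value, and the function names below are all hypothetical stand-ins chosen for illustration; they are not BiomedCLIP outputs or part of its released API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, temperature=0.01):
    """Return a {prompt: probability} dict for one image.

    `temperature` is an assumed fixed value; CLIP-style models usually
    learn this scale during pretraining.
    """
    prompts = list(text_embeddings)
    logits = [
        cosine_similarity(image_embedding, text_embeddings[p]) / temperature
        for p in prompts
    ]
    return dict(zip(prompts, softmax(logits)))


# Toy embeddings (hypothetical): in practice these would come from the
# model's text encoder and image encoder, respectively.
toy_text_embeddings = {
    "chest X-ray showing pneumonia": [0.9, 0.1, 0.0],
    "chest X-ray with no finding": [0.1, 0.9, 0.0],
    "histopathology slide": [0.0, 0.1, 0.9],
}
toy_image_embedding = [0.8, 0.3, 0.1]

probs = zero_shot_classify(toy_image_embedding, toy_text_embeddings)
predicted = max(probs, key=probs.get)
print(predicted)
```

The same similarity scores also underlie cross-modal retrieval: instead of taking the single best prompt for one image, one would rank a whole collection of captions (or images) by similarity to the query.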

PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents

CLIP in Medical Imaging: A Comprehensive Survey

A foundation model for joint segmentation, detection and recognition of biomedical objects across nine modalities

EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis

BiomedParse: a biomedical foundation model for image parsing of everything everywhere all at once

Mammo-CLIP: Leveraging Contrastive Language-Image Pre-training (CLIP) for Enhanced Breast Cancer Diagnosis with Multi-view Mammography

Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography

MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging

MultiMed: Massively Multimodal and Multitask Medical Understanding

MedCLIP-SAM: Bridging Text and Image Towards Universal Medical Image Segmentation

Advancing Accuracy in Multimodal Medical Tasks Through Bootstrapped Language-Image Pretraining (BioMedBLIP): Performance Evaluation Study

Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography

BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks

RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training

OpenMEDLab: An Open-source Platform for Multi-modality Foundation Models in Medicine

VisionCLIP: An Med-AIGC based Ethical Language-Image Foundation Model for Generalizable Retina Image Analysis

Multimodal Foundation Models For Echocardiogram Interpretation

MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine
