BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Matthew P. Lungren, Tristan Naumann, Sheng Wang, Hoifung Poon
2024-01-17
Abstract: Biomedical data is inherently multimodal, comprising physical measurements and natural language narratives. A generalist biomedical AI model needs to simultaneously process different modalities of data, including text and images. Therefore, training an effective generalist biomedical model requires high-quality multimodal data, such as parallel image-text pairs. Here, we present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets such as MIMIC-CXR, and spans a diverse range of biomedical image types. PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles. Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing. We conducted extensive experiments and ablation studies on standard biomedical imaging tasks from retrieval to classification to visual question-answering (VQA). BiomedCLIP achieved new state-of-the-art results in a wide range of standard datasets, substantially outperforming prior approaches. Intriguingly, by large-scale pretraining on diverse biomedical image types, BiomedCLIP even outperforms state-of-the-art radiology-specific models such as BioViL in radiology-specific tasks such as RSNA pneumonia detection. In summary, BiomedCLIP is a fully open-access foundation model that achieves state-of-the-art performance on various biomedical tasks, paving the way for transformative multimodal biomedical discovery and applications. We release our models at https://aka.ms/biomedclip to facilitate future research in multimodal biomedical AI.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The problems this paper attempts to address are:

1. **Processing of Multimodal Biomedical Data**: Biomedical data is inherently multimodal, including both physical measurements and natural language descriptions. Existing biomedical multimodal models are limited in data volume, diversity, and openness, which restricts their generalization ability and performance.
2. **Acquisition of High-Quality Multimodal Data**: Compared to general-domain vision-language pre-training models, existing biomedical vision-language models face three main issues with their pre-training data: (1) much of the data is private, making many biomedical foundation models inaccessible; (2) existing parallel image-text datasets are relatively small, ranging from a few thousand to hundreds of thousands of pairs; (3) existing datasets lack diversity, with most focusing on chest X-rays, which limits generalization to other types of biomedical images.
3. **Development of High-Performance Biomedical Vision-Language Foundation Models**: To overcome the above issues, the paper constructs a large-scale, high-quality parallel image-text dataset (PMC-15M) and pre-trains a biomedical vision-language foundation model (BiomedCLIP) on it. The model aims to achieve state-of-the-art performance on various downstream tasks, including cross-modal retrieval, zero-shot image classification, and medical visual question answering.

By addressing these issues, the paper aims to advance multimodal biomedical research and to provide tools for future biomedical applications.
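
To make the zero-shot image classification task mentioned above more concrete, here is a minimal sketch of how a CLIP-style model typically scores an image against text labels: both are embedded into a shared space, compared by cosine similarity, and the similarities are turned into probabilities with a temperature-scaled softmax. This is not BiomedCLIP's actual code or API. The function names, the three-dimensional toy embeddings, and the temperature of 0.07 are all assumptions chosen for illustration; in practice the embeddings would come from the model's image and text encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_scores(image_emb, text_embs):
    """Cosine similarity between one image embedding and each text embedding."""
    img = l2_normalize(image_emb)
    return [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_embs]

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, label_embs, labels, temperature=0.07):
    """Pick the label whose text embedding is most similar to the image embedding.

    The temperature value is an illustrative assumption, not a published setting.
    """
    probs = softmax(cosine_scores(image_emb, label_embs), temperature)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Toy embeddings (hypothetical): real ones would be produced by the encoders.
labels = ["pneumonia", "normal"]
label_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_emb = [0.9, 0.1, 0.0]

predicted, probs = zero_shot_classify(image_emb, label_embs, labels)
print(predicted, probs)
```

Cross-modal retrieval uses the same similarity scores, except that candidates are ranked by score rather than passed through a softmax over a fixed label set.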