Abstract: Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine. To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a pair of closely-integrated retriever and re-ranker. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models such as GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks.

What problem does this paper attempt to address?

The paper addresses how to improve semantic retrieval in biomedical information retrieval (IR) without a large amount of annotated data. Most existing biomedical IR systems rely on keyword matching, which easily misses articles that are semantically relevant but share no terms with the query. Dense retrieval models based on deep learning perform better, but they require many annotated query-article pairs, which are difficult to obtain in the biomedical field.

To solve this problem, the authors propose **MedCPT** (bioMedical Contrastive Pre-trained Transformers), a pre-trained transformer model based on contrastive learning and designed specifically for zero-shot biomedical IR. MedCPT is trained on large-scale PubMed user click logs, using contrastive learning to train an integrated retriever and re-ranker, and achieves state-of-the-art performance on multiple biomedical IR tasks.

### Main Contributions:

1. **Large-scale Dataset**: Collected 255 million user click logs from PubMed for model training.
2. **Contrastive Learning**: Trained an integrated retriever and re-ranker with contrastive learning, improving the model's generalization ability.
3. **Zero-shot Performance**: Achieved state-of-the-art performance on multiple biomedical IR tasks in zero-shot settings, surpassing various baseline models, including much larger ones.
4. **Multi-task Application**: MedCPT excels not only at document retrieval but also at sentence representation and article representation tasks.

### Experimental Results:

- **Document Retrieval**: MedCPT outperformed existing models, including Google's GTR-XXL and OpenAI's cpt-text-XL, on three individual biomedical tasks and on overall average performance on the BEIR benchmark.
- **Article Representation**: MedCPT achieved new best performance on the RELISH similar article dataset and the MeSH prediction task of SciDocs.
- **Sentence Representation**: MedCPT performed best or second best on the BIOSSES and MedSTS semantic evaluation tasks.

### Application Prospects:

- **Literature Search**: Enhance the performance of biomedical literature search engines like PubMed, improving the relevance of retrieval results.
- **Similar Article Recommendation**: Improve similar article recommendation algorithms in literature search.
- **Sentence-level Retrieval**: Support sentence-level literature search tasks, such as sentence-to-sentence retrieval.

In summary, MedCPT uses contrastive learning on large-scale PubMed user click logs to address the shortage of annotated data in biomedical IR, substantially improving zero-shot performance and offering broad application prospects.
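To make the idea of contrastive training on click logs more concrete, here is a minimal sketch of one common formulation: an InfoNCE loss with in-batch negatives, where each query is paired with its clicked article and the other articles in the batch act as negatives. The summary above does not state the exact loss MedCPT uses, so this is an illustration under that assumption and not the authors' implementation. The function names, the `temperature` parameter, and the toy two-dimensional vectors are all hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=1.0):
    """Mean InfoNCE loss over a batch (illustrative, not MedCPT's code).

    doc_vecs[i] is assumed to be the clicked article for query_vecs[i];
    every other article in the batch is treated as a negative.
    """
    if not query_vecs or len(query_vecs) != len(doc_vecs):
        raise ValueError("need equally sized, non-empty batches")
    total = 0.0
    for i, q in enumerate(query_vecs):
        scores = [dot(q, d) / temperature for d in doc_vecs]
        # log-sum-exp with the max subtracted for numerical stability
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # negative log-probability of the clicked article
        total += log_z - scores[i]
    return total / len(query_vecs)


if __name__ == "__main__":
    # Toy embeddings standing in for encoder outputs (assumed values).
    queries = [[1.0, 0.0], [0.0, 1.0]]
    clicked = [[1.0, 0.0], [0.0, 1.0]]      # aligned with the queries
    mismatched = [[0.0, 1.0], [1.0, 0.0]]   # clicks swapped
    print(in_batch_contrastive_loss(queries, clicked))
    print(in_batch_contrastive_loss(queries, mismatched))
```

In this toy setup the loss is lower when each query embedding is closest to its own clicked article, which is the behavior a retriever trained this way is pushed toward. In practice the vectors would come from transformer encoders and the loss would be minimized by gradient descent.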

MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval

Advancing PICO Element Detection in Biomedical Text via Deep Neural Networks

PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents

BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine

Contrastive pre-training and linear interaction attention-based transformer for universal medical reports generation

C2BERT - Cross-contrast BERT for Chinese Biomedical Sentence Representation.

BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Medi-CAT: Contrastive Adversarial Training for Medical Image Classification

MoRE: Multi-Modal Contrastive Pre-training with Transformers on X-Rays, ECGs, and Diagnostic Report

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Contrastive Learning of Medical Visual Representations from Paired Images and Text

MedCLIP: Contrastive Learning from Unpaired Medical Images and Text

MITER: Medical Image–TExt joint adaptive pretRaining with multi-level contrastive learning

BiomedParse: a biomedical foundation model for image parsing of everything everywhere all at once

MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging

BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers

Comparative Evaluation of Pre-Trained Language Models for Biomedical Information Retrieval

BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining