MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval

Qiao Jin, Won Kim, Qingyu Chen, Donald C. Comeau, Lana Yeganova, W. John Wilbur, Zhiyong Lu
DOI: https://doi.org/10.1093/bioinformatics/btad651
2023-10-04
Abstract: Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine. To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a pair of closely-integrated retriever and re-ranker. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models such as GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks.
Information Retrieval, Artificial Intelligence, Computation and Language, Quantitative Methods
What problem does this paper attempt to address?
The paper addresses how to improve semantic retrieval in biomedical information retrieval (IR) without a large amount of annotated data. Most existing biomedical IR systems rely on keyword matching, which easily misses articles that are semantically relevant but share no terms with the query. Dense retrieval models based on deep learning perform better, but they require many annotated query-article pairs, which are difficult to obtain in the biomedical field.

To solve this problem, the authors propose **MedCPT** (bioMedical Contrastive Pre-trained Transformers), a contrastively pre-trained transformer model designed for zero-shot biomedical IR. MedCPT is trained on large-scale PubMed user click logs and uses contrastive learning to train an integrated retriever and re-ranker, achieving state-of-the-art performance on multiple biomedical IR tasks.

### Main Contributions:

1. **Large-scale Dataset**: Collected 255 million PubMed user click logs for model training.
2. **Contrastive Learning**: Trained an integrated retriever and re-ranker with contrastive learning, improving the model's generalization ability.
3. **Zero-shot Performance**: Achieved state-of-the-art performance on multiple biomedical IR tasks in zero-shot settings, surpassing various baselines, including much larger models.
4. **Multi-task Application**: MedCPT excels not only at document retrieval but also at article representation and sentence representation tasks.

### Experimental Results:

- **Document Retrieval**: MedCPT outperformed existing models, including Google's GTR-XXL and OpenAI's cpt-text-XL, on three individual biomedical tasks and in overall average performance on the BEIR benchmark.
- **Article Representation**: MedCPT achieved new best performance on the RELISH similar article dataset and the MeSH prediction task of SciDocs.
- **Sentence Representation**: MedCPT performed best or second best on the BIOSSES and MedSTS semantic evaluation tasks.

### Application Prospects:

- **Literature Search**: Enhance biomedical literature search engines such as PubMed by improving the relevance of retrieval results.
- **Similar Article Recommendation**: Improve similar article recommendation algorithms in literature search.
- **Sentence-level Retrieval**: Support sentence-level literature search tasks, such as sentence-to-sentence retrieval.

In summary, MedCPT addresses the scarcity of annotated data in biomedical IR by applying contrastive learning to large-scale PubMed user click logs, which substantially improves zero-shot performance and gives the model broad application prospects.
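
To make the contrastive learning idea above more concrete, the following is a minimal sketch of an in-batch contrastive (InfoNCE-style) loss over query and article embeddings. It is an illustration under stated assumptions, not the authors' implementation: the summary does not specify the exact loss, so the dot-product scoring, the `temperature` parameter, and the function name `in_batch_contrastive_loss` are all hypothetical. Each query is paired with the article a user clicked, and the other articles in the batch serve as negatives.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=1.0):
    """Mean InfoNCE-style loss with in-batch negatives (illustrative only).

    Assumption: doc_vecs[i] is the embedding of the article clicked for
    query_vecs[i]; every other article in the batch is treated as a negative.
    """
    if not query_vecs or len(query_vecs) != len(doc_vecs):
        raise ValueError("need the same non-zero number of queries and articles")

    total = 0.0
    for i, q in enumerate(query_vecs):
        # Similarity of this query to every article in the batch.
        scores = [dot(q, d) / temperature for d in doc_vecs]
        # Numerically stable log-sum-exp over the batch.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability assigned to the clicked article.
        total += log_z - scores[i]
    return total / len(query_vecs)


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    clicked = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own article
    shuffled = [[0.0, 1.0], [1.0, 0.0]]  # pairs deliberately mismatched
    print(in_batch_contrastive_loss(queries, clicked))
    print(in_batch_contrastive_loss(queries, shuffled))
```

The loss is lower when each query embedding is closest to its own clicked article than when the pairs are mismatched, which is the signal that pushes a retriever to place queries near relevant articles. In practice the embeddings would come from transformer encoders and the loss would be computed on tensors with automatic differentiation; plain Python lists are used here only to keep the sketch self-contained.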