Abstract: Pretrained language models such as Bidirectional Encoder Representations from Transformers (BERT) have achieved state-of-the-art performance in natural language processing (NLP) tasks. Recently, BERT has been adapted to the biomedical domain. Despite their effectiveness, these models have hundreds of millions of parameters and are computationally expensive when applied to large-scale NLP applications. We hypothesized that the number of parameters of the original BERT can be dramatically reduced with minor impact on performance. In this study, we present Bioformer, a compact BERT model for biomedical text mining. We pretrained two Bioformer models (named Bioformer8L and Bioformer16L), which reduce the model size by 60% compared to BERTBase. Bioformer uses a biomedical vocabulary and was pretrained from scratch on PubMed abstracts and PubMed Central full-text articles. We thoroughly evaluated the performance of Bioformer as well as existing biomedical BERT models, including BioBERT and PubMedBERT, on 15 benchmark datasets covering four biomedical NLP tasks: named entity recognition, relation extraction, question answering, and document classification. The results show that with 60% fewer parameters, Bioformer16L is only 0.1% less accurate than PubMedBERT, while Bioformer8L is 0.9% less accurate than PubMedBERT. Both Bioformer16L and Bioformer8L outperformed BioBERTBase-v1.1. In addition, Bioformer16L and Bioformer8L are two to three times as fast as PubMedBERT/BioBERTBase-v1.1. Bioformer has been successfully deployed to PubTator Central, providing gene annotations over 35 million PubMed abstracts and 5 million PubMed Central full-text articles. We make Bioformer publicly available at https://github.com/WGLab/bioformer, including pretrained models, datasets, and instructions for downstream use.
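
To give a feel for how a size reduction of roughly 60% can arise, the following is a minimal sketch that counts the parameters of a BERT-style encoder from its hyperparameters. The BERT-Base settings are the standard published ones; the compact configuration (8 layers, hidden size 512, feed-forward size 2048, vocabulary of 32,768 tokens) is a hypothetical choice made for illustration and is not taken from the abstract. The function name `bert_param_count` is likewise an assumption of this sketch.

```python
def bert_param_count(vocab_size, hidden, layers, intermediate,
                     max_positions=512, type_vocab=2):
    """Count parameters of a BERT-style encoder (embeddings, layers, pooler).

    The masked-language-model head is not counted.
    """
    layer_norm = 2 * hidden  # scale and bias
    # Token, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections (weights + biases), then LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two feed-forward projections (weights + biases), then LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    # Dense pooler applied to the first token.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


# Standard BERT-Base (uncased) hyperparameters.
bert_base = bert_param_count(vocab_size=30522, hidden=768,
                             layers=12, intermediate=3072)

# Hypothetical compact configuration, assumed for illustration only.
compact = bert_param_count(vocab_size=32768, hidden=512,
                           layers=8, intermediate=2048)

reduction = 1 - compact / bert_base
print(f"BERT-Base: {bert_base:,} parameters")
print(f"Compact:   {compact:,} parameters")
print(f"Reduction: {reduction:.1%}")
```

Under these assumptions, most of the savings come from the narrower hidden size, because the attention and feed-forward weights scale quadratically with it, while the embedding table shrinks only linearly.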

BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining

BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine

An Extensive Benchmark Study on Biomedical Text Generation and Mining with ChatGPT

BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers

BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks

GeneGPT: augmenting large language models with domain tools for improved access to biomedical information

Bioformer: an efficient transformer language model for biomedical text mining

scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI

Evaluating GPT and BERT models for protein-protein interaction identification in biomedical text

GPT-3 Models are Poor Few-Shot Learners in the Biomedical Domain

Evaluation of GPT and BERT-based models on identifying protein-protein interactions in biomedical text

Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT