Abstract:

Background: Pretraining large-scale neural language models on raw texts has made a significant contribution to improving transfer learning in natural language processing. With the introduction of transformer-based language models, such as bidirectional encoder representations from transformers (BERT), the performance of information extraction from free text has improved significantly in both the general and medical domains. However, it is difficult to train specific BERT models to perform well in domains for which few large, high-quality databases are publicly available.

Objective: We hypothesized that this problem could be addressed by oversampling a domain-specific corpus and using it for pretraining with a larger corpus in a balanced manner. In the present study, we verified our hypothesis by developing pretraining models using our method and evaluating their performance.

Methods: Our proposed method was based on the simultaneous pretraining of models with knowledge from distinct domains after oversampling. We conducted three experiments in which we generated (1) English biomedical BERT from a small biomedical corpus, (2) Japanese medical BERT from a small medical corpus, and (3) enhanced biomedical BERT pretrained with complete PubMed abstracts in a balanced manner. We then compared their performance with that of conventional models.

Results: Our English BERT pretrained using both general and small medical domain corpora performed sufficiently well for practical use on the biomedical language understanding evaluation (BLUE) benchmark. Moreover, our proposed method was more effective than the conventional methods for each biomedical corpus of the same corpus size in the general domain. Our Japanese medical BERT outperformed the other BERT models built using a conventional method for almost all the medical tasks. The model demonstrated the same trend as that of the first experiment in English. Further, our enhanced biomedical BERT model, which was not pretrained on clinical notes, achieved superior clinical and biomedical scores on the BLUE benchmark with an increase of 0.3 points in the clinical score and 0.5 points in the biomedical score. These scores were above those of the models trained without our proposed method.

Conclusions: Well-balanced pretraining using oversampled instances derived from a corpus appropriate for the target task allowed us to construct a high-performance BERT model.
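
The balanced oversampling idea described in the Methods can be illustrated with a small sketch. The abstract does not give the exact procedure, so everything here is an assumption made for illustration: the function names `oversample` and `balanced_mix` are hypothetical, the small corpus is repeated cyclically until it matches the size of the larger corpus, and the two are then shuffled together so that both domains appear in equal proportion during pretraining.

```python
import random
from itertools import cycle, islice


def oversample(corpus, target_size):
    """Repeat documents from `corpus` in order until `target_size` items exist.

    Assumption: plain cyclic repetition; the paper may use a different scheme.
    """
    if not corpus:
        raise ValueError("corpus must not be empty")
    return list(islice(cycle(corpus), target_size))


def balanced_mix(small_corpus, large_corpus, seed=0):
    """Oversample the small corpus to the large corpus's size, then shuffle both together.

    Each item is tagged with its source so the resulting balance can be inspected.
    """
    target = max(len(small_corpus), len(large_corpus))
    upsampled = oversample(small_corpus, target)
    mixed = [("domain", doc) for doc in upsampled]
    mixed += [("general", doc) for doc in large_corpus]
    random.Random(seed).shuffle(mixed)  # fixed seed keeps the sketch reproducible
    return mixed


# Toy data standing in for a small medical corpus and a larger general corpus.
medical = ["med doc A", "med doc B"]
general = [f"general doc {i}" for i in range(6)]

mixed = balanced_mix(medical, general)
counts = {"domain": 0, "general": 0}
for source, _ in mixed:
    counts[source] += 1
print(counts)  # {'domain': 6, 'general': 6}
```

In this toy run the two medical documents are each repeated three times, so the mixed stream contains as many domain items as general items. A real pipeline would more likely balance at the level of pretraining instances or tokens rather than whole documents.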

How Beneficial Is Pretraining on a Narrow Domain-Specific Corpus for Information Extraction about Photocatalytic Water Splitting?

Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains.

The Effects of In-domain Corpus Size on pre-training BERT

CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks

Exploring the Benefits of Domain-Pretraining of Generative Large Language Models for Chemistry

Adapting Large Language Models to Domains via Reading Comprehension

Pretrained domain-specific language model for natural language processing tasks in the AEC domain

Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

Downstream Datasets Make Surprisingly Good Pretraining Corpora

Can Fine-tuning Pre-trained Models Lead to Perfect NLP? A Study of the Generalizability of Relation Extraction.

Domain-Specific Pretraining of Language Models: A Comparative Study in the Medical Field

CorpusBrain++: A Continual Generative Pre-Training Framework for Knowledge-Intensive Language Tasks

How Important is Domain Specificity in Language Models and Instruction Finetuning for Biomedical Relation Extraction?

Pre-training technique to localize medical BERT and enhance biomedical BERT

Investigating Pre-trained Language Models on Cross-Domain Datasets, a Step Closer to General AI

Domain-specific language models pre-trained on construction management systems corpora

Does your data spark joy? Performance gains from domain upsampling at the end of training

Oversampling effect in pretraining for bidirectional encoder representations from transformers (BERT) to localize medical BERT and enhance biomedical BERT

Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures