BIOptimus: Pre-training an Optimal Biomedical Language Model with Curriculum Learning for Named Entity Recognition

Pavlova Vera, Mohammed Makhlouf
DOI: https://doi.org/10.48550/arXiv.2308.08625
2023-08-17
Abstract: Using language models (LMs) pre-trained in a self-supervised setting on large corpora and then fine-tuning for a downstream task has helped to deal with the problem of limited label data for supervised learning tasks such as Named Entity Recognition (NER). Recent research in biomedical language processing has offered a number of biomedical LMs pre-trained using different methods and techniques that advance results on many BioNLP tasks, including NER. However, there is still a lack of a comprehensive comparison of pre-training approaches that would work more optimally in the biomedical domain. This paper aims to investigate different pre-training methods, such as pre-training the biomedical LM from scratch and pre-training it in a continued fashion. We compare existing methods with our proposed pre-training method of initializing weights for new tokens by distilling existing weights from the BERT model inside the context where the tokens were found. The method helps to speed up the pre-training stage and improve performance on NER. In addition, we compare how masking rate, corruption strategy, and masking strategies impact the performance of the biomedical LM. Finally, using the insights from our experiments, we introduce a new biomedical LM (BIOptimus), which is pre-trained using Curriculum Learning (CL) and contextualized weight distillation method. Our model sets new states of the art on several biomedical Named Entity Recognition (NER) tasks. We release our code and all pre-trained models.
Computation and Language
What problem does this paper attempt to address?
This paper addresses how to optimize the pre-training of language models in the biomedical domain in order to improve their performance on the named entity recognition (NER) task. Specifically, the authors focus on the following aspects:

1. **Lack of a comprehensive comparison of pre-training methods**: Although several language models for the biomedical domain already exist (such as BioBERT and PubMedBERT), their pre-training methods have not been systematically compared and evaluated. The authors therefore compare different pre-training methods to find a better strategy.
2. **Vocabulary problem**: Existing continued pre-training methods often lack a domain-specific vocabulary, which affects the model's understanding and processing of in-domain terms. To address this, the authors propose initializing the weights of new tokens by distilling contextualized weights from an existing BERT model.
3. **The influence of masking strategies and masking rates**: Different masking strategies and masking rates can strongly affect how well pre-training works. The authors compare a variety of masking strategies and masking rates experimentally to determine the best settings.
4. **The application of curriculum learning (CL)**: The authors introduce curriculum learning, which gradually increases the complexity of the pre-training task to help the model learn better. The paper reports that this improves the model's performance in the pre-training stage.

### Main contributions

- Propose a new weight initialization method: the word vectors of tokens in the new vocabulary are initialized by distilling contextualized weights from an existing BERT model.
- Systematically compare different pre-training methods, including pre-training from scratch, continued pre-training, and hybrid pre-training, and analyze their performance on the NER task.
- Explore the influence of different masking strategies and masking rates on pre-training.
- Introduce curriculum learning, gradually increasing the complexity of the pre-training task and thereby improving the performance of the model.

### Experimental results

The authors evaluate their method on multiple NER datasets (such as BC5-chem, BC5-disease, NCBI-disease, etc.). The results show that the model pre-trained with curriculum learning and contextualized weight distillation (BIOptimus 0.4) achieves new state-of-the-art results on multiple datasets, especially on BC5-chem, NCBI-disease, and BC2GM.

In short, this paper aims to optimize the pre-training of biomedical language models through systematic research and experiments, thereby improving their performance on the named entity recognition task.
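
To make the masking-rate and curriculum ideas above more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The stage names and masking rates in `CURRICULUM` are hypothetical values chosen only for illustration, and `mask_tokens` and `run_curriculum` are made-up helper names. The corruption rule (80% `[MASK]`, 10% random token, 10% unchanged) is the standard BERT scheme, assumed here as one possible corruption strategy.

```python
import random

MASK_TOKEN = "[MASK]"

# Hypothetical curriculum: (stage name, masking rate), ordered from easier to
# harder. These values are illustrative assumptions, not taken from the paper.
CURRICULUM = [("easy", 0.15), ("medium", 0.25), ("hard", 0.40)]


def mask_tokens(tokens, mask_rate, vocab, rng):
    """Corrupt a token list for masked language modeling.

    Returns (corrupted, labels). labels[i] holds the original token at each
    selected position and None elsewhere, so a loss would only be computed
    on the selected positions.
    """
    if not tokens:
        return [], []
    # Select at least one position, and never more than the sequence length.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for pos in positions:
        labels[pos] = tokens[pos]
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = MASK_TOKEN          # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[pos] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, labels


def run_curriculum(tokens, vocab, seed=0):
    """Apply each curriculum stage in order and collect the masked examples."""
    rng = random.Random(seed)
    stages = []
    for name, rate in CURRICULUM:
        corrupted, labels = mask_tokens(tokens, rate, vocab, rng)
        stages.append({"stage": name, "mask_rate": rate,
                       "corrupted": corrupted, "labels": labels})
    return stages


if __name__ == "__main__":
    sentence = "the patient was given aspirin for chest pain last week".split()
    toy_vocab = ["drug", "gene", "cell", "dose"]
    for stage in run_curriculum(sentence, toy_vocab):
        print(stage["stage"], stage["mask_rate"], " ".join(stage["corrupted"]))
```

In a real pre-training run the schedule would drive a data collator over tokenized batches rather than a single sentence, and each stage would last for some number of training steps. The sketch only shows how a rising masking rate makes the reconstruction task progressively harder.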