Ensemble Transfer Learning on Augmented Domain Resources for Oncological Named Entity Recognition in Chinese Clinical Records
Meifeng Zhou, Jindian Tan, Song Yang, Haixia Wang, Lin Wang, Zhifeng Xiao
DOI: https://doi.org/10.1109/access.2023.3299824
IF: 3.9
2023-08-09
IEEE Access
Abstract: Biomedical Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP) and can help mine knowledge from massive clinical and diagnostic records. However, biomedical NER models are often trained in a low-resource setting because human annotation is expensive, which limits the capability of traditional NER models. In this study, we propose a two-stage learning pipeline to tackle the oncological NER task in Chinese, a typical task that lacks training resources. In the first stage, two base models pre-trained with Word2Vec and Bidirectional Encoder Representations from Transformers (BERT) are fine-tuned to obtain domain-specific word embeddings that serve as the input for the downstream NER task. In the second stage, we feed the word embeddings into a neural network that consists of a Bidirectional Long Short-Term Memory (BiLSTM) recurrent neural network and a linear-chain Conditional Random Field (CRF) for end-task training. Meanwhile, we utilize a substitution-based generative model for data augmentation (DA), aiming to increase the quantity and diversity of the training data. Experiments show that the proposed learning pipeline outperforms alternative models in a low-resource setting. Specifically, the results show that the proposed fine-tuning strategy, when conducted on an augmented domain resource, can effectively incorporate rich domain knowledge into the final NER model, showing strong potential to boost a model's predictive power with limited training data.
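The abstract does not specify how the substitution-based augmentation works, so the following is only a minimal sketch of one common form of the idea: replacing each annotated entity mention with another mention of the same type and regenerating the BIO tags. It is a simplified stand-in, not the authors' generative model. The function names `extract_spans` and `augment`, the entity label `DIS`, the lexicon contents, and the character-level tokenization are all assumptions made for illustration.

```python
import random


def extract_spans(tags):
    """Return (start, end, type) entity spans from a BIO tag list; end is exclusive."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # Any tag that does not continue the open entity closes it.
        continues = tag.startswith("I-") and tag[2:] == etype
        if not continues and start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans


def augment(tokens, tags, lexicon, rng):
    """Swap each entity mention for a same-type mention drawn from `lexicon`.

    `lexicon` maps an entity type to a list of mention strings (hypothetical
    data); mentions are split into characters, as is common for Chinese NER.
    """
    new_tokens, new_tags = [], []
    cursor = 0
    for start, end, etype in extract_spans(tags):
        # Copy the untouched context before the entity.
        new_tokens.extend(tokens[cursor:start])
        new_tags.extend(tags[cursor:start])
        candidates = lexicon.get(etype)
        # Keep the original mention when no substitute is available.
        mention = list(rng.choice(candidates)) if candidates else tokens[start:end]
        new_tokens.extend(mention)
        new_tags.extend(["B-" + etype] + ["I-" + etype] * (len(mention) - 1))
        cursor = end
    new_tokens.extend(tokens[cursor:])
    new_tags.extend(tags[cursor:])
    return new_tokens, new_tags


# Hypothetical example sentence and lexicon.
tokens = list("患者确诊肺癌")
tags = ["O", "O", "O", "O", "B-DIS", "I-DIS"]
lexicon = {"DIS": ["乳腺癌"]}
aug_tokens, aug_tags = augment(tokens, tags, lexicon, random.Random(0))
print("".join(aug_tokens), aug_tags)
```

Because the tags are rebuilt from the length of the substituted mention, the augmented sentence stays consistently labeled even when the new mention is longer or shorter than the original.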
Computer Science, Information Systems; Telecommunications; Engineering, Electrical & Electronic