Abstract: The medical literature contains valuable knowledge, such as the clinical symptoms, diagnosis, and treatment of particular diseases. Named Entity Recognition (NER) is the first step in extracting this knowledge from unstructured text and presenting it as a Knowledge Graph (KG). However, previous NER approaches have often been limited by small-scale human-labelled training data. Furthermore, extracting knowledge from Chinese medical literature is more complex because Chinese text has no explicit delimiters between words. Recently, pretraining models, which learn representations carrying prior semantic knowledge from large-scale unlabelled corpora, have achieved state-of-the-art results on a wide variety of Natural Language Processing (NLP) tasks. However, the capabilities of pretraining models have not been fully exploited, and the application of pretraining models other than BERT to specific domains, such as NER in Chinese medical literature, is also of interest. In this paper, we enhance the performance of NER in Chinese medical literature using pretraining models. First, we propose a data augmentation method that replaces words in the training set with synonyms predicted by the Masked Language Model (MLM), a pretraining task. Then, we treat NER as a downstream task of the pretraining model and transfer the prior semantic knowledge obtained during pretraining to it. Finally, we conduct experiments comparing the performance of six pretraining models (BERT, BERT-WWM, BERT-WWM-EXT, ERNIE, ERNIE-tiny, and RoBERTa) in recognizing named entities in Chinese medical literature. The effects of feature extraction and fine-tuning, as well as different downstream model structures, are also explored. Experimental results demonstrate that the proposed data augmentation method yields meaningful improvements in recognition performance. In addition, RoBERTa-CRF achieves the highest F1-score compared with previous methods and the other pretraining models.
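The MLM-based data augmentation step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `augment`, the `predict` callback, and the stand-in `toy_predict` are hypothetical names introduced here, and restricting replacement to tokens labelled `O` (so that the entity labels stay valid) is a simplifying assumption rather than something the abstract states. In practice `predict` would wrap a pretrained masked language model; here it is a small lookup table so the example runs on its own.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

MASK = "[MASK]"

# Hypothetical predictor interface: given a token sequence containing one
# [MASK] and the index of that mask, return candidate tokens, best first.
Predictor = Callable[[Sequence[str], int], List[str]]


def augment(
    tokens: Sequence[str],
    labels: Sequence[str],
    predict: Predictor,
    rate: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[str]]:
    """Return a copy of (tokens, labels) with some tokens replaced by MLM candidates.

    Assumption: only tokens labelled "O" are replaced, so entity spans and
    their labels are left untouched.
    """
    rng = rng or random.Random(0)
    out = list(tokens)
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        # Skip entity tokens, and skip non-entity tokens with probability 1 - rate.
        if lab != "O" or rng.random() >= rate:
            continue
        masked = out[:i] + [MASK] + out[i + 1:]
        # Take the best candidate that differs from the original token.
        for cand in predict(masked, i):
            if cand != tok:
                out[i] = cand
                break
    return out, list(labels)


def toy_predict(masked: Sequence[str], i: int) -> List[str]:
    """Stand-in for a real MLM: proposes candidates from the left neighbour only."""
    table = {"患者": ["出现", "发生"]}
    left = masked[i - 1] if i > 0 else ""
    return table.get(left, [])


tokens = ["患者", "出现", "发热"]
labels = ["O", "O", "B-SYM"]
new_tokens, new_labels = augment(tokens, labels, toy_predict, rate=1.0)
print(new_tokens)  # ['患者', '发生', '发热']
print(new_labels)  # ['O', 'O', 'B-SYM']
```

Passing the predictor as a callback keeps the augmentation logic independent of any particular pretraining model, so the same loop could be reused with each of the six models compared in the paper.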

Pre-Training with Whole Word Masking for Chinese BERT

An Improved Mask Approach Based on Pointer Network for Domain Adaptation of BERT

"Is Whole Word Masking Always Better for Chinese BERT?": Probing on Chinese Grammatical Error Correction

RoBERTa-wwm-ext Fine-Tuning for Chinese Text Classification

CharBERT: Character-aware Pre-trained Language Model

W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training

ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information

Character, Word, or Both? Revisiting the Segmentation Granularity for Chinese Pre-trained Language Models

MarkBERT: Marking Word Boundaries Improves Chinese BERT

MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab Pretraining

BPDec: Unveiling the Potential of Masked Language Modeling Decoder in BERT pretraining

RoChBert: Towards Robust BERT Fine-tuning for Chinese

Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models

Using Selective Masking as a Bridge between Pre-training and Fine-tuning

NarrowBERT: Accelerating Masked Language Model Pretraining and Inference

MPNet: Masked and Permuted Pre-training for Language Understanding

Chinese MentalBERT: Domain-Adaptive Pre-training on Social Media for Chinese Mental Health Text Analysis

End-to-End Speech Recognition with Pre-trained Masked Language Model

TCBERT: A Technical Report for Chinese Topic Classification BERT

Named Entity Recognition in Chinese Medical Literature Using Pretraining Models