Abstract: The medical literature contains valuable knowledge, such as the clinical symptoms, diagnosis, and treatment of particular diseases. Named Entity Recognition (NER) is the first step in extracting this knowledge from unstructured text and representing it as a Knowledge Graph (KG). However, previous NER approaches have often been limited by small-scale, human-labelled training data. Furthermore, extracting knowledge from Chinese medical literature is more complex because Chinese text is written without delimiters between words. Recently, pretraining models, which learn representations carrying prior semantic knowledge from large-scale unlabelled corpora, have achieved state-of-the-art results on a wide variety of Natural Language Processing (NLP) tasks. However, the capabilities of pretraining models have not been fully exploited, and the application of pretraining models other than BERT to specific domains, such as NER in Chinese medical literature, is also of interest. In this paper, we enhance the performance of NER in Chinese medical literature using pretraining models. First, we propose a data augmentation method that replaces words in the training set with synonyms predicted by the Masked Language Model (MLM), which is itself a pretraining task. Then, we treat NER as a downstream task of the pretraining model and transfer to it the prior semantic knowledge obtained during pretraining. Finally, we conduct experiments to compare the performance of six pretraining models (BERT, BERT-WWM, BERT-WWM-EXT, ERNIE, ERNIE-tiny, and RoBERTa) in recognizing named entities in Chinese medical literature. The effects of feature extraction versus fine-tuning, as well as of different downstream model structures, are also explored. Experimental results demonstrate that the proposed data augmentation method yields meaningful improvements in recognition performance. In addition, RoBERTa-CRF achieves the highest F1-score among the previous methods and the other pretraining models compared.
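The MLM-based augmentation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `augment_with_mlm`, the `replace_prob` parameter, the `B-SYM`/`I-SYM` tag names, and the choice to replace only tokens labelled `O` (so that entity labels stay valid) are all assumptions made for illustration. The `toy_predictor` is a hypothetical stand-in for a real pretrained masked language model, which would return context-dependent candidates.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

MASK = "[MASK]"

# Hypothetical predictor interface: given a token sequence containing one
# MASK and the index of that MASK, return candidate tokens, best first.
Predictor = Callable[[Sequence[str], int], List[str]]


def augment_with_mlm(
    tokens: Sequence[str],
    labels: Sequence[str],
    predict_masked: Predictor,
    replace_prob: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[str]]:
    """Return an augmented copy of (tokens, labels).

    Assumption: only tokens labelled "O" are candidates for replacement,
    so the entity spans and their labels are left untouched.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    rng = rng or random.Random()
    new_tokens = list(tokens)
    for i, label in enumerate(labels):
        # Skip entity tokens, and skip most "O" tokens at random.
        if label != "O" or rng.random() >= replace_prob:
            continue
        masked = list(new_tokens)
        masked[i] = MASK
        # Take the best candidate that differs from the original token.
        for candidate in predict_masked(masked, i):
            if candidate != tokens[i]:
                new_tokens[i] = candidate
                break
    return new_tokens, list(labels)


def toy_predictor(masked_tokens: Sequence[str], index: int) -> List[str]:
    """Stand-in for a real MLM; always proposes the same candidate."""
    return ["某"]


if __name__ == "__main__":
    tokens = list("患者出现发热症状")
    labels = ["O", "O", "O", "O", "B-SYM", "I-SYM", "O", "O"]
    aug_tokens, aug_labels = augment_with_mlm(
        tokens, labels, toy_predictor, replace_prob=0.5, rng=random.Random(0)
    )
    print("".join(aug_tokens), aug_labels)
```

In practice, `predict_masked` would wrap a pretrained Chinese MLM, and the augmented sentences would be added to the original training set rather than replacing it. Whether entity tokens should also be replaced is a design choice this sketch does not settle.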

ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding

ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding

ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation

ERNIE: Enhanced Language Representation with Informative Entities

ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation

ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora

ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training

An ERNIE-Based Joint Model for Chinese Named Entity Recognition

ERNIE: Enhanced Representation through Knowledge Integration

ERNIE-GeoL: A Geography-and-Language Pre-trained Model and its Applications in Baidu Maps

ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graphs

ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations

Coarse-to-Fine Pre-training for Named Entity Recognition

ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation

ERNIE-Tiny: A Progressive Distillation Framework for Pretrained Transformer Compression

ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding

Named Entity Recognition in Chinese Medical Literature Using Pretraining Models

NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data