Abstract: The medical literature contains valuable knowledge, such as the clinical symptoms, diagnosis, and treatments of a particular disease. Named Entity Recognition (NER) is the initial step in extracting this knowledge from unstructured text and presenting it as a Knowledge Graph (KG). However, previous NER approaches have often been limited by small-scale, human-labelled training data. Furthermore, extracting knowledge from Chinese medical literature is more complex because Chinese text has no explicit segmentation between words. Recently, pretraining models, which learn representations carrying prior semantic knowledge from large-scale unlabelled corpora, have achieved state-of-the-art results on a wide variety of Natural Language Processing (NLP) tasks. However, the capabilities of pretraining models have not been fully exploited, and the application of pretraining models other than BERT to specific domains, such as NER in Chinese medical literature, is also of interest. In this paper, we enhance the performance of NER in Chinese medical literature using pretraining models. First, we propose a data augmentation method that replaces words in the training set with synonyms predicted by the Masked Language Model (MLM), which is a pretraining task. Then, we treat NER as the downstream task of the pretraining model and transfer the prior semantic knowledge obtained during pretraining to it. Finally, we conduct experiments to compare the performance of six pretraining models (BERT, BERT-WWM, BERT-WWM-EXT, ERNIE, ERNIE-tiny, and RoBERTa) in recognizing named entities in Chinese medical literature. The effects of feature extraction versus fine-tuning, as well as of different downstream model structures, are also explored. Experimental results demonstrate that the proposed data augmentation method yields meaningful improvements in recognition performance. In addition, RoBERTa-CRF achieves the highest F1-score compared with previous methods and the other pretraining models.
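
To make the augmentation idea concrete, the following is a minimal sketch of MLM-based synonym replacement for NER training data. It is not the authors' implementation: the function names, the `predict_masked` callback, the `replace_prob` parameter, and the choice to replace only tokens labelled `O` (so that entity labels stay aligned) are all assumptions made for illustration. The toy predictor is a hypothetical stand-in for a real pretrained masked language model.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

# Assumed interface: given a masked sentence and the masked position,
# return replacement candidates ranked from most to least likely.
Predictor = Callable[[List[str], int], List[str]]


def augment_with_mlm(
    tokens: Sequence[str],
    labels: Sequence[str],
    predict_masked: Predictor,
    replace_prob: float = 0.15,
    rng: Optional[random.Random] = None,
    mask_token: str = "[MASK]",
) -> Tuple[List[str], List[str]]:
    """Return an augmented copy of (tokens, labels).

    Assumption: only tokens labelled "O" are replaced, so the entity
    annotations of the original sentence remain valid.
    """
    rng = rng or random.Random()
    new_tokens = list(tokens)
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        if lab != "O":
            continue  # keep entity tokens untouched
        if rng.random() >= replace_prob:
            continue  # this token was not selected for replacement
        masked = list(new_tokens)
        masked[i] = mask_token
        # Drop the original token so the sentence actually changes.
        candidates = [c for c in predict_masked(masked, i) if c != tok]
        if candidates:
            new_tokens[i] = candidates[0]
    return new_tokens, list(labels)


def toy_predict_masked(masked_tokens: List[str], position: int) -> List[str]:
    """Hypothetical stand-in for an MLM: looks up candidates by neighbours."""
    left = masked_tokens[position - 1] if position > 0 else None
    right = (
        masked_tokens[position + 1]
        if position + 1 < len(masked_tokens)
        else None
    )
    context_table = {("患者", "发热"): ["出现", "伴有"]}
    return context_table.get((left, right), [])


tokens = ["患者", "出现", "发热", "症状"]
labels = ["O", "O", "B-SYM", "O"]
aug_tokens, aug_labels = augment_with_mlm(
    tokens, labels, toy_predict_masked, replace_prob=1.0, rng=random.Random(0)
)
print(aug_tokens, aug_labels)
```

In practice the toy predictor would be replaced by a call to a pretrained model's masked-token prediction, and `replace_prob` would be tuned so that augmented sentences stay close in meaning to the originals.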

A Comprehensive Data Preprocessing Framework Towards Improving Internet Chinese Medical Data Quality

Chinese MentalBERT: Domain-Adaptive Pre-training on Social Media for Chinese Mental Health Text Analysis

ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences

LCMDC: Large-scale Chinese Medical Dialogue Corpora for Automatic Triage and Medical Consultation

Data Annotation and Preprocessing

A Data-centric Framework for Improving Domain-specific Machine Reading Comprehension Datasets

A Feasible Chinese Text Data Preprocessing Strategy

Data Evaluation and Enhancement for Quality Improvement of Machine Learning

Crowdsourcing with Enhanced Data Quality Assurance: An Efficient Approach to Mitigate Resource Scarcity Challenges in Training Large Language Models for Healthcare

Efficient Fine-Tuning of Large Language Models for Automated Medical Documentation

Building Chinese Biomedical Language Models via Multi-Level Text Discrimination

CareBot: A Pioneering Full-Process Open-Source Medical Language Model

ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information

Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue

An Integrated Data Processing Framework for Pretraining Foundation Models

On the Generation of Medical Dialogues for COVID-19

Impact of high-quality, mixed-domain data on the performance of medical language models

Named Entity Recognition in Chinese Medical Literature Using Pretraining Models

Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model

A benchmark for automatic medical consultation system: frameworks, tasks and datasets