A weakly supervised method for named entity recognition of Chinese electronic medical records

Meng Li, Chunrong Gao, Kuang Zhang, Huajian Zhou, Jing Ying
DOI: https://doi.org/10.1007/s11517-023-02871-6
2023-07-15
Medical & Biological Engineering & Computing
Abstract: The field of Chinese medical natural language processing faces a significant challenge in training accurate entity recognition models due to the limited availability of high-quality labeled data. In response, we propose a joint training model, MCBERT-GCN-CRF, which achieves high performance in identifying medical-related entities in Chinese electronic medical records. Additionally, we introduce CM-NER, a 5-step framework that effectively mitigates the effects of noise in weakly labeled data and establishes a principled connection between supervised and weakly supervised named entity recognition. We demonstrate significant improvements in recall rate and accuracy. Our approach outperforms traditional fully supervised pre-training models and other state-of-the-art methods by suppressing noise in weakly labeled data. Our proposed framework achieves an F1 score of 86.29% on the CCKS-2019 dataset, significantly higher than pre-trained model baselines ranging from 74.17% to 83.06%, and higher than the top-performing named entity recognition supervised learning models in the CCKS-2019 competition. Our results demonstrate the effectiveness of our proposed framework and highlight the potential of leveraging unlabeled data to train accurate models for named entity recognition in Chinese medical natural language processing. This research has significant implications for advancing natural language processing techniques in the medical domain and improving patient care.
engineering, biomedical; computer science, interdisciplinary applications; mathematical & computational biology; medical informatics
What problem does this paper attempt to address?
This paper addresses entity recognition in Chinese electronic medical records, and in particular the difficulty of training a high-precision entity recognition model when high-quality labeled data is scarce. It proposes a jointly trained model (MCBERT-GCN-CRF) combined with a weakly supervised learning method, together with a five-step framework (CM-NER), aiming to improve model performance by using large-scale unlabeled data while reducing the manual labeling workload. According to the authors, the method suppresses noise in weakly labeled data and significantly improves recall and accuracy, yielding more effective entity recognition in Chinese medical natural language processing.
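The abstract reports results as an F1 score, which is the harmonic mean of precision and recall. As a minimal sketch of how such a score can be computed for named entity recognition, the snippet below assumes entity-level exact matching on (start, end, type) spans. That convention is common in NER evaluation, but the source does not state which scoring protocol was used, so treat it as an assumption. The function name `entity_f1` and the toy annotations are hypothetical and are not taken from the paper or the CCKS-2019 data.

```python
from typing import Set, Tuple

# An entity is represented as (start offset, end offset, entity type).
Entity = Tuple[int, int, str]


def entity_f1(gold: Set[Entity], pred: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-type matching.

    Assumption: a prediction counts as correct only if its boundaries and
    its type both match a gold entity exactly.
    """
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy annotations for illustration only.
gold = {(0, 2, "disease"), (5, 8, "drug"), (10, 13, "anatomy")}
pred = {(0, 2, "disease"), (5, 8, "drug"), (14, 16, "drug"), (20, 22, "disease")}

precision, recall, f1 = entity_f1(gold, pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

In this toy case two of the four predictions are correct and two of the three gold entities are found, so precision is 0.5, recall is about 0.667, and F1 is about 0.571. Noisy weak labels can lower either quantity, which is why the summary discusses recall and accuracy alongside noise suppression.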