A Hybrid Machine Learning Method for the De-identification of Un-Structured Narrative Clinical Text in Multi-center Chinese Electronic Medical Records Data

Meng Jin, Kai Zhang, Yunhaonan Yang, Shuanglian Xie, Kai Song, Yonghua Hu, Xiaoyuan Bao
DOI: https://doi.org/10.1109/ICBK.2019.00023
2019-01-01
Abstract: Making full use of unstructured electronic medical records presupposes full protection of patients' information privacy. Identifying and removing information that could be used to identify a patient, prior to processing electronic medical record data, is currently a research hotspot. Few de-identification methods exist for Chinese electronic medical records, and their cross-center performance is poor. We therefore develop a de-identification method that combines rule-based methods and machine learning methods. The method was tested on 700 electronic medical records from six hospitals. Five-fold cross-validation was used to evaluate C5.0, Random Forest, SVM, and XGBoost; a leave-one-out test was used to evaluate CRF. The F1 measure of the machine learning methods reached 91.18% in PHI_Names, 98.21% in PHI_MEDICALID, 95.74% in PHI_OTHERNFC, 97.14% in PHI_GEO, 89.19% in PHI_DATES, and 91.49% in PHI_TEL. The F1 measure of the rule-based methods reached 93.00% in PHI_Names, 97.00% in PHI_MEDICALID, 97.00% in PHI_OTHERNFC, 97.00% in PHI_GEO, 96.00% in PHI_DATES, and 89.00% in PHI_TEL.
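As a rough illustration of the rule-based side and of the F1 measure reported above, the following is a minimal sketch rather than the authors' implementation. The regular expressions, the function names `tag_phi` and `f1_score`, and the span representation are all assumptions made for this example; only the category labels PHI_TEL and PHI_DATES come from the abstract.

```python
import re

# Hypothetical patterns for illustration only; the paper's actual rules are not given.
RULES = {
    # 11-digit number starting with 1, not embedded in a longer digit run
    "PHI_TEL": re.compile(r"(?<!\d)1\d{10}(?!\d)"),
    # Dates written as 2018年3月5日 or 2018-3-5
    "PHI_DATES": re.compile(r"\d{4}年\d{1,2}月\d{1,2}日|\d{4}-\d{1,2}-\d{1,2}"),
}


def tag_phi(text):
    """Return the set of (label, start, end) spans matched by the rules."""
    spans = set()
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.add((label, match.start(), match.end()))
    return spans


def f1_score(predicted, gold):
    """Exact-match span F1: harmonic mean of precision and recall."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Calling `tag_phi` on a sentence and comparing the result with manually annotated spans via `f1_score` gives a per-document score; a per-category figure like those in the abstract would filter both sets by label first.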