Developing a linguistically annotated corpus of Chinese electronic medical record

Zhipeng Jiang, Fangfang Zhao, Yi Guan
DOI: https://doi.org/10.1109/BIBM.2014.6999174
2014-01-01
Abstract: Electronic medical records (EMRs) are the material basis of smart healthcare, and their automatic analysis depends on natural language processing (NLP) technologies. Syntactic analysis, a basic NLP technology, can be used to convert the free text of EMRs into structured text. However, research on syntactic analysis of Chinese electronic medical records (CEMRs), and even on Chinese word segmentation and part-of-speech (POS) tagging for them, has barely begun because of the lack of an annotated CEMR corpus. To resolve this problem, we propose an annotation scheme that spans Chinese word segmentation through syntactic analysis, and build the first syntactically annotated corpus of CEMRs. By analyzing the annotated CEMRs, we find that they have stronger grammatical regularity and a particular statistical distribution. We use these findings to improve the Stanford parser and to develop a state-of-the-art Chinese word segmentation and POS tagging system for CEMRs. The evaluation results show that the annotated CEMRs substantially benefit statistical machine learning models.