Chemical and Biological Entity Recognition System from Patent Documents

Hongchang LAI, Lijun ZHU, Shuo XU
DOI: https://doi.org/10.3772/j.issn.2095-915x.2015.04.011
2015-01-01
Abstract: It is crucial to explore the chemical and biological space covered by patent documents. To recognize chemical and biological entities, a recognition system is developed on the basis of open-source machine learning and natural language processing (NLP) toolkits. The processing pipeline consists of three major components: pre-processing (sentence detection and tokenization), recognition (a conditional random field (CRF) based approach), and post-processing (a rule-based approach). The paper introduces each component in detail. Finally, extensive experiments are conducted on an annotated chemical patent corpus, and the balanced F-measure is 69.20% with 10-fold cross-validation. The results indicate that performance on patent documents is slightly lower than the corresponding performance on paper and news corpora.
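
The three-stage pipeline and the balanced F-measure named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the toolkits, features, or rules used, so every function name, regular expression, feature, and the parenthesis rule below is a hypothetical stand-in. The CRF model itself is omitted; the sketch only shows the kind of per-token features such a model might consume.

```python
import re


def split_sentences(text):
    """Pre-processing (assumed rule): split after ., ! or ? when followed
    by whitespace and an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Pre-processing (assumed rule): alphanumeric runs and single symbols
    become separate tokens, which suits hyphenated chemical names."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", sentence)


def token_features(tokens, i):
    """Recognition: hypothetical features for token i that a CRF tagger
    might use. The feature set is an assumption, not taken from the paper."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "has_digit": any(c.isdigit() for c in tok),
        "is_upper": tok.isupper(),
        "suffix3": tok[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def has_balanced_parentheses(mention):
    """Post-processing: one hypothetical rule that rejects a predicted
    mention whose parentheses do not match."""
    depth = 0
    for ch in mention:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def balanced_f(tp, fp, fn):
    """Balanced F-measure: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In a 10-fold cross-validation setting, the annotated corpus would be split into ten parts, with `balanced_f` computed from the true-positive, false-positive, and false-negative counts of each held-out part; how the paper aggregates the per-fold scores is not stated in the abstract.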