Extraction Approach of Patent Information Based on Regular Expression

Qiu Qingying, Zheng Guomin, Feng Pei'en, Wu Jianwei
DOI: https://doi.org/10.3321/j.issn:1004-132x.2007.19.014
2007-01-01
Abstract: Because current patent documents are stored in image-based formats such as .TIF and .PDF, they are difficult to search in full text or to analyze further. An approach that combines an optical character recognition (OCR) tool with fault-tolerant regular expressions is proposed for digitizing patents and extracting information according to the structural features of patent documents. A software system was developed to support batch extraction of patent information, providing data resources for subsequent automatic patent analysis and knowledge mining.
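The abstract does not give the patterns themselves, so the following is only a minimal sketch of what a fault-tolerant regular expression over OCR output might look like. The field labels ("Patent No", "Inventor"), the character-confusion table, and the names `tolerant` and `extract_fields` are all assumptions made for illustration, not the authors' implementation. The idea shown is that each label character is widened to the set of glyphs OCR commonly confuses it with, and misread digits are normalized after matching.

```python
import re

# Assumed OCR confusion classes (hypothetical, not from the paper):
# each label character is widened to glyphs OCR may emit instead.
CONFUSIONS = {"o": "[o0]", "i": "[i1l|]", "l": "[l1i|]", "s": "[s5]"}

# Assumed mapping from letters commonly misread in place of digits.
DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)


def tolerant(label: str) -> str:
    """Build a regex fragment that matches `label` despite typical OCR errors."""
    parts = []
    for ch in label:
        if ch.isspace():
            parts.append(r"\s*")  # OCR may drop or duplicate spaces
        else:
            parts.append(CONFUSIONS.get(ch.lower(), re.escape(ch)))
    return "".join(parts)


# Hypothetical field patterns keyed on the structural labels of a patent page.
FIELD_PATTERNS = {
    "patent_number": re.compile(
        tolerant("Patent No")
        + r"\s*[.:;]*\s*(?P<value>[0-9OoIlSB][0-9OoIlSB,. ]{4,12})",
        re.IGNORECASE,
    ),
    "inventor": re.compile(
        tolerant("Inventor") + r"s?\s*[.:;]*\s*(?P<value>[^\n]+)",
        re.IGNORECASE,
    ),
}


def extract_fields(text: str) -> dict:
    """Extract the fields found in OCR text; fields that do not match are omitted."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            continue
        value = match.group("value").strip()
        if name == "patent_number":
            # Repair misread digits, then drop separators.
            value = re.sub(r"[^0-9]", "", value.translate(DIGIT_FIXES))
        record[name] = value
    return record


if __name__ == "__main__":
    ocr_text = "Patent N0.: 5,I23,4S6\nlnventor: John Smith\n"
    print(extract_fields(ocr_text))
```

Batch extraction of the kind the abstract mentions could then be approximated by calling `extract_fields` on the OCR text of each document in turn and collecting the resulting records.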