An Integrated Approach to Heterogeneous Data for Information Extraction.

Ying Chen, Sophia Yat Mei Lee, Chu-Ren Huang
2009-01-01
Abstract: This paper proposes an integrated framework for extracting personal information from the web, such as biographical information and occupation; this information is needed to construct a social network (a kind of semantic web) for a person. Because web data is heterogeneous in nature, most information extraction (IE) systems, whether for named entity recognition (NER) or relation detection and recognition (RDR), fail to produce robust results. We propose a flexible framework that complements state-of-the-art statistical IE systems with rule-based IE systems for web data and achieves substantial improvement over existing systems. In particular, in our current experiment, both the rule-based IE system, which is designed around web-specific expression patterns, and the statistical IE systems, which were developed for homogeneous corpora, are sensitive only to specific information types. Hence we argue that our system's performance can be incrementally improved as new and effective IE systems are added to our framework.