Extract List Data from Semi-Structured Document Using Clustering

H Xu, JZ Li, P Xu
DOI: https://doi.org/10.1109/nlpke.2005.1598800
2005-01-01
Abstract: This paper is concerned with list data extraction from semi-structured documents. By list data extraction, we mean extracting data from lists and grouping it by rows and columns. Lists, which have structured characteristics, are used to store highly structured, database-like information in many semi-structured documents, such as business annual reports, online airport listings, catalogs, and hotel directories. List data extraction benefits text mining applications on semi-structured documents. Several research efforts have addressed structured data extraction from semi-structured documents by utilizing word layout and arrangement information. However, as far as we know, few studies have investigated list data extraction that also makes use of semantic information. In this paper, we propose a clustering-based method that makes use of not only the layout and arrangement information but also the semantic information of words for this extraction task. We show experimental results on plain-text annual reports from the Shanghai Stock Exchange, in which 73.49% of the lists were extracted correctly.