Web Data Extraction System Based on Label Library

Shoubiao Tan, Chao Xu, Yuan Jiang
DOI: https://doi.org/10.1109/fskd.2009.208
2009-01-01
Abstract: This paper proposes a Web information extraction system based on a label library for extracting information from data-intensive web pages. The system downloads dynamic web pages with the help of a knowledge database, converts them to XML documents after preprocessing, mines data regions using the MDR repeated-pattern discovery algorithm, recognizes the structure of those regions and extracts data from them with a novel hierarchical pattern recognition and data extraction algorithm based on the label library, and stores the extracted data in the knowledge database to support further information extraction. Experiments showed that the system achieves high precision and adapts to web pages from different domains and with different structures.
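
The pipeline outlined in the abstract (XML document, data-region mining, label-based extraction) can be illustrated with a rough sketch. This is not the authors' implementation: the region finder below is a much simpler stand-in for MDR that only groups adjacent sibling nodes with identical tag structure, and the names `LABEL_LIBRARY`, `find_data_regions`, `extract_record`, and `extract_all`, as well as the alternating label/value layout of each record, are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical label library: maps label text found on pages to canonical
# field names. The real system's library format is not given in the abstract.
LABEL_LIBRARY = {"name": "title", "title": "title", "price": "price", "cost": "price"}


def tag_signature(node):
    """Return the node's tag structure as a nested tuple, ignoring all text."""
    return (node.tag, tuple(tag_signature(child) for child in node))


def find_data_regions(root, min_records=2):
    """Group adjacent non-leaf siblings that share one tag signature.

    Simplified stand-in for MDR: it requires an exact structural match
    rather than an approximate comparison of tag strings.
    """
    regions = []
    for parent in root.iter():
        run = []
        for child in parent:
            is_record = len(child) > 0  # leaf nodes cannot be records
            if is_record and run and tag_signature(child) == tag_signature(run[-1]):
                run.append(child)
            else:
                if len(run) >= min_records:
                    regions.append(run)
                run = [child] if is_record else []
        if len(run) >= min_records:
            regions.append(run)
    return regions


def normalize(label):
    """Lower-case a label and drop surrounding whitespace and a trailing colon."""
    return label.strip().rstrip(":").strip().lower()


def extract_record(record):
    """Map label/value child pairs to canonical fields (assumed layout)."""
    fields = {}
    children = list(record)
    for label_node, value_node in zip(children[::2], children[1::2]):
        field = LABEL_LIBRARY.get(normalize(label_node.text or ""))
        if field is not None:
            fields[field] = (value_node.text or "").strip()
    return fields


def extract_all(xml_text):
    """Parse an XML document and extract every record from every region."""
    root = ET.fromstring(xml_text)
    return [extract_record(r) for region in find_data_regions(root) for r in region]
```

Calling `extract_all` on a preprocessed page returns one dictionary per detected record, which a fuller system could then write back to its knowledge database. Labels missing from the library are skipped in this sketch rather than guessed.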