EGA: An Algorithm for Automatic Semi-structured Web Documents Extraction

Liyu Li, Shiwei Tang, Dongqing Yang, Tengjiao Wang, Zhihua Su
DOI: https://doi.org/10.1007/978-3-540-24571-1_69
2004-01-01
Abstract: With the rapid expansion of the World Wide Web, more and more semi-structured documents appear on the web. In this paper, we study how to extract information from semi-structured web documents using automatically generated wrappers. To automate wrapper generation and data extraction, we develop a novel algorithm, EGA (EPattern Generation Algorithm), which generates extraction patterns based on the local structural context features of the web documents. These optimal or near-optimal extraction patterns are described in the XPath language. Experimental results on RISE and our own data sets confirm the feasibility of our approach.
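To make the idea of an XPath-described extraction pattern concrete, the following is a minimal sketch of how a wrapper might apply such patterns to a page. It does not reproduce EGA itself: the patterns here are written by hand rather than generated, the sample page and the names `PAGE`, `PATTERNS`, and `extract` are hypothetical, and the code relies on the limited XPath subset supported by Python's standard `xml.etree.ElementTree` module, which requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (real web pages usually need cleaning first).
PAGE = """
<html><body>
  <table>
    <tr class="header"><td>Title</td><td>Price</td></tr>
    <tr class="item"><td>Data Mining</td><td>35.00</td></tr>
    <tr class="item"><td>Web Databases</td><td>42.50</td></tr>
  </table>
</body></html>
"""

# Hand-written stand-ins for extraction patterns. Each one locates a field
# through its local structural context: the enclosing row and the cell position.
PATTERNS = {
    "title": ".//tr[@class='item']/td[1]",
    "price": ".//tr[@class='item']/td[2]",
}


def extract(page, patterns):
    """Apply each pattern to the page and zip the matches into records."""
    root = ET.fromstring(page.strip())
    columns = {
        name: [element.text for element in root.findall(path)]
        for name, path in patterns.items()
    }
    # Assumes every pattern matches the same number of nodes, in row order.
    return [dict(zip(columns, values)) for values in zip(*columns.values())]


records = extract(PAGE, PATTERNS)
for record in records:
    print(record)
```

In this sketch the header row is skipped because the patterns condition on the `class` attribute of the enclosing row, which illustrates how structural context can separate data cells from similar-looking markup.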