A robust approach of automatic web data record extraction

Yongquan Dong, Qingzhong Li
2009-01-01
Journal of Computational Information Systems
Abstract: The automatic extraction of Web data records is a key problem in Web data integration. The main bottleneck is the structural diversity of Web pages, which leads to low precision in Web data record extraction. A robust approach is proposed to automatically extract data records from Web pages. First, it uses visual information to identify the data region containing the data records. Second, it uses a clustering algorithm to group similar nodes. Finally, it uses the repetitive regularity of text nodes across different records to generate the final data records. Compared with many existing works, the approach is applicable both to pages on which a data record lies within a single subtree and to pages on which it spans different subtrees. Experimental results on real Web pages show that our approach achieves high extraction accuracy and substantially outperforms existing techniques. 1553-9105/ Copyright © 2009 Binary Information Press.
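The second step, grouping similar nodes, can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the clustering algorithm, so everything here is an assumption for illustration: the `Node` class is a hypothetical stand-in for a DOM node, similarity is taken as the `difflib.SequenceMatcher` ratio over pre-order tag sequences, and the clustering is a simple greedy pass with an arbitrary threshold of 0.8. It is not the authors' method.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List


@dataclass
class Node:
    """Hypothetical, minimal stand-in for a DOM node (not from the paper)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def tag_sequence(node: Node) -> List[str]:
    """Flatten a subtree into its pre-order list of tag names."""
    tags = [node.tag]
    for child in node.children:
        tags.extend(tag_sequence(child))
    return tags


def similarity(a: Node, b: Node) -> float:
    """Structural similarity in [0, 1]; an assumed measure, not the paper's."""
    return SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio()


def cluster_similar_nodes(nodes: List[Node], threshold: float = 0.8) -> List[List[Node]]:
    """Greedy single-pass clustering: a node joins the first group whose
    first member it resembles closely enough, else it starts a new group."""
    groups: List[List[Node]] = []
    for node in nodes:
        for group in groups:
            if similarity(group[0], node) >= threshold:
                group.append(node)
                break
        else:
            groups.append([node])
    return groups


# Two record-like subtrees with slightly different shapes, plus a separator.
record_a = Node("div", [Node("a"), Node("span"), Node("span")])
record_b = Node("div", [Node("a"), Node("span")])
separator = Node("hr")

groups = cluster_similar_nodes([record_a, separator, record_b])
for group in groups:
    print([node.tag for node in group])
```

In this toy input the two `div` subtrees fall into one group even though one has an extra `span`, while the `hr` separator forms its own group. A real extractor would also need the visual data-region detection and the text-node regularity step described in the abstract, which this sketch does not attempt.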