Approach for Web Data Extraction Based on XPath Comparison

CHEN Xiao-feng, ZHANG Ling, DONG Shou-bin
DOI: https://doi.org/10.3969/j.issn.1671-6841.2007.02.038
2007-01-01
Abstract: This paper studies how to extract data from a Web page that contains several data blocks. Comparing the XPaths of the individual data blocks shows that they are very similar. Based on this observation, an XPath-comparison-based Extraction Rules Generation Algorithm (XERG) is proposed. Once the data block extraction rules have been generated, the information inside each block can be extracted with relative XPath expressions or regular expressions. Experimental results show that the method obtains data blocks and extracts data from them with high accuracy.
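The idea of comparing block XPaths can be illustrated with a minimal sketch. It is not the XERG algorithm itself, whose details the abstract does not give. It assumes that sibling data blocks have XPaths of equal depth with the same tag at every step, and that a step whose positional index varies across blocks can be generalized by dropping the index. The names `split_steps`, `generalize`, `extract` and the sample `PAGE` are hypothetical, chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# One XPath step: a tag name with an optional positional index, e.g. "li[3]".
STEP = re.compile(r"^([^\[\]]+)(?:\[(\d+)\])?$")

def split_steps(xpath):
    """Split an absolute XPath into (tag, index) pairs; index is None if absent."""
    steps = []
    for part in xpath.strip("/").split("/"):
        m = STEP.match(part)
        if m is None:
            raise ValueError(f"unsupported step: {part!r}")
        tag, idx = m.group(1), m.group(2)
        steps.append((tag, int(idx) if idx else None))
    return steps

def generalize(xpaths):
    """Compare block XPaths step by step and drop indices that vary (assumed rule)."""
    paths = [split_steps(p) for p in xpaths]
    if len({len(p) for p in paths}) != 1:
        raise ValueError("paths differ in depth")
    out = []
    for column in zip(*paths):
        if len({tag for tag, _ in column}) != 1:
            raise ValueError("paths differ in tag names")
        tag, first_idx = column[0]
        indices = {idx for _, idx in column}
        if len(indices) == 1 and first_idx is not None:
            out.append(f"{tag}[{first_idx}]")  # same index everywhere: keep it
        else:
            out.append(tag)                    # index varies: match all siblings
    return "/" + "/".join(out)

# Hypothetical well-formed page with two data blocks (the <li> elements).
PAGE = """<html><body><div>nav</div><div><ul>
<li><a>Book A</a><span>Price: 12.50</span></li>
<li><a>Book B</a><span>Price: 8.00</span></li>
</ul></div></body></html>"""

def extract(page, rule):
    """Apply a block rule, then read fields via relative paths and a regex."""
    root = ET.fromstring(page)
    # ElementTree paths are relative to the root element, so drop the root step.
    relative_rule = "./" + rule.strip("/").split("/", 1)[1]
    rows = []
    for block in root.findall(relative_rule):
        title = block.find("./a").text                     # relative path inside the block
        price = re.search(r"\d+\.\d+", block.find("./span").text).group()
        rows.append((title, price))
    return rows

rule = generalize([
    "/html/body/div[2]/ul/li[1]",
    "/html/body/div[2]/ul/li[2]",
])
rows = extract(PAGE, rule)
```

Here `rule` becomes `/html/body/div[2]/ul/li`, which matches every block in the list, and `rows` holds one (title, price) pair per block. Real Web pages are rarely well-formed XML, so a practical version would need an HTML-tolerant parser in place of `xml.etree.ElementTree`.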