Web information extraction algorithm based on Web page segmentation

Hou Mingyan, Yang Tianqi
DOI: https://doi.org/10.3969/j.issn.1674-7720.2011.05.019
2011-01-01
Abstract: This paper proposes a Web information extraction algorithm based on Web page segmentation to reduce the high complexity of extracting unstructured information. The method first preprocesses the page to remove noise, then clusters tag paths according to the page's Document Object Model (DOM) tree structure. An automatically trained threshold and a Web page segmentation algorithm quickly locate the key part of the page, and text extraction templates are derived from the nesting structure within each data block. Experimental results on different kinds of Web sites show that the algorithm is fast and accurate.
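The abstract names the pipeline stages but gives no implementation details, so the following is only a minimal sketch of the general idea: drop noisy subtrees, group text nodes by their DOM tag path, and keep the dominant group as the key part of the page. Everything in it is an assumption made for illustration, not the authors' method: the `NOISE_TAGS` set, the names `TagPathCollector`, `cluster_by_tag_path`, and `main_block`, and above all the fixed `min_share` threshold, which stands in for the automatically trained threshold the paper describes.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical noise list; the paper's actual noise pretreatment is not specified.
NOISE_TAGS = {"script", "style", "nav", "footer", "header", "aside"}
# Void elements never get an end tag, so they must not be pushed on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Groups visible text by tag path, e.g. 'html/body/div/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, root first
        self.noise_depth = 0     # > 0 while inside any noise subtree
        self.groups = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or tag not in self.stack:
            return
        # Pop until the matching tag, tolerating unclosed inner tags.
        while self.stack:
            popped = self.stack.pop()
            if popped in NOISE_TAGS:
                self.noise_depth -= 1
            if popped == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.groups["/".join(self.stack)].append(text)


def cluster_by_tag_path(html):
    """Return {tag_path: [text, ...]} for all text outside noise subtrees."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.groups)


def main_block(html, min_share=0.5):
    """Return (tag_path, texts) of the cluster holding the most text.

    min_share is a hand-picked stand-in for a trained threshold: the
    cluster is accepted only if it holds at least that fraction of all
    remaining characters; otherwise (None, []) is returned.
    """
    groups = cluster_by_tag_path(html)
    if not groups:
        return None, []
    sizes = {path: sum(len(t) for t in texts) for path, texts in groups.items()}
    best = max(sizes, key=sizes.get)
    if sizes[best] / sum(sizes.values()) < min_share:
        return None, []
    return best, groups[best]
```

Calling `main_block(page_html)` on a typical article page would be expected to return the tag path shared by the body paragraphs together with their text. Grouping by exact tag path is the simplest possible clustering; a fuller treatment would also need to derive extraction templates from the nesting structure inside each block, which this sketch does not attempt.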