Web Page Segmentation and Its Application for Web Information Crawling

Hanyang Feng, Wenzhe Zhang, Hesheng Wu, Chong-Jun Wang
DOI: https://doi.org/10.1109/ictai.2016.0097
2016-01-01
Abstract: Web page segmentation aims to break a page into sections that reveal its information presentation structure and appear coherent to readers. In this paper, we propose a new web page segmentation framework based on the process of analyzing and understanding web page structure. After extracting the segmentation graph structure, we formulate the label assignment task, which determines whether each boundary on the graph should segment the current block, as a structured learning problem. The highest-scoring label assignment is computed with the Viterbi algorithm, and a joint feature function captures the dependencies among boundaries. To learn the parameters, we adopt a learning model based on the perceptron algorithm. Furthermore, building on this framework, we propose a web information crawling application framework that integrates web page segmentation and semantic block classification.
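The combination of Viterbi decoding and perceptron learning named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the boundaries form a simple chain rather than the paper's segmentation graph, and the names `viterbi`, `train`, `w_emit`, and `w_trans`, as well as the per-boundary feature vectors, are hypothetical. The joint feature function is approximated here by per-label feature weights plus label-to-label transition weights.

```python
# Minimal structured-perceptron sketch (assumption: boundaries form a chain).
LABELS = (0, 1)  # 0 = boundary does not segment the block, 1 = it does


def viterbi(feats, w_emit, w_trans):
    """Return the highest-scoring label sequence for a chain of boundaries."""
    n = len(feats)
    if n == 0:
        return []

    def emit(i, y):
        # Score of assigning label y to boundary i from its own features.
        return sum(w * x for w, x in zip(w_emit[y], feats[i]))

    score = [[emit(0, y) for y in LABELS]]
    back = []  # back[i - 1][y] = best previous label when boundary i takes y
    for i in range(1, n):
        row, ptr = [], []
        for y in LABELS:
            best_prev = max(LABELS, key=lambda p: score[-1][p] + w_trans[p][y])
            row.append(score[-1][best_prev] + w_trans[best_prev][y] + emit(i, y))
            ptr.append(best_prev)
        score.append(row)
        back.append(ptr)

    # Follow the back-pointers from the best final label.
    y = max(LABELS, key=lambda l: score[-1][l])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    path.reverse()
    return path


def train(data, dim, epochs=10):
    """Perceptron updates: reward gold features, penalize predicted ones."""
    w_emit = [[0.0] * dim for _ in LABELS]
    w_trans = [[0.0] * len(LABELS) for _ in LABELS]
    for _ in range(epochs):
        for feats, gold in data:
            pred = viterbi(feats, w_emit, w_trans)
            if pred == gold:
                continue
            for i, x in enumerate(feats):
                for j, v in enumerate(x):
                    w_emit[gold[i]][j] += v
                    w_emit[pred[i]][j] -= v
                if i > 0:
                    w_trans[gold[i - 1]][gold[i]] += 1.0
                    w_trans[pred[i - 1]][pred[i]] -= 1.0
    return w_emit, w_trans
```

Calling `train(data, dim)` on pairs of (feature vectors, gold labels) returns the two weight tables, which `viterbi` then uses to label the boundaries of a new page. The transition weights are what let one boundary's label influence its neighbor's, which is the role the abstract assigns to the joint feature function.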