A novel graph partition based page segmentation algorithm

Ye Yumming, Li Chunshan, Zhang Xiaofeng
2012-01-01
Information
Abstract: In mobile device browsing and web mining applications, segmenting web pages into small semantic modules is an essential and helpful technique. Traditional segmentation algorithms exploit only heuristic, explicit information from the DOM tree and ignore the important semantic information hidden in its subtree structures. This results in poor performance when segmenting complicated web pages. In this paper, we cast the page segmentation problem as a graph partition problem, in which each DOM tree is represented as a weighted graph. We then propose the graph partition based page segmentation algorithm (GPPS), which identifies semantic modules by mining the subtree structures of DOM trees. The proposed approach is evaluated through rigorous experiments on various web datasets as well as a large-scale dataset. The experimental results demonstrate that GPPS is superior to VIPS in terms of precision, recall, F-score, and time consumption. ©2012 International Information Institute.
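To make the idea of treating a DOM tree as a weighted graph and then partitioning it more concrete, the following is a minimal sketch only. It is not the GPPS algorithm, whose weighting scheme and partition criterion are not given in the abstract. The toy DOM, the edge-weight rule (1.0 when all children of a node share one tag, 0.5 otherwise), the threshold of 0.75, and the function names `build_graph` and `partition` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy DOM as nested (tag, children) tuples.
# Node ids are assigned in preorder: body=0, div=1, p=2, p=3, ul=4, li=5..7.
DOM = ("body", [
    ("div", [("p", []), ("p", [])]),
    ("ul", [("li", []), ("li", []), ("li", [])]),
])


def build_graph(dom):
    """Flatten the tree into a tag list and weighted parent-child edges."""
    tags, edges = [], []

    def visit(node):
        tag, children = node
        idx = len(tags)
        tags.append(tag)
        # Assumed weighting rule (not from the paper): children that all
        # share one tag are treated as strongly tied to their parent.
        child_tags = {child[0] for child in children}
        weight = 1.0 if len(child_tags) == 1 else 0.5
        for child in children:
            child_idx = visit(child)
            edges.append((idx, child_idx, weight))
        return idx

    visit(dom)
    return tags, edges


def partition(n_nodes, edges, threshold=0.75):
    """Drop edges lighter than the threshold; return connected components."""
    parent = list(range(n_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, w in edges:
        if w >= threshold:
            parent[find(u)] = find(v)

    groups = defaultdict(list)
    for node in range(n_nodes):
        groups[find(node)].append(node)
    return sorted(groups.values())


tags, edges = build_graph(DOM)
modules = partition(len(tags), edges)
print(modules)  # [[0], [1, 2, 3], [4, 5, 6, 7]]
```

In this toy case the weak edges from `body` are cut, so the `div` with its paragraphs and the `ul` with its list items come out as separate candidate modules. A real partitioner would use a richer weight function and a global cut objective rather than a fixed threshold.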