Constructing Novel Block Layouts for Webpage Analysis
Zexun Jiang, Hao Yin, Yulei Wu, Yongqiang Lyu, Geyong Min, Xu Zhang
DOI: https://doi.org/10.1145/3326457
IF: 5.3
2019-01-01
ACM Transactions on Internet Technology
Abstract: Webpage segmentation is a basic building block for a wide range of webpage analysis methods. The rapid development of Web technologies has produced more dynamic and complex webpages, which bring new challenges to this area. To improve the performance of webpage segmentation, we propose a two-stage segmentation method that combines visual, logical, and semantic features of the contents of a webpage. Specifically, in the first stage we devise a new model that measures the similarity of webpage elements based on both visual layout and logical organization, and in the second stage we propose a novel block regrouping method that uses semantic statistics and visual positions. This two-stage method can effectively segment complicated and dynamic webpages. The performance and accuracy of the method are verified by comparing it with two existing webpage segmentation methods. The experimental results show that the proposed method significantly outperforms the existing state of the art in terms of precision, recall, and accuracy.
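The two stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' model: the abstract gives no formulas, so the `Element` representation, the similarity functions, the weight `alpha`, and every threshold below are assumptions chosen only to show the overall shape of such a pipeline (group elements by combined visual and DOM similarity, then regroup blocks by shared vocabulary and position).

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Element:
    """A rendered page element (hypothetical representation)."""
    dom_path: tuple  # tag names from the root, e.g. ("html", "body", "nav", "a")
    box: tuple       # (x, y, width, height) in pixels
    text: str


def logic_similarity(a, b):
    """Fraction of the DOM path shared from the root (assumed measure)."""
    shared = 0
    for x, y in zip(a.dom_path, b.dom_path):
        if x != y:
            break
        shared += 1
    return shared / max(len(a.dom_path), len(b.dom_path))


def visual_similarity(a, b, scale=200.0):
    """Closeness of the two box centres, mapped into (0, 1] (assumed measure)."""
    ax, ay = a.box[0] + a.box[2] / 2, a.box[1] + a.box[3] / 2
    bx, by = b.box[0] + b.box[2] / 2, b.box[1] + b.box[3] / 2
    dist = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
    return 1.0 / (1.0 + dist / scale)


def similarity(a, b, alpha=0.5):
    """Weighted mix of visual and logical similarity; alpha is an assumption."""
    return alpha * visual_similarity(a, b) + (1 - alpha) * logic_similarity(a, b)


def stage_one(elements, threshold=0.7):
    """Stage 1: union elements whose pairwise similarity reaches the threshold."""
    parent = list(range(len(elements)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(elements)), 2):
        if similarity(elements[i], elements[j]) >= threshold:
            parent[find(i)] = find(j)

    blocks = {}
    for i, el in enumerate(elements):
        blocks.setdefault(find(i), []).append(el)
    return list(blocks.values())


def tokens(block):
    """Lower-cased word set of a block, a crude stand-in for semantic statistics."""
    return {w.lower() for el in block for w in el.text.split()}


def vertical_gap(a, b):
    """Vertical distance between two blocks' extents (0 if they overlap)."""
    top_a = min(e.box[1] for e in a)
    bottom_a = max(e.box[1] + e.box[3] for e in a)
    top_b = min(e.box[1] for e in b)
    bottom_b = max(e.box[1] + e.box[3] for e in b)
    return max(0, top_b - bottom_a, top_a - bottom_b)


def stage_two(blocks, min_jaccard=0.3, max_gap=40):
    """Stage 2: merge blocks that share vocabulary and sit close together."""
    merged = []
    for block in blocks:
        for target in merged:
            ta, tb = tokens(target), tokens(block)
            union = ta | tb
            jaccard = len(ta & tb) / len(union) if union else 0.0
            if jaccard >= min_jaccard and vertical_gap(target, block) <= max_gap:
                target.extend(block)
                break
        else:
            merged.append(list(block))
    return merged
```

In this sketch, two navigation links with identical DOM paths would be grouped in the first stage, while two article paragraphs that sit at different DOM depths would only be joined in the second stage, once their shared vocabulary and vertical adjacency are taken into account.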