A Wrapper for Extracting Information Records from Forums Based on Page Segmentation

Chunshan Li, Jingjing Chen, Dianhui Chu, Ge Song, Haijun Zhang, Yunming Ye, Jianliang Xu
DOI: https://doi.org/10.14257/ijdta.2014.7.4.02
2014-01-01
International Journal of Database Theory and Application
Abstract: Foraging information from web forums is still one of the most challenging information retrieval tasks because forum pages combine auto-generated structural information with user-created content in many different ways. Traditional information extraction methods employ either duplicated subtree pattern detection or machine learning. Because forum templates are updated periodically and page contents are diverse, these approaches do not work very well on forum sites. In this paper, we present a page-segmentation-based wrapper specially designed for mining the data patterns of web forums, which combines a novel page segmentation algorithm with a decision tree classifier to detect the data patterns in a forum. In the segmentation phase, the page segmentation algorithm identifies the record areas in a page; in the extraction phase, the classifier identifies the detailed pattern of each record. Extensive experiments on various types of web forums are conducted, and the results show that our wrapper is more general than existing approaches and requires only a few labeled training examples.
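The two-phase design described in the abstract can be illustrated with a small Python sketch. This is not the authors' algorithm: the segmentation step here is a naive stand-in that picks the container whose children most often share the same tag structure, and the classification step is a single hand-written length rule standing in for a trained decision tree. All names (`TreeBuilder`, `find_record_area`, `classify_field`, `extract_records`, `PAGE`) and the sample page are hypothetical and chosen only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class Node:
    """Minimal DOM node: tag name, parent, children and direct text."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simplified tree from well-formed HTML (assumption)."""
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def iter_nodes(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def signature(node):
    """Structural signature: the node's tag plus its children's tags."""
    return (node.tag, tuple(c.tag for c in node.children))


def find_record_area(root):
    """Phase 1 (naive stand-in): return the sibling nodes that repeat
    the same structural signature most often under one parent."""
    best, best_count, best_sig = None, 1, None
    for node in iter_nodes(root):
        if not node.children:
            continue
        counts = Counter(signature(c) for c in node.children)
        sig, count = counts.most_common(1)[0]
        if count > best_count:
            best, best_count, best_sig = node, count, sig
    if best is None:
        return []
    return [c for c in best.children if signature(c) == best_sig]


def classify_field(text):
    """Phase 2 (hand-written stand-in for a trained decision tree):
    a single split on text length."""
    return "author" if len(text) < 20 else "content"


def extract_records(html):
    builder = TreeBuilder()
    builder.feed(html)
    records = []
    for record in find_record_area(builder.root):
        fields = {}
        for node in iter_nodes(record):
            if node.text:
                # Keep the first text found for each predicted label.
                fields.setdefault(classify_field(node.text), node.text)
        records.append(fields)
    return records


# Hypothetical forum page used only to exercise the sketch.
PAGE = """
<html><body>
<div id="nav"><a>Home</a></div>
<ul>
<li><span>alice</span><p>Has anyone tried the new template update yet?</p></li>
<li><span>bob</span><p>Yes, it broke my old extraction rules completely.</p></li>
</ul>
</body></html>
"""

if __name__ == "__main__":
    for rec in extract_records(PAGE):
        print(rec)
```

Separating the two phases means the record-area detection does not depend on any labeled data, while the per-record classifier is the only part that would need training. In a real system the length rule would presumably be replaced by a decision tree trained on a small set of labeled records, as the abstract indicates.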