Template-Independent Wrapper For Web Forums

Qi Zhang, Yang Shi, Xuanjing Huang, Lide Wu
DOI: https://doi.org/10.1145/1571941.1572132
2009-01-01
Abstract: This paper presents a novel approach to the task of extracting data from Web forums. Millions of users contribute rich information to Web forums every day, making them an important resource for many Web applications, such as product opinion retrieval and social network analysis. The novelty of the proposed algorithm is that it can not only extract the pure text but also distinguish between the original post and its replies. Experimental results on a large number of real Web forums indicate that the proposed algorithm can correctly extract data from websites with various styles in most cases.