A Method of Web Information Extraction Based on Classification Algorithm

WANG Jian-Wei, YANG Dong-Qing, GAO Jun, WANG Teng-Jiao
DOI: https://doi.org/10.3969/j.issn.1002-137X.2008.03.026
2008-01-01
Computer Science
Abstract: Most existing Web information extraction algorithms are based on HTML structure. Because the structure of HTML files changes frequently, wrappers must be updated accordingly, and updating a wrapper requires considerable domain knowledge. This paper proposes a new Web information extraction method based on a classification algorithm, which groups Web text by its HTML display attributes. Extraction is performed by classifying the Web text according to the values of these display attributes and then selecting the desired text. The algorithm is easy to implement and depends only weakly on the HTML structure. Experiments show that it performs well.
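The core idea, grouping text by how it is displayed rather than by where it sits in the tag tree, can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract names neither the classifier nor the display attributes, so the sketch substitutes plain grouping by an exact attribute signature for the classification step. The names `STYLE_TAGS`, `VOID_TAGS`, `DisplayAttributeGrouper`, and `group_by_display`, and the choice of tags treated as display attributes, are all assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumed feature set: the abstract does not say which tags or attributes
# count as "display attributes", so this list is purely illustrative.
STYLE_TAGS = {"b", "strong", "i", "em", "h1", "h2", "h3", "font"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class DisplayAttributeGrouper(HTMLParser):
    """Group text nodes by the display-related tags that enclose them."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # open (tag, feature) pairs
        self.groups = defaultdict(list)  # signature -> list of text nodes

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never enclose text
        feature = None
        if tag in STYLE_TAGS:
            feature = tag
            if tag == "font":
                a = dict(attrs)
                feature = "font:size={},color={}".format(
                    a.get("size"), a.get("color"))
        self.stack.append((tag, feature))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerates unclosed tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Structural tags (div, table, td, ...) carry no feature and are
            # ignored, so the signature reflects display attributes only.
            signature = tuple(f for _, f in self.stack if f is not None)
            self.groups[signature].append(text)


def group_by_display(html):
    """Return {display signature: [text, ...]} for an HTML string."""
    parser = DisplayAttributeGrouper()
    parser.feed(html)
    parser.close()
    return dict(parser.groups)
```

With this sketch, a price rendered as `<font color="red" size="3">9.99</font>` lands in the same group whether it sits inside a table cell or a paragraph, which is the kind of weak dependence on HTML structure the abstract describes. A real system would presumably train a classifier over such attribute values instead of requiring exact signature matches.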