DOM-Based Extraction of Topical Information from Web Pages

刘军,张净
DOI: https://doi.org/10.3969/j.issn.1000-386X.2010.05.056
2010-01-01
Abstract: With the development of the Internet, both the volume and the density of information on Web pages increase day by day. However, the topical information on a page is often not presented prominently, which makes it difficult to acquire. To address this, a new extraction algorithm is proposed. It constructs the DOM tree of the page, annotates the tree's nodes with display and semantic attributes (number of links, number of unlinked words, height, width, etc.), and applies a clustering rule to partition the tree. Finally, it prunes the DOM tree to remove redundant content and retain the topical information. Experiments show that this approach extracts the topical information accurately.
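The pipeline named in the abstract (build the DOM tree, annotate nodes, prune redundant subtrees) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not state their clustering rule or thresholds, so the pruning rule here (drop a block whose unlinked word count is small relative to its link count), the `min_words_per_link` value, the `BLOCK_TAGS` set, and all function names are assumptions chosen for illustration. Display attributes such as height and width are omitted because they require a rendering engine.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}
# Assumption: only these block-level elements are candidates for pruning.
BLOCK_TAGS = {"div", "ul", "ol", "li", "table", "tr", "td", "p",
              "nav", "aside", "header", "footer", "section"}


class Node:
    def __init__(self, tag, parent=None, text=""):
        self.tag = tag              # element name, or "#text" for a text node
        self.parent = parent
        self.text = text
        self.children = []
        self.link_count = 0         # number of <a> elements in the subtree
        self.unlinked_words = 0     # words in the subtree outside any <a>


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree; tolerant of unclosed or stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None:        # ignore closing tags with no open match
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(Node("#text", self.current, data))


def annotate(node, inside_link=False):
    """Post-order pass that fills link_count and unlinked_words."""
    if node.tag in SKIP_TAGS:
        return
    if node.tag == "#text":
        if not inside_link:
            node.unlinked_words = len(node.text.split())
        return
    if node.tag == "a":
        node.link_count = 1
        inside_link = True
    for child in node.children:
        annotate(child, inside_link)
        node.link_count += child.link_count
        node.unlinked_words += child.unlinked_words


def prune(node, min_words_per_link=5):
    """Removes script/style nodes and link-heavy blocks (assumed rule)."""
    kept = []
    for child in node.children:
        if child.tag in SKIP_TAGS:
            continue
        link_heavy = (child.tag in BLOCK_TAGS and child.link_count > 0 and
                      child.unlinked_words < min_words_per_link * child.link_count)
        if link_heavy:
            continue
        prune(child, min_words_per_link)
        kept.append(child)
    node.children = kept


def collect_text(node):
    if node.tag == "#text":
        return [" ".join(node.text.split())]
    parts = []
    for child in node.children:
        parts.extend(collect_text(child))
    return parts


def extract_topical_text(html, min_words_per_link=5):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    prune(builder.root, min_words_per_link)
    return " ".join(collect_text(builder.root))
```

Under these assumptions, a navigation list made only of links and a short footer with a link would be pruned, while a paragraph containing one inline link among many plain words would be kept. A faithful implementation would replace the single threshold with the paper's clustering rule and add the display attributes.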