A Web Site Mining Algorithm Using the Multiscale Tree Representation Model

YongHong Tian, TieJun Huang
2003-01-01
Abstract: Web site mining, which aims to automatically discover and classify topic-specific web sites on the World Wide Web, has attracted increasing attention given the exponential growth in both the amount and the diversity of web information. This paper describes a novel multiscale approach to web site mining that represents a web site as a multiscale site tree, extending existing tree representation models of web sites with an additional level of resolution: Document Object Model (DOM) nodes. The hidden Markov tree (HMT) is used to model the intrascale contextual dependencies in the multiscale site tree, and a context-based fusion algorithm combines the interscale context models with the HMT-based classifiers to refine the raw classification results. To further improve classification accuracy while reducing classification overhead, we introduce a two-stage text-based denoising procedure that removes "noise" information within web sites, and an entropy-based approach that dynamically prunes the site trees. Experiments show that our approach achieves, on average, a 16% improvement in classification accuracy and a 34.5% reduction in processing time over the baseline system.
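The abstract does not spell out the pruning criterion, so the following is only a minimal sketch of what entropy-based pruning of a site tree could look like, not the authors' algorithm. It assumes that each node carries a class posterior and that a node whose posterior entropy falls at or below a threshold is classified confidently enough that its subtree need not be processed. The names `Node`, `prune`, and `count`, the threshold value, and the example posteriors are all hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical site-tree node (site, page, or DOM node)."""
    name: str
    posterior: list  # assumed class probabilities, summing to 1
    children: list = field(default_factory=list)


def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def prune(node, threshold):
    """Return a pruned copy of the tree.

    Assumption: a node with entropy <= threshold is treated as confidently
    classified, so its children are dropped instead of being processed.
    """
    if entropy(node.posterior) <= threshold:
        return Node(node.name, node.posterior, [])
    kept = [prune(child, threshold) for child in node.children]
    return Node(node.name, node.posterior, kept)


def count(node):
    """Number of nodes in the tree rooted at `node`."""
    return 1 + sum(count(child) for child in node.children)


# Illustrative tree with made-up posteriors over two classes.
site = Node("site", [0.5, 0.5], [
    Node("page_a", [0.99, 0.01], [
        Node("dom_a1", [0.7, 0.3]),
        Node("dom_a2", [0.8, 0.2]),
    ]),
    Node("page_b", [0.6, 0.4], [
        Node("dom_b1", [0.9, 0.1]),
    ]),
])

pruned = prune(site, threshold=0.5)
print(count(site), count(pruned))
```

In this sketch `page_a` is nearly certain, so its two DOM children are dropped, while the uncertain `page_b` keeps its subtree. A real system would have to choose the threshold empirically and might well use a different pruning signal.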