Adaptive Approach for Content Extraction Based on Tag Density

SUN Hao, DONG Shou-bin
2009-01-01
Abstract: A novel approach for removing Web page noise is presented that exploits the differences in anchor-text density and tag density across different parts of a Web page. Based on fluctuations in the tag distribution of content regions, the algorithm adaptively learns relative thresholds to remove Web noise effectively. Experiments on content extraction and Chinese Web page classification indicate that the denoising approach is effective and feasible compared with other approaches.
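The abstract does not specify how densities are computed or how the thresholds are learned, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that a page is split into blocks at a small set of block-level tags, that tag density is the number of tags per text character, that anchor-text density is the share of a block's text that sits inside links, and that the "relative thresholds" are simply the page-wide means of those two densities. The names `BlockStats`, `BLOCK_TAGS`, `extract_content`, and `factor` are all hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate regions; the paper may segment differently.
BLOCK_TAGS = {"div", "p", "td", "li", "ul", "ol", "table", "tr"}


class BlockStats(HTMLParser):
    """Collects text, tag count, and anchor-text length for each block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.anchor_depth = 0
        self.current = {"text": "", "tags": 0, "anchor_chars": 0}

    def flush_block(self):
        # Keep only blocks that contain visible text, then start a new one.
        if self.current["text"].strip():
            self.blocks.append(self.current)
        self.current = {"text": "", "tags": 0, "anchor_chars": 0}

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush_block()
        self.current["tags"] += 1
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1
        if tag in BLOCK_TAGS:
            self.flush_block()

    def handle_data(self, data):
        self.current["text"] += data
        if self.anchor_depth > 0:
            self.current["anchor_chars"] += len(data.strip())


def extract_content(html, factor=1.0):
    """Return the text of blocks whose densities fall below page-relative thresholds.

    Assumption: each threshold is the page mean of the density times `factor`.
    """
    parser = BlockStats()
    parser.feed(html)
    parser.close()
    parser.flush_block()

    scored = []
    for block in parser.blocks:
        text = block["text"].strip()
        tag_density = block["tags"] / len(text)
        anchor_density = block["anchor_chars"] / len(text)
        scored.append((text, tag_density, anchor_density))
    if not scored:
        return []

    # Thresholds adapt to each page instead of being fixed constants.
    tag_threshold = factor * sum(s[1] for s in scored) / len(scored)
    anchor_threshold = factor * sum(s[2] for s in scored) / len(scored)

    return [
        text
        for text, tag_density, anchor_density in scored
        if tag_density <= tag_threshold and anchor_density <= anchor_threshold
    ]
```

Because the thresholds are derived from each page's own statistics, a link-heavy portal page and a sparse article page are judged on their own scales. A fuller implementation would also need to skip `script` and `style` content, which this sketch treats as ordinary text.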