Research and implementation of FFT-based extraction algorithm of webpage content main body

LI Lei, WANG Jin-lin, BAI He, HU Jing-jing
DOI: https://doi.org/10.3321/j.issn:1002-8331.2007.30.046
2007-01-01
Abstract: This paper studies algorithms for extracting the effective information from "Content-Dominated" Web pages, the pages that carry the major content of a Web site. Such a page consists of a long passage of main body content, with formatting information at the beginning and the end (e.g., navigation information, interaction information, JavaScript, and so on). The paper analyzes the structural characteristics of this kind of Web page and restates the problem as follows: given the HTML source file of a "Content-Dominated" Web page, find the best range for the main body content. It then presents an FFT-based algorithm for extracting the main body content of a Web page. By applying window segmentation, statistical theory, and the FFT, the method calculates a weight for every possible range and selects the best one as the solution. The experimental results show that the algorithm can efficiently extract the effective information of "Content-Dominated" Web pages.
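The abstract names the ingredients (window segmentation, per-range weights, the FFT) but gives no formulas, so the following Python sketch is only one plausible reading, not the authors' algorithm. Everything specific in it is an assumption made for illustration: the window size of 32 characters, the text-density score, the 3-point smoothing kernel, the 0.5 threshold, and all function names. The FFT is used here only to smooth the window scores by convolution, and the search over "every possible range" is replaced by a maximum-sum contiguous range scan.

```python
import cmath
import re


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def smooth(scores, kernel):
    """Convolve scores with kernel via the FFT, centred and trimmed to len(scores)."""
    size = 1
    while size < len(scores) + len(kernel) - 1:
        size *= 2
    a = fft([complex(x) for x in scores] + [0j] * (size - len(scores)))
    b = fft([complex(x) for x in kernel] + [0j] * (size - len(kernel)))
    product = fft([x * y for x, y in zip(a, b)], inverse=True)
    full = [v.real / size for v in product]  # unnormalised inverse, so divide by size
    offset = len(kernel) // 2
    return full[offset:offset + len(scores)]


def text_mask(html):
    """Return 1 for each visible non-space character, 0 for markup or whitespace."""
    # Assumption: script/style blocks count as markup; blank them out at equal length
    # so character offsets into the original source stay valid.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>",
                  lambda m: "<" + " " * (len(m.group()) - 2) + ">", html)
    mask, inside = [], False
    for ch in html:
        if ch == "<":
            inside = True
        mask.append(0 if inside or ch.isspace() else 1)
        if ch == ">":
            inside = False
    return mask


def best_range(scores, threshold):
    """Contiguous window range maximising sum(score - threshold), as (start, end)."""
    best_sum, best = float("-inf"), (0, 0)
    run_sum, run_start = 0.0, 0
    for i, score in enumerate(scores):
        if run_sum <= 0:
            run_sum, run_start = 0.0, i
        run_sum += score - threshold
        if run_sum > best_sum:
            best_sum, best = run_sum, (run_start, i + 1)
    return best


def extract_main_body(html, window=32, threshold=0.5):
    """Hypothetical pipeline: window scores -> FFT smoothing -> best range."""
    mask = text_mask(html)
    scores = [sum(mask[i:i + window]) / window for i in range(0, len(mask), window)]
    smoothed = smooth(scores, [1 / 3] * 3)
    start, end = best_range(smoothed, threshold)
    return html[start * window:end * window]
```

Calling `extract_main_body(source)` on the HTML source of a page returns the slice of the source covered by the selected windows. Because the range is chosen at window granularity, the result can clip a few characters of body text at either end; a smaller window tightens the boundary at the cost of noisier scores.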