Real-time Page Classification Oriented Algorithm on Topic Extraction

彭浩,王雅琳
DOI: https://doi.org/10.3969/j.issn.1006-2475.2008.07.003
2008-01-01
Abstract: Real-time Web page classification is an important issue for focused crawlers. Popular topic extraction algorithms do not satisfy the requirements of focused crawling applications. A topic extraction algorithm oriented toward real-time page classification, named HTTE-MTP (Html Tree based Topic Extraction on Multi-Topic Web Pages), is proposed. The paper presents an improved page model based on the HTML tag tree, in which the expanded node type is treated as an important weighting factor for keywords. The new algorithm performs much better than other methods. Finally, a new performance model is put forward.
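The abstract does not give the details of HTTE-MTP, so the following is only a minimal sketch of the general idea it describes: walking the HTML tag structure and letting the enclosing node type weight each keyword. The tag weights, the names `TAG_WEIGHTS`, `WeightedKeywordParser`, and `extract_topic_keywords`, and the simple word tokenizer are all assumptions made for illustration, not the authors' model.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights per enclosing tag; the paper's actual node types
# and weight values are not stated in the abstract.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "strong": 2.0, "b": 2.0, "a": 1.5}
VOID_TAGS = {"br", "img", "meta", "link", "hr", "input"}  # never closed, so not tracked
SKIPPED_TAGS = {"script", "style"}  # their text is not page content


class WeightedKeywordParser(HTMLParser):
    """Accumulates a score per word, weighted by the tags that enclose it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # stack of currently open tags
        self.scores = Counter()  # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.open_tags):
            return
        # Use the heaviest enclosing tag as the weight of this text run.
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags), default=1.0)
        for word in re.findall(r"[a-z]{3,}", data.lower()):
            self.scores[word] += weight


def extract_topic_keywords(html, top_k=5):
    """Return the top_k highest-scoring words of an HTML page."""
    parser = WeightedKeywordParser()
    parser.feed(html)
    parser.close()
    return [word for word, _ in parser.scores.most_common(top_k)]
```

In this sketch a word in a `title` or heading outweighs repeated occurrences in body text, which is one plausible reading of treating node type as a keyword weighting factor. A multi-topic page model such as the one the paper proposes would presumably also segment the tree into topic blocks, which is not attempted here.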