Learning outliers to refine a corpus for Chinese webpage categorization

Dingsheng Luo, Xinhao Wang, Xihong Wu, Huisheng Chi
DOI: https://doi.org/10.1007/11539087_19
2005-01-01
Abstract: Webpage categorization has become an important topic in recent years. Text is usually the main content of a webpage, so automatic text categorization (ATC) is the key technique for the task. For Chinese text categorization, and likewise for Chinese webpage categorization, one of the basic and urgent problems is the construction of a good benchmark corpus. In this study, a machine learning approach is presented to refine a corpus for Chinese webpage categorization, in which the AdaBoost algorithm is adopted to identify outliers in the corpus. The standard k-nearest-neighbor (kNN) algorithm under a vector space model (VSM) is adopted to construct the webpage categorization system. Simulation results, together with manual investigation of the identified outliers, indicate that the presented method works well.
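To make the kNN-under-VSM classifier mentioned in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes raw term-frequency vectors, cosine similarity, and unweighted majority voting, none of which the abstract specifies. The function names (`tf_vector`, `cosine`, `knn_classify`) and the toy corpus are hypothetical, and English tokens stand in for segmented Chinese words.

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Map a list of tokens to a sparse term-frequency vector (assumed weighting)."""
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(query_tokens, corpus, k=3):
    """Return the majority label among the k documents most similar to the query.

    corpus is a list of (tokens, label) pairs.
    """
    q = tf_vector(query_tokens)
    scored = sorted(
        ((cosine(q, tf_vector(tokens)), label) for tokens, label in corpus),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus, for illustration only.
TOY_CORPUS = [
    (["football", "match", "goal"], "sports"),
    (["team", "goal", "league"], "sports"),
    (["stock", "market", "shares"], "finance"),
    (["bank", "market", "interest"], "finance"),
]

if __name__ == "__main__":
    print(knn_classify(["goal", "league", "match"], TOY_CORPUS, k=3))
```

In a classifier of this kind, a mislabeled training page can directly mislead the vote of its neighbors, which is one reason refining the corpus by removing outliers matters for such a system.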