An Efficient Centroid Based Chinese Web

Liu Hui, Peng Ran, Ye Shaozhi, Li Xing
2004-01-01
Abstract: In this paper, we present an efficient centroid-based Chinese web page classifier that achieves satisfactory performance on real data and runs very fast in practical use. Besides its clear design, the classifier has several novel features: Chinese word segmentation and noise filtering in the preprocessing module; a combined χ² statistics feature selection method; and adaptive factors that improve categorization performance. Another advantage of the system is its optimized implementation. Finally, we report experimental results on a corpus from Peking University, China, and discuss them.
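For readers unfamiliar with the general technique, a centroid-based classifier can be sketched as follows. This is a minimal illustration under stated assumptions, not the system described in the abstract: it assumes documents are already segmented into tokens, uses plain term-frequency weights, and omits the noise filtering, χ² feature selection, and adaptive factors. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def unit_vector(tokens):
    """Term-frequency vector of a token list, scaled to unit length."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def train_centroids(labeled_docs):
    """Average the unit vectors of each class into one centroid per class."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter()
    for tokens, label in labeled_docs:
        sizes[label] += 1
        for term, weight in unit_vector(tokens).items():
            sums[label][term] += weight
    return {
        label: {t: w / sizes[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(tokens, centroids):
    """Return the label whose centroid is most similar to the document."""
    doc = unit_vector(tokens)
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))


# Toy, made-up training data; real input would be segmented Chinese words.
TRAIN = [
    (["football", "match", "team"], "sports"),
    (["team", "champion", "match"], "sports"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "rate", "market"], "finance"),
]
centroids = train_centroids(TRAIN)
print(classify(["match", "champion"], centroids))
```

Training reduces each class to a single vector, so classifying a page costs one similarity computation per class, which is one reason centroid methods are fast in practice.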