PCCS: A FAST CLUSTERING AND CLASSIFICATION METHOD FOR WEB DOCUMENT

Ai-Hua WANG, Ming ZHANG, Dong-Qing YANG, Shi-Wei TANG
2001-01-01
Journal of Computer Research and Development
Abstract: Users of Web search engines are often forced to sift through the long ordered list of document “snippets” returned by the engines. This paper puts forward an interactive partial clustering method, PCCS. First, PCCS uses a clustering algorithm to cluster part of the documents and finds highly accurate cluster digests (partial clusters); it then gets user feedback to merge and correct these digests, and finally uses the Naïve Bayes classification algorithm to classify the remaining documents. The incremental classification model can be saved and reused to help classify future Web query results. To improve the efficiency of the method, a hybrid feature selection scheme is also proposed to reduce the dimension of the document vectors: entropy-based feature selection and classification-model-based feature selection. The method is shown to be faster than other algorithms. PCCS helps users navigate the results of a query more quickly and efficiently, at a more topical level, than having to examine each document's text separately.
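
The two-stage idea in the abstract (cluster part of the result set, then classify the remainder with Naïve Bayes) can be illustrated with a minimal sketch. This is not the authors' implementation: the greedy single-pass clustering, the cosine threshold of 0.3, the Laplace smoothing, and all function names (`partial_cluster`, `train_naive_bayes`, `classify`) are assumptions chosen for illustration, and the user-feedback and feature-selection steps are omitted.

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def partial_cluster(docs, threshold=0.3):
    """Hypothetical stand-in for the clustering stage: each document joins
    the most similar existing centroid, or opens a new cluster."""
    centroids, labels = [], []
    for tokens in docs:
        vec = Counter(tokens)
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else -1
        if best >= 0 and sims[best] >= threshold:
            centroids[best].update(vec)      # add term counts to the centroid
            labels.append(best)
        else:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
    return labels

def train_naive_bayes(docs, labels):
    """Collect the counts needed by a multinomial Naive Bayes classifier."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab

def classify(model, tokens):
    """Return the cluster label with the highest log-posterior
    (Laplace smoothing; terms outside the vocabulary are ignored)."""
    class_docs, term_counts, vocab = model
    total = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n in class_docs.items():
        denom = sum(term_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for t in tokens:
            if t in vocab:
                score += math.log((term_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy query results (made-up snippets, already tokenized).
snippets = [
    ["jaguar", "car", "engine", "speed"],
    ["jaguar", "cat", "jungle", "prey"],
    ["car", "engine", "dealer", "price"],
    ["cat", "jungle", "habitat", "prey"],
    ["jaguar", "car", "price", "dealer"],
    ["jaguar", "jungle", "habitat"],
]
seed, rest = snippets[:4], snippets[4:]       # cluster only part of the results
seed_labels = partial_cluster(seed)           # cluster ids for the seed part
model = train_naive_bayes(seed, seed_labels)  # model could be saved and reused
rest_labels = [classify(model, d) for d in rest]
print(seed_labels, rest_labels)
```

Clustering only the seed portion keeps the expensive pairwise similarity work small, while the remaining documents are assigned in a single pass over their terms; in this sketch the trained counts play the role of the saved classification model that could be applied to later query results.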