A New Suffix Tree Similarity Measure for Document Clustering

Hung Chim, Xiaotie Deng
DOI: https://doi.org/10.1145/1242572.1242590
2007-01-01
Abstract: Document clustering methods mostly rely on single-word term analysis based on the Vector Space Model. To achieve more accurate document clustering, current research also considers more informative features, including phrases and their weights. In this seminar, we present a new phrase-based similarity measure that computes the pairwise similarity of text documents based on the Suffix Tree Document model. By applying the new suffix tree similarity measure in the Group-average Hierarchical Agglomerative Clustering (GHAC) algorithm, we developed a new suffix tree document clustering algorithm (NSTC). Experimental results on two standard document clustering benchmark corpora, OHSUMED and RCV1, indicate that the new algorithm is very effective. Compared with the traditional word-term tf-idf similarity measure in the same GHAC algorithm, NSTC achieved a 51% improvement in average F-measure score.
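To make the comparison baseline concrete, the following is a minimal sketch of word-term tf-idf cosine similarity fed into a group-average agglomerative clustering loop. It does not implement the suffix tree similarity measure itself, which the abstract does not specify. The function names, the tf-idf weighting variant, and the use of average cross-cluster similarity as the merge criterion are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict) per tokenized document.

    Assumed weighting: (1 + log tf) * log(N / df), one common variant.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (1 + math.log(c)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_average_hac(sim, k):
    """Merge clusters until k remain, always joining the pair with the
    highest average pairwise similarity between their members."""
    clusters = [[i] for i in range(len(sim))]

    def avg(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters


def cluster_documents(docs, k):
    """Hypothetical helper: tf-idf vectors -> similarity matrix -> clusters."""
    vecs = tfidf_vectors(docs)
    sim = [[cosine(u, v) for v in vecs] for u in vecs]
    return group_average_hac(sim, k)


if __name__ == "__main__":
    toy_docs = [
        "suffix tree clustering".split(),
        "suffix tree similarity".split(),
        "stock market news".split(),
        "stock market report".split(),
    ]
    print(cluster_documents(toy_docs, 2))
```

Under this sketch, swapping in a phrase-based measure would only change how the `sim` matrix is filled; the clustering loop stays the same, which mirrors the abstract's setup of comparing two similarity measures inside the same GHAC algorithm.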