Research of Adaptive Text Clustering Based on the Statistics of the Datasets

王纵虎,刘志镜,陈东辉
DOI: https://doi.org/10.15961/j.jsuese.2012.01.017
2012-01-01
Abstract: Due to the high dimensionality and sparseness of text data, traditional clustering algorithms may perform unsatisfactorily on text. In the proposed algorithm, the largest dense region that has a small coverage rate with the already partitioned clusters was selected, step by step, as an initial cluster centroid set by learning the similarity information between the partitioned and remaining sets. After the predetermined number of initial cluster centroid sets had been generated, the remaining documents were assigned to their nearest clusters. In this way, the sensitivity of the clustering algorithm to the initial cluster centroids was reduced. The threshold values used in the algorithm were calculated dynamically from statistics of the dataset during clustering, which avoids the blind choice of threshold parameters by experience or experiment that most clustering algorithms require. Experiments on several Chinese and English datasets showed that this algorithm performs better than the clustering algorithms in CLUTO.
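The abstract does not give the exact formulas, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes documents are sparse term-weight dictionaries compared by cosine similarity, that the density threshold is the mean pairwise similarity of the dataset, and that a hypothetical `max_overlap` parameter bounds the "coverage rate" of a candidate region with the documents already partitioned. The function names `cosine`, `centroid`, and `cluster` are illustrative.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(docs, members):
    """Mean term-weight vector of the documents with the given indices."""
    c = {}
    for i in members:
        for t, w in docs[i].items():
            c[t] = c.get(t, 0.0) + w / len(members)
    return c


def cluster(docs, k, max_overlap=0.5):
    """Pick up to k dense, weakly overlapping seed regions, then assign
    every document to its most similar seed centroid."""
    n = len(docs)
    sim = [[cosine(docs[i], docs[j]) for j in range(n)] for i in range(n)]

    # Assumption: the threshold is a dataset statistic, here the mean
    # pairwise similarity, rather than a hand-tuned constant.
    pairs = list(combinations(range(n), 2))
    threshold = sum(sim[i][j] for i, j in pairs) / len(pairs) if pairs else 0.0

    # Dense region of a document: itself plus all neighbours whose
    # similarity reaches the threshold.
    regions = [{i} | {j for j in range(n) if sim[i][j] >= threshold}
               for i in range(n)]

    covered, seeds = set(), []
    for _ in range(k):
        best = None
        for i in range(n):
            if i in covered:
                continue
            # Fraction of this region already claimed by earlier seeds.
            overlap = len(regions[i] & covered) / len(regions[i])
            if overlap <= max_overlap and (
                best is None or len(regions[i]) > len(regions[best])
            ):
                best = i
        if best is None:  # no admissible region left
            break
        seeds.append(regions[best] - covered)
        covered |= regions[best]

    centroids = [centroid(docs, s) for s in seeds]
    # Assign each document to the nearest (most similar) centroid.
    return [max(range(len(centroids)), key=lambda c: cosine(doc, centroids[c]))
            for doc in docs]
```

Calling `cluster(docs, k)` returns one cluster index per document. Because the seeds come from dense regions rather than random picks, repeated runs on the same data give the same partition, which illustrates the reduced sensitivity to initialization that the abstract describes.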