Abstract: Among the various document clustering algorithms, K-means is a classical one. However, it is a greedy algorithm that is sensitive to the choice of cluster centers and easily converges to a local optimum. Because the genetic algorithm (GA) is a globally convergent algorithm with which the best cluster centers can be found easily, this paper presents a new dynamic document clustering method based on GA. Traditional document clustering methods do not take the partial similarity of keywords into account, so the document similarity matrix is sparse, which to some extent reduces the accuracy of document similarity. This paper gives new formulas, improved from the traditional method, that take the partial similarity of keywords into account and thus improve the accuracy of the similarity calculation. In this algorithm, each individual is represented by a matrix consisting of K cluster centers, and all individuals are encoded as floating-point numbers. The fitness function is the reciprocal of the sum of the mean square deviations of intra-class distance plus one. The smaller an individual's fitness, the lower the probability that it is selected to enter the next generation. The optimal cluster centers are finally found by iterating selection, crossover, mutation and so on. The simulation results show that the accuracy of the resulting classification can exceed 98 percent and that the algorithm outperforms K-means. Thus, the algorithm presented in this paper is an effective method for document clustering.
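The GA loop described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the improved similarity formulas or the exact operators, so squared Euclidean distance, roulette-wheel selection, single-point crossover on whole centers, and Gaussian mutation are all assumptions here, as are the function names and default parameters. The fitness is read as 1 / (1 + sum over clusters of the mean squared distance to the cluster center), which is one plausible interpretation of the wording.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance (assumed; the paper uses its own similarity)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fitness(centers, docs):
    """1 / (1 + sum of per-cluster mean squared distances), an assumed reading."""
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for doc in docs:
        dists = [squared_distance(doc, c) for c in centers]
        k = dists.index(min(dists))  # assign the document to its nearest center
        sums[k] += dists[k]
        counts[k] += 1
    total = sum(s / n for s, n in zip(sums, counts) if n > 0)
    return 1.0 / (1.0 + total)


def select(population, scores, rng):
    """Roulette-wheel selection: lower fitness means lower selection probability."""
    return rng.choices(population, weights=scores, k=len(population))


def crossover(a, b, rng):
    """Single-point crossover that swaps whole cluster centers (assumed operator)."""
    if len(a) < 2:
        return a, b
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(individual, rng, rate=0.1, scale=0.05):
    """Perturb each floating-point gene with probability `rate` (assumed operator)."""
    return [
        [x + rng.gauss(0.0, scale) if rng.random() < rate else x for x in center]
        for center in individual
    ]


def evolve(docs, k, generations=50, pop_size=20, seed=0):
    """Return the best individual (a list of k centers) seen over all generations."""
    rng = random.Random(seed)
    # Each individual is a K x d matrix of floats, seeded from random documents.
    population = [[list(d) for d in rng.sample(docs, k)] for _ in range(pop_size)]
    best = max(population, key=lambda ind: fitness(ind, docs))
    for _ in range(generations):
        scores = [fitness(ind, docs) for ind in population]
        parents = select(population, scores, rng)
        children = []
        for i in range(0, pop_size - 1, 2):
            children.extend(crossover(parents[i], parents[i + 1], rng))
        if len(children) < pop_size:  # odd population size: carry one parent over
            children.append(parents[-1])
        population = [mutate(child, rng) for child in children]
        candidate = max(population, key=lambda ind: fitness(ind, docs))
        if fitness(candidate, docs) > fitness(best, docs):
            best = candidate
    return best
```

Calling `evolve(docs, k)` on a list of equal-length float vectors returns K centers; in practice the vectors would be document feature vectors, and the distance would be replaced by the paper's keyword-aware similarity.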

Representing Document As Dependency Graph for Document Clustering

Document Clustering Using Locality Preserving Indexing

Parallel Topic Model and Its Application on Document Clustering

Hierarchical Clustering Algorithms for Document Datasets

A New Suffix Tree Similarity Measure for Document Clustering

A Clustering Algorithm for Short Documents Based On Concept Similarity

Document Clustering Based on Word Sense Cluster

Semantic smoothing of document models for agglomerative clustering

Design and simulation of a document clustering algorithm based on genetic algorithm

Efficient Phrase-Based Document Similarity for Clustering

Co-Clustering With Manifold And Double Sparse Representation

Towards effective document clustering: A constrained K-means based approach

Concept chain based text clustering

Application of Genetic Algorithm in Document Clustering

K-means Document Clustering Based on Latent Dirichlet Allocation

Hybrid Data Clustering Based on Dependency Structure and Gibbs Sampling

Semantic Smoothing for Model-based Document Clustering

Document Clustering Based on Semantic Smoothing Approach

Medical Document Clustering Using Ontology-Based Term Similarity Measures

A comparison of two suffix tree-based document clustering algorithms

Document Representation with Statistical Word Senses in Cross-Lingual Document Clustering