Abstract: Among document clustering algorithms, K-means is a classical one. However, it is a greedy algorithm that is sensitive to the choice of initial cluster centers and easily becomes trapped in a local optimum. Because the genetic algorithm (GA) converges globally and can readily find the best cluster centers, this paper presents a new dynamic document clustering method based on GA. Traditional document clustering methods do not take the partial similarity of keywords into account, so the document similarity matrix is sparse, which reduces the accuracy of document similarity to some extent. This paper gives new formulas, improved from the traditional method, that take the partial similarity of keywords into account and thereby improve the accuracy of the similarity calculation. In the algorithm, each individual is represented by a matrix consisting of K cluster centers, and all individuals are encoded as floating-point numbers. The reciprocal of the sum of the mean square deviations of the intra-class distances plus one is adopted as the fitness function. The smaller an individual's fitness, the lower the probability that it is selected to enter the next generation. The optimal cluster centers are finally found by iterating operations such as selection, crossover, and mutation. Simulation results show that the classification accuracy can exceed 98 percent and that the algorithm outperforms K-means. Thus, the algorithm presented in this paper is an effective method for document clustering.
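
The fitness and selection steps described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the fitness is read as 1 / (1 + sum over clusters of the mean squared intra-cluster distance), distance is plain squared Euclidean distance rather than the paper's keyword-similarity formulas, and selection is fitness-proportional (roulette wheel). The names `assign`, `fitness`, and `roulette_select` are hypothetical.

```python
import random


def sq_dist(p, c):
    """Squared Euclidean distance (an assumption; the paper uses its own similarity)."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def assign(points, centers):
    """Return, for each point, the index of its nearest cluster center."""
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points]


def fitness(points, centers):
    """Assumed reading of the abstract: 1 / (1 + sum of per-cluster mean squared distance).

    `centers` plays the role of one individual: a K x d matrix of floats.
    """
    labels = assign(points, centers)
    total = 0.0
    for k, center in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:  # empty clusters contribute nothing in this sketch
            total += sum(sq_dist(p, center) for p in members) / len(members)
    return 1.0 / (1.0 + total)


def roulette_select(population, scores, rng):
    """Fitness-proportional selection: lower fitness means lower selection probability."""
    return rng.choices(population, weights=scores, k=len(population))


if __name__ == "__main__":
    rng = random.Random(0)
    points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    population = [
        [(0.0, 0.5), (10.0, 10.5)],  # centers close to the true groups
        [(5.0, 5.0), (6.0, 6.0)],    # poorly placed centers
    ]
    scores = [fitness(points, ind) for ind in population]
    next_generation = roulette_select(population, scores, rng)
    print(scores, len(next_generation))
```

With this reading, tighter clusters give a fitness closer to 1, so well-placed centers are more likely to survive into the next generation; crossover and mutation on the floating-point center matrices are omitted here.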

An Improved K-means Algorithm for Document Clustering

An Improved K-Means Algorithm for Documents Clustering

K-Means Algorithm for Document Clustering with Optimal Initial Values

Weighted K-Means Algorithm Based Text Clustering

An Improved K-Means Algorithm of High-Dimensional Data

Algorithm and Experiment Research of Textual Document Clustering Based on Improved K-means

An Improved K-means Algorithm Based on Multiple Clustering and Density

Towards effective document clustering: A constrained K-means based approach

An Improved Initial Cluster Centers Selection Algorithm for K-means Based on Features Correlative Degree

Variant of K-means Algorithm for Document Clustering: Optimization Initial Centers

An Improved K-Means Clustering Algorithm Based on Feature Weighting

A New Partitioning Based Algorithm for Document Clustering

Design and simulation of a document clustering algorithm based on genetic algorithm

An adapted algorithm of choosing initial values for k-means document clustering

K-means Document Clustering Based on Latent Dirichlet Allocation

A Novel Rough Semi-Supervised K-Means Algorithm for Text Clustering

Clustering Algorithm on Block Division of Documents

High-Efficiency Text Clustering Algorithm Based on Semantic Distance

An improved clustering algorithm for web document

Design and Implementation of an Improved K-Means Clustering Algorithm

New K-Means Clustering Center Select Algorithm