Automatic k-means clustering algorithm based on data sampling

Jun-feng LUO,Dan-dan HONG
DOI: https://doi.org/10.3969/j.issn.1004-373X.2014.08.006
2014-01-01
Abstract:In order to solve the problems of the traditional k-means algorithm in which k values needs to be input and the the ultra-large-scale data set needs to be clustered,on the basis of previous studies,the information entropy is brought in when distance is calculated,and data sampling method is adopted,that is,the optimal samples are extracted from the ultra-large-scale data set to conduct sample clustering. Based on the sample data clustering,the validity indexes are verified and k value re-quired by the algorithm is obtained. The distance formula for information entropy is brought in to carry out clustering on the ultra-large data set. Experiments show that the algorithm can overcome the defects of traditional k-means algorithm for k value input, and can automatically obtain k values of ultra-large data clustering under the premise of not affecting the quality of the early da-ta clustering.
What problem does this paper attempt to address?