Abstract: To improve the accuracy and stability of the K-means algorithm and to address the problem of determining the most appropriate number of clusters K and the best initial seeds, an improved K-means algorithm based on density Canopy is proposed. First, the density of the samples in the data set, the average sample distance within clusters, and the distance between clusters are calculated; the sample with maximum density is chosen as the first cluster center, and its density cluster is removed from the data set. The weight product is defined as the product of the sample density, the reciprocal of the average distance between the samples in the cluster, and the distance between the clusters; the remaining initial seeds are determined by the maximum weight product in the remaining data set, until the data set is empty. The density Canopy serves as the preprocessing step for K-means, and its result supplies the number of clusters and the initial cluster centers for the K-means algorithm. Finally, the new algorithm is tested on several well-known data sets from the UCI Machine Learning Repository and on simulated data sets with different proportions of noise samples. The simulation results show that the improved K-means algorithm based on density Canopy achieves better clustering results and is less sensitive to noisy data than the traditional K-means algorithm, the Canopy-based K-means algorithm, the semi-supervised K-means++ algorithm, and the K-means-u* algorithm. Compared with these four algorithms, its clustering accuracy improves on average by 30.7%, 6.1%, 5.3%, and 3.7%, respectively, on the UCI data sets, and by 44.3%, 3.6%, 9.6%, and 8.9%, respectively, on the simulated data sets with noise. As the noise ratio increases, the noise immunity of the new algorithm becomes more pronounced: when the noise ratio reaches 30%, accuracy improves by 50% and 6% compared to the traditional K-means algorithm and the Canopy-based K-means algorithm, respectively.
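
The seed-selection procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, because the abstract does not give the exact formulas. Several details are assumptions made only for illustration: the function name `density_canopy_seeds` is hypothetical, density is taken as the number of remaining samples within a radius, the default radius is the mean pairwise distance, the distance between clusters is taken as the distance to the nearest already-chosen center, and a zero average distance is guarded with a fallback value of 1.

```python
import math
from itertools import combinations


def density_canopy_seeds(points, radius=None):
    """Return a list of initial centers; its length is the suggested K.

    Illustrative sketch only: the density, radius, and distance definitions
    below are assumptions, not the formulas from the paper.
    """
    pts = [tuple(float(x) for x in p) for p in points]
    n = len(pts)
    if n == 0:
        return []
    if radius is None:
        # Assumption: neighborhood radius = mean pairwise distance.
        pairs = list(combinations(range(n), 2))
        radius = (
            sum(math.dist(pts[i], pts[j]) for i, j in pairs) / len(pairs)
            if pairs else 0.0
        )

    remaining = set(range(n))
    centers = []
    while remaining:
        best = None  # (score, index, members of its density cluster)
        for i in sorted(remaining):
            # Density cluster of i: remaining samples within the radius
            # (always contains i itself, so the loop terminates).
            members = [j for j in remaining
                       if math.dist(pts[i], pts[j]) <= radius]
            rho = len(members)
            if not centers:
                # First center: the sample with maximum density.
                score = float(rho)
            else:
                # Weight product: density * 1/(average in-cluster distance)
                # * distance to the nearest already-chosen center.
                mean_d = sum(math.dist(pts[i], pts[j]) for j in members) / rho
                compactness = 1.0 / mean_d if mean_d > 0 else 1.0  # assumed guard
                separation = min(math.dist(pts[i], c) for c in centers)
                score = rho * compactness * separation
            if best is None or score > best[0]:
                best = (score, i, members)
        _, idx, members = best
        centers.append(pts[idx])
        remaining -= set(members)  # remove the chosen density cluster
    return centers


# Two well-separated groups of samples.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
        (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
seeds = density_canopy_seeds(data)
k = len(seeds)  # number of clusters to pass to K-means
```

In this sketch, `seeds` and `k` would then be handed to any standard K-means routine as the initial cluster centers and the number of clusters, mirroring the preprocessing role that the abstract assigns to the density Canopy.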

Improved k-means clustering method for codebook generation

VQ Codebook Design Using Modified K-means Algorithm with Feature Classification and Grouping Based Initialization

Improved K-means Algorithm Using Initialization Technique Based on Edge-Mean Grid for Image Vector Quantizer Design

Fast Codebook Design Method for Image Vector Quantisation

Efficient and Effective Visual Codebook Generation Using Additive Kernels

Beyond the Euclidean Distance: Creating Effective Visual Codebooks Using the Histogram Intersection Kernel

An Incremental Clustering Based Codebook Construction in Video Copy Detection

A Method Of Optimizing Codebook Based On Codeword Use Frequency

Metric Learning in Codebook Generation of Bag-of-Words for Person Re-identification

Edge and Contrast Classified K-means Algorithm for Image Vector Quantizer Design

A Novel Effective Distance Measure and a Relevant Algorithm for Optimizing the Initial Cluster Centroids of K-means

Computer Image Content Retrieval Considering K-Means Clustering Algorithm

An improved K-means algorithm based on multiple feature points

CPI-model-based analysis of sparse k-means clustering algorithms

An Improved Global K-means Clustering Algorithm

Improved K-means algorithm based on density Canopy

Center-adaptive weighted binary K-means for image clustering

An Improved K-means Algorithm Based on Mapreduce and Grid

K*-Means: An Efficient Clustering Algorithm with Adaptive Decision Boundaries

An Improved K-Means Clustering Algorithm Based on Spectral Method

An Improved K-Nearest Neighbor Algorithm for Text Categorization