Abstract: The traditional K-means clustering algorithm has two well-known weaknesses: the number of clusters K is hard to choose in advance, and the initial cluster centers are selected at random, which makes the clustering results unstable. The algorithm is also sensitive to noise points. To address these problems, we propose an improved K-means algorithm. The data space is divided into equal-sized grid cells, each data point is assigned to a cell according to its attribute values, and the number of points in each cell is counted. The M (M > K) cells containing the most data points are selected, and the central point of each is computed. These M central points are used as input data, and the value of K is determined from the result of clustering them. Among the M points, the K points farthest from one another are chosen as the initial cluster centers for K-means; the central point of the most populated of the M cells must be among these K. If the number of data points in a cell is below a threshold, the points in that cell are treated as noise and removed. So that the improved algorithm can handle large data sets, we parallelize it with the MapReduce framework. Theoretical analysis and experimental results show that, compared with the traditional K-means algorithm, the improved algorithm produces higher-quality clusters, needs fewer iterations, and is more stable. The parallelized algorithm processes data efficiently and shows good scalability and speedup.
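The grid-based initialization described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the function name `grid_initial_centers`, the parameter names, the use of the cell mean as the "central point", and the greedy farthest-point selection are all assumptions made for illustration. The abstract does not say how K is derived from the M central points, so this sketch takes `k` as an input, and it omits the MapReduce parallelization entirely.

```python
import math
from collections import defaultdict


def grid_initial_centers(points, cell_size, m, k, noise_threshold):
    """Hypothetical helper: return (initial_centers, kept_points)."""
    # Assign each point to a grid cell by flooring its scaled coordinates.
    cells = defaultdict(list)
    for p in points:
        key = tuple(math.floor(x / cell_size) for x in p)
        cells[key].append(p)

    # Cells with fewer points than the threshold are treated as noise.
    dense = {c: pts for c, pts in cells.items() if len(pts) >= noise_threshold}
    kept = [p for pts in dense.values() for p in pts]

    # Take the M most populated cells and use each cell's mean as its
    # central point (assumption: "central point" means the mean).
    top = sorted(dense.values(), key=len, reverse=True)[:m]
    candidates = [tuple(sum(col) / len(pts) for col in zip(*pts)) for pts in top]
    if not candidates:
        return [], kept

    # Seed with the densest cell's center, then greedily add the candidate
    # whose nearest chosen center is farthest away. This is a common
    # approximation of "K points farthest from each other".
    centers = [candidates[0]]
    while len(centers) < min(k, len(candidates)):
        nxt = max(
            (c for c in candidates if c not in centers),
            key=lambda c: min(math.dist(c, s) for s in centers),
        )
        centers.append(nxt)
    return centers, kept
```

The returned centers would then seed an ordinary K-means run over `kept`. The cell size, M, and the noise threshold are tuning parameters whose values the abstract does not specify.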
