Abstract: In the traditional K-means clustering algorithm, the number of clusters K is difficult to choose in advance, and the initial cluster centers are selected at random, which makes the clustering results very unstable. The algorithm is also susceptible to noise points. To address these problems, we improve the traditional K-means algorithm. The improved method divides the data space into grid cells of equal size, assigns each data point to the corresponding cell according to its attribute values, and counts the number of data points in each cell. The M (M > K) cells containing the most data points are selected, and the central point of each is calculated. These M central points are used as input data, and the value of K is determined from the clustering results. Among the M points, the K points farthest from each other are chosen as the initial cluster centers for the K-means algorithm; the central point of the most populated of the M cells must be among the K. If the number of data points in a cell is below a threshold, those points are treated as noise and removed. So that the improved algorithm can handle large data sets, we parallelize it using the MapReduce framework. Theoretical analysis and experimental results show that, compared with the traditional K-means clustering algorithm, the improved algorithm produces higher-quality results, needs fewer iterations, and is more stable. The parallelized algorithm processes data efficiently and has good scalability and speedup.
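
The grid-based initialization described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation, and several details are assumptions: the function name `grid_initial_centers`, the parameters `cell_size` and `min_points`, the use of the mean of a cell's points as its central point, and a greedy farthest-first selection as one possible reading of "K points farthest from each other". The step that determines K from the M central points and the MapReduce parallelization are not covered.

```python
import math
from collections import defaultdict


def grid_initial_centers(points, cell_size, m, k, min_points=2):
    """Pick k initial centers from the m most populated grid cells.

    Hypothetical helper: the names and the greedy selection rule are
    assumptions, not taken from the paper.
    """
    if k >= m:
        raise ValueError("expected M > K")

    # Assign each point to a grid cell by flooring its scaled coordinates.
    cells = defaultdict(list)
    for p in points:
        key = tuple(math.floor(x / cell_size) for x in p)
        cells[key].append(p)

    # Cells with fewer points than the threshold are treated as noise.
    dense = [pts for pts in cells.values() if len(pts) >= min_points]

    # Keep the m most populated cells, densest first.
    dense.sort(key=len, reverse=True)
    top = dense[:m]
    if len(top) < k:
        raise ValueError("fewer than k cells remain after noise removal")

    # Central point of a cell: mean of the points in it (assumption).
    centers = [tuple(sum(c) / len(pts) for c in zip(*pts)) for pts in top]

    # Always include the densest cell's center, then repeatedly add the
    # candidate whose nearest already-chosen center is farthest away.
    chosen = [centers[0]]
    remaining = centers[1:]
    while len(chosen) < k:
        best = max(remaining,
                   key=lambda c: min(math.dist(c, s) for s in chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

The returned points could then be passed as the starting centers to any standard K-means routine. The greedy rule is a heuristic: it does not guarantee the globally most spread-out set of K points.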
