Abstract: Cluster analysis is one of the most fundamental methods in data mining and is widely used in economics, the social sciences, and computer science. However, with the rapid development of Internet technology, the volume of data handled by web applications has grown rapidly, and traditional cluster analysis methods now face technical challenges. Extracting useful information from large amounts of data quickly and efficiently is an urgent problem in many industrial fields. With the continuous development of cloud computing technology, large amounts of data can now be processed quickly and efficiently. Hadoop is an open-source distributed cloud computing platform with HDFS (Hadoop Distributed File System) and MapReduce at its core. HDFS provides massive data storage, while MapReduce provides a programming model for parallel processing. Compared with traditional parallel programming models, MapReduce has data partitioning, task scheduling, and parallel execution built in, so users can develop distributed applications without understanding the underlying details of distributed systems, which simplifies the design of parallel programs. The K-means algorithm is a typical clustering method that is widely used in industry, but its number of iterations increases significantly as data volume grows, which reduces computational efficiency. To better support cluster analysis of large-scale data, this paper first implements a parallel K-means algorithm based on MapReduce on the Hadoop platform, and then improves the K-means algorithm to address the blindness of randomly selecting the initial cluster centers and the resulting tendency to fall into a local optimum.
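The abstract does not give the paper's implementation, so the following is only a minimal single-process sketch of how one K-means iteration is commonly expressed as map and reduce steps. It is plain Python rather than Hadoop code, and all names (`map_phase`, `sum_by_key`, `kmeans_mapreduce`) are hypothetical. The map step assigns each point to its nearest center, a combiner-style step sums points per center within each data split, and the reduce step merges those partial sums into new centers. The initial centers are supplied by the caller; the paper's improved initialization is not described in the abstract and is not reproduced here.

```python
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_phase(points: Iterable[Point], centers: List[Point]) -> Iterator:
    """Map: emit (index of nearest center, (point, 1)) for every point."""
    for p in points:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(p, centers[i]))
        yield nearest, (p, 1)


def sum_by_key(pairs: Iterable) -> Dict[int, Tuple[List[float], int]]:
    """Combine/reduce: add up the vectors and counts emitted for each key."""
    sums: Dict[int, Tuple[List[float], int]] = {}
    for key, (vec, count) in pairs:
        acc, n = sums.get(key, ([0.0] * len(vec), 0))
        sums[key] = ([a + v for a, v in zip(acc, vec)], n + count)
    return sums


def kmeans_mapreduce(splits, centers, max_iter=20, tol=1e-9):
    """Iterate map and reduce steps until the centers stop moving.

    `splits` stands in for the input splits that separate map tasks would
    process; here they are handled one after another in a single process.
    """
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        partials = []
        for split in splits:
            # Local combine per split keeps the data passed to reduce small.
            partials.extend(sum_by_key(map_phase(split, centers)).items())
        totals = sum_by_key(partials)  # reduce: merge the partial sums
        new_centers = [
            tuple(s / totals[i][1] for s in totals[i][0])
            if i in totals else centers[i]  # an empty cluster keeps its center
            for i in range(len(centers))
        ]
        shift = max(squared_distance(a, b)
                    for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers


if __name__ == "__main__":
    splits = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    print(kmeans_mapreduce(splits, [(0.0, 0.0), (10.0, 10.0)]))
```

Each pass over the data corresponds to one MapReduce job, which is why a growing iteration count is costly on large inputs and why the choice of initial centers matters.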

A Distributed Multi-exemplar Affinity Propagation Clustering Algorithm Based on MapReduce

Distributed Affinity Propagation Clustering Based on MapReduce

Multi-exemplar affinity propagation clustering based on local density peak

A 2-Tier Clustering Algorithm with Map-Reduce

A Parallel Implementation of the K-Means Algorithm Based on MapReduce

Parallel implementation of K-Means clustering algorithm based on mapReduce computing model of hadoop

An efficient PAM spatial clustering algorithm based on MapReduce

Large-scale Data Mining Method based on Clustering Algorithm Combined with MAPREDUCE

MapReduce-based distributed tensor clustering algorithm

Towards Scalable Subgraph Pattern Matching over Big Graphs on MapReduce

An Improved Parallel K-means Clustering Algorithm with MapReduce

Optimized Big Data K-means Clustering Using MapReduce

MR-Mafia: Parallel Subspace Clustering Algorithm Based on MapReduce for Large Multi-dimensional Datasets

K-Means Clustering with Bagging and MapReduce

An Improved Parallel K-means Algorithm Based on MapReduce

A Parallel Clustering Algorithm for Power Big Data Analysis

An Enhanced Agglomerative Fuzzy K-Means Clustering Method with Mapreduce Implementation on Hadoop Platform

Parallel clustering of very large document datasets with MapReduce

Research on K-medoids clustering algorithm based on data density and its parallel processing based on MapReduce

Grouping Users Using a Combination-Based Clustering Algorithm in the Service Environment

Research on Efficient K_Means Parallel Algorithm Based on Hadoop Distributed Architecture