Abstract: With the rapid development of information and network technology, data has become abundant, yet the shortage of knowledge extracted from it is increasingly severe. Data mining has developed vigorously in this environment. Based on the Spark programming model, this paper designs a parallel extension of fuzzy c-means. To improve the performance of this parallel extension, we borrow the initialization-phase improvement developed for k-means and extend k-means|| to fuzzy c-means, which yields better clustering performance. Combining this with Spark's programming model gives a scalable parallel fuzzy c-means algorithm. Experiments on several data sets show that the proposed algorithm has good scalability and parallelism: it effectively extends fuzzy c-means clustering to distributed applications and greatly increases the scale of data the algorithm can process. It also improves the robustness of the algorithm and its adaptability to the shape and structure of the data, so that parallel, scalable clustering can analyze big data more effectively. Three algorithms were simulated on the MATLAB platform using simple data sets and complex two-dimensional data sets, comparing the proposed algorithm with the traditional fuzzy c-means algorithm and a fuzzy c-means algorithm based on fuzzy entropy. The experiments show that the scalable parallel fuzzy c-means algorithm not only greatly improves noise resistance but also speeds up convergence, and that it can automatically determine the optimal number of clusters.
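
To make the parallel structure concrete, the following is a minimal sketch of one iteration of textbook fuzzy c-means written in a map-and-reduce style. It is not the paper's implementation: the function names (`memberships`, `partial_sums`, `merge`, `fcm_step`), the fuzzifier default `m = 2.0`, and the use of plain Python lists in place of Spark partitions are all assumptions made for illustration, and the k-means||-style initialization is not shown.

```python
from functools import reduce

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def memberships(point, centers, m=2.0):
    """Standard FCM membership of one point in every cluster (sums to 1)."""
    d = [sq_dist(point, c) for c in centers]
    zeros = [i for i, v in enumerate(d) if v == 0.0]
    if zeros:
        # A point lying exactly on one or more centers is shared among them.
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(d))]
    power = 1.0 / (m - 1.0)  # distances are squared, so the exponent is 1/(m-1)
    inv = [(1.0 / v) ** power for v in d]
    total = sum(inv)
    return [v / total for v in inv]

def partial_sums(points, centers, m=2.0):
    """'Map' side: per-partition numerators and denominators of the center update."""
    k, dim = len(centers), len(centers[0])
    num = [[0.0] * dim for _ in range(k)]
    den = [0.0] * k
    for p in points:
        u = memberships(p, centers, m)
        for j in range(k):
            w = u[j] ** m
            den[j] += w
            for t in range(dim):
                num[j][t] += w * p[t]
    return num, den

def merge(a, b):
    """'Reduce' side: add the partial sums of two partitions."""
    (num_a, den_a), (num_b, den_b) = a, b
    num = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(num_a, num_b)]
    den = [x + y for x, y in zip(den_a, den_b)]
    return num, den

def fcm_step(partitions, centers, m=2.0):
    """One FCM iteration over a list of partitions (hypothetical stand-in for an RDD)."""
    num, den = reduce(merge, (partial_sums(p, centers, m) for p in partitions))
    return [
        [v / den[j] for v in num[j]] if den[j] > 0 else list(centers[j])
        for j in range(len(centers))
    ]

# Usage: two partitions holding two well-separated groups of points.
partitions = [
    [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)],
    [(1.0, 0.0), (10.0, 11.0), (11.0, 10.0)],
]
centers = [(1.0, 1.0), (9.0, 9.0)]
for _ in range(20):
    centers = fcm_step(partitions, centers)
```

Because the center update only needs sums over all points, each partition can compute its contribution independently and the results can be added in any order, which is the property a Spark-style implementation would rely on.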
