Abstract: With the rapid development of various techniques, ever more data are being accumulated that are both large in sample size (tall) and high in dimension (wide). The era of big data is arriving. How to understand these data and discover new knowledge from them has attracted growing attention from scholars and has become a central task in data mining. Clustering analysis, a form of unsupervised learning and one of the most important techniques in data mining, groups a set of data objects into clusters that are meaningful, useful, or both. It therefore plays a very important role in knowledge discovery in big data. However, when applied to large, high-dimensional data, most current clustering methods exhibit poor computational efficiency and demand substantial computational resources, which hinders efforts to clarify the intrinsic properties of the data and to discover the knowledge behind them. Motivated by this, we developed a clustering method called MUFOLD-CL. Its principle is to project the data points onto the centroid and then to measure the similarity between any two points from their projections on the centroid. The proposed method achieves linear time complexity with respect to the sample size. In comparisons with the K-Means method on very large data, our method produced better accuracy and required less computational time, demonstrating that MUFOLD-CL can serve as a valuable tool for big data clustering, or at least play a complementary role to existing methods. Further comparisons with state-of-the-art clustering methods on smaller datasets showed that our method was the fastest and achieved comparable accuracy. For the convenience of other researchers, a free software package has been made available.
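
The projection idea stated in the abstract can be illustrated with a minimal sketch. This is not the MUFOLD-CL implementation, whose details the abstract does not give. It assumes that "projecting onto the centroid" means taking the scalar projection of each point onto the direction of the data centroid. The function names, the similarity formula, and the equal-width binning step are all hypothetical choices made for illustration only.

```python
import numpy as np


def centroid_projections(X):
    """Scalar projection of each row of X onto the direction of the data centroid.

    Assumption: the centroid is the mean of all points. Cost is O(n * d).
    """
    centroid = X.mean(axis=0)
    norm = np.linalg.norm(centroid)
    if norm == 0.0:
        raise ValueError("centroid is at the origin; projection direction is undefined")
    return X @ (centroid / norm)


def projection_similarity(p_i, p_j):
    """Hypothetical similarity: equals 1 for identical projections, decays with their gap."""
    return 1.0 / (1.0 + abs(p_i - p_j))


def group_by_projection(X, n_bins):
    """Illustrative grouping: equal-width bins over the projection values.

    A single pass over the n projections, so the whole sketch stays linear
    in the sample size. The binning rule is an assumption, not the paper's method.
    """
    proj = centroid_projections(X)
    lo, hi = proj.min(), proj.max()
    if hi == lo:
        return np.zeros(len(X), dtype=int)
    bins = ((proj - lo) / (hi - lo) * n_bins).astype(int)
    return np.minimum(bins, n_bins - 1)  # put the maximum value in the last bin


X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.1, 4.9]])
labels = group_by_projection(X, n_bins=2)
```

Because each point is reduced to a single number, comparing two points costs constant time regardless of the original dimension. The price is that points far apart in the full space can share similar projections, so a practical method would need further steps beyond this sketch.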

Fast algorithms for projected clustering

A Fast Algorithm for Clustering High Dimensional Feature Vectors

A Fast Algorithm for Density-Based Clustering in Large Database

Constraint-based Clustering by Fast Search and Find of Density Peaks

Efficient Approximate Algorithms for the Closest Pair Problem in High Dimensional Spaces.

An Algorithm for Clustering Based on Projected Cluster

Subspace Clustering by Directly Solving Discriminative K-means

A Fast Projection-Based Algorithm for Clustering Big Data

An Easy-to-Implement Framework of Fast Subspace Clustering for Big Data Sets.

Fast and Robust Subspace Clustering Using Random Projections.

Fast Clustering Using Adaptive Density Peak Detection

Using Projection-Based Clustering to Find Distance- and Density-Based Clusters in High-Dimensional Data

Fast gradient clustering

Fully Scalable MPC Algorithms for Clustering in High Dimension

Linear Time Algorithm for Projective Clustering

Faster Parallel Exact Density Peaks Clustering

An Algorithm for the Removal of Redundant Dimensions to Find Clusters in N-Dimensional Data Using Subspace Clustering

Simple, Scalable and Effective Clustering via One-Dimensional Projections

Dimensionality-reduced subspace clustering

On High Dimensional Projected Clustering of Data Streams

Projected Fuzzy C-Means Clustering Algorithm with Instance Penalty