Abstract: Applications in many domains, such as text mining and natural language processing, must handle high-dimensional data. Such data may exhibit better clustering structure in a selected low-dimensional subspace. Subspace clustering projects the data onto a low-dimensional subspace before clustering. Traditional subspace clustering methods use eigenvalue decomposition to find the projection of the input data and then run K-means or kernel K-means to obtain the clustering matrix. Such methods are not only inefficient but also rely on a two-step procedure that yields only an approximate solution. Discriminative K-means (DisKmeans) integrates dimensionality reduction and clustering into a joint framework and solves the optimization problem by kernel K-means, but it must iteratively find the centroids in the kernel space and the class labels, and it has quadratic time complexity. In this paper, we propose an algorithm, Fast DisKmeans (FDKM), that obtains the cluster indicator matrix directly. FDKM has linear time complexity, a significant reduction from the quadratic time complexity of DisKmeans. We also show that solving the objective function of DisKmeans is equivalent to representing the cluster assignment matrix by a low-dimensional linear mapping of the data. Based on this observation, we propose a second algorithm, Iterative Fast DisKmeans (IFDKM), which also has linear time complexity. Experiments on several datasets show the superior performance of FDKM and IFDKM.
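
For orientation, the traditional two-step pipeline that the abstract contrasts with FDKM can be sketched roughly as follows. This is only an illustration of the baseline (projection by eigenvalue decomposition, then K-means), not of FDKM or IFDKM. It assumes PCA as the eigendecomposition-based projection and a simple farthest-first initialization; the abstract specifies neither, and all function names are hypothetical.

```python
import numpy as np


def pca_project(X, dim):
    """Step 1 (assumed to be PCA): project onto the top-`dim` eigenvectors
    of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    _, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :dim]         # keep the leading eigenvectors
    return Xc @ W


def farthest_first_init(Z, k):
    """Deterministic initialization (an assumption made for this sketch)."""
    centroids = [Z[0]]
    for _ in range(1, k):
        # squared distance of every point to its nearest chosen centroid
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(Z[d.argmax()])
    return np.array(centroids)


def kmeans(Z, k, n_iter=100):
    """Step 2: plain Lloyd iterations on the projected data."""
    centroids = farthest_first_init(Z, k)
    for _ in range(n_iter):
        d = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid; keep the old one if a cluster is empty
        new = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids


def two_step_subspace_clustering(X, dim, k):
    """Project first, then cluster: the two stages are optimized separately."""
    Z = pca_project(X, dim)
    labels, _ = kmeans(Z, k)
    return labels
```

Because the projection is fixed before clustering starts, the two stages never inform each other, which is the approximation the abstract points to as motivation for a joint formulation.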

Parallel Subspace Clustering Using MapReduce

Distributed Affinity Propagation Clustering Based on MapReduce

Parallel spectral clustering algorithm

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

An efficient PAM spatial clustering algorithm based on MapReduce

A Parallel Varied Density-Based Clustering Algorithm with Optimized Data Partition

Subspace Clustering by Directly Solving Discriminative K-means

PARSUC: A Parallel Subsampling-Based Method for Clustering Remote Sensing Big Data

Large-Scale Subspace Clustering by Independent Distributed and Parallel Coding

New density clustering algorithm based on MapReduce

Fast Clustering using MapReduce

The performance of MapReduce: an in-depth study

Parallel Sparse Subspace Clustering Via Joint Sample and Parameter Blockwise Partition.

The Performance of MapReduce

K-Means Parallel Algorithm of Big Data Clustering Based on Mapreduce PCAM Method

A Unified Framework for Representation-Based Subspace Clustering of Out-of-Sample and Large-Scale Data.

Large-scale Data Mining Method based on Clustering Algorithm Combined with MAPREDUCE

An Improved K-means Algorithm Based on Mapreduce and Grid

CLUSTER-BASED OCEAN REMOTE SENSING IMAGE FUSION PARALLEL COMPUTING STRATEGY

An Easy-to-Implement Framework of Fast Subspace Clustering for Big Data Sets.

Distributed structural clustering on large graph