Abstract: In the world of big data, extracting meaningful insights from large, continually growing, distributed datasets is a major challenge. Classical clustering algorithms are effective at identifying clusters with convex structures. However, they fall short in identifying arbitrarily shaped clusters (more irregular and complex patterns), which are often encountered in real-world applications. Identifying non-convex cluster representations in very large and growing datasets is difficult, and the difficulty is compounded by the distributed nature of the data, which requires complex computations across multiple devices. Support Vector Clustering (SVC) is a well-known algorithm capable of finding arbitrarily shaped clusters. Its major limitation is that it does not scale to large volumes of data, because its time and space complexity are high. A second limitation is the large computation time required to find the cluster structures. A coreset-based methodology is therefore adopted to obtain a faithful representation of the underlying large datasets. Applying hierarchical clustering to these distributed coresets makes it possible to uncover a structured hierarchy of abstractions across the distributed data. Moreover, a distance-based clustering approach enables the identification of clusters with diverse and arbitrary shapes, providing a robust framework for detecting complex structures. This research uses the Core Vector Machine (CVM) approach, with an approximate Minimum Enclosing Ball (MEB) algorithm, to efficiently address the complexities inherent in traditional SVC. Additionally, an enhanced medoid algorithm is employed for cluster head identification across the data sources. Hierarchical clustering is performed in the Reproducing Kernel Hilbert Space (RKHS) using distance matrices based on cosine similarity, in order to identify compact non-convex clusters within distributed datasets. Performance is assessed by benchmarking our approach against state-of-the-art improved SVC algorithms on large datasets. The results show that our approach outperforms existing methods.
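
To make the coreset idea behind an approximate Minimum Enclosing Ball more concrete, the following is a minimal sketch of one well-known iterative scheme (the Badoiu-Clarkson farthest-point update). It is an illustration under stated assumptions, not the method used in this work: it operates in plain Euclidean input space rather than in the RKHS, and the function name `approx_meb` and the parameter `eps` are hypothetical names chosen for this example.

```python
import math


def approx_meb(points, eps=0.1):
    """Sketch of a (1 + eps)-approximate minimum enclosing ball.

    Assumption: points are equal-length tuples of floats in Euclidean space.
    Returns (center, radius, coreset_indices), where the coreset is the set
    of points that were selected as "farthest" during the iterations.
    """
    center = list(points[0])      # start from an arbitrary point
    coreset = {0}
    iterations = math.ceil(1.0 / eps ** 2)
    for t in range(1, iterations + 1):
        # Pick the point currently farthest from the center.
        far = max(range(len(points)),
                  key=lambda i: math.dist(center, points[i]))
        coreset.add(far)
        # Move the center a shrinking fraction of the way toward it.
        step = 1.0 / (t + 1)
        center = [c + step * (p - c) for c, p in zip(center, points[far])]
    radius = max(math.dist(center, p) for p in points)
    return center, radius, sorted(coreset)


# Usage: four corners of a square plus two interior points.
sample = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.0), (0.5, 1.5)]
center, radius, core = approx_meb(sample, eps=0.05)
print(center, radius, core)
```

Only the points that were ever chosen as farthest end up in the coreset, so its size is bounded by the number of iterations and does not depend on the size of the dataset. That independence is what makes coreset-based approaches attractive for large, distributed data.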
