Scalable decision fusion algorithm for enabling decentralized computation in distributed, big data clustering problems

H. S. Jennath, S. Asharaf
DOI: https://doi.org/10.1007/s13042-024-02121-7
2024-04-09
International Journal of Machine Learning and Cybernetics
Abstract: In the world of big data, extracting meaningful insights from large and continually growing distributed datasets is a major challenge. Classical clustering algorithms are effective at identifying clusters with convex structures. However, they fall short in identifying arbitrary-shaped clusters (more irregular and complex patterns), which are often encountered in real-world applications. The process of identifying non-convex cluster representations from very large and growing datasets is a challenge. It is further compounded by the distributed nature of the data, necessitating complex computations across multiple devices. Support Vector Clustering (SVC) is a much-celebrated algorithm capable of finding arbitrarily shaped clusters. However, the major limitation of this algorithm is that it will not scale to large volumes of data, as its time and space complexity is high. The second limitation of the SVC algorithm is the requirement for large computation time in finding cluster structures. The adoption of a coreset-based methodology is required for finding the true representation of the underlying large datasets. The implementation of hierarchical clustering on these distributed coresets unlocks the potential to uncover a structured hierarchy of abstractions across the disseminated data. Moreover, a distance-based clustering approach guarantees the identification of clusters with diverse and arbitrary shapes, providing a robust framework for detecting complex structures. This research utilizes the Core Vector Machine (CVM) approach using an approximate Minimum Enclosing Ball (MEB) algorithm to efficiently address the complexities inherent in traditional SVC. Additionally, an enhanced medoid algorithm is employed for cluster head identification across the data sources. Hierarchical clustering is performed in the Reproducing Kernel Hilbert Space (RKHS) using cosine similarity distance matrices. This is used to identify compact non-convex clusters within distributed datasets. Performance assessment involves benchmarking our approach against state-of-the-art improved SVC algorithms using large datasets. The outcomes validate the superior performance of our approach compared to existing methods.
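The abstract's final step, hierarchical clustering in the RKHS over cosine-similarity distances, can be illustrated with a small sketch. This is not the authors' implementation: the RBF kernel, the `gamma` value, the single-linkage rule, the cut threshold, and every function name below are assumptions chosen for illustration, and the coreset (CVM/MEB) and medoid stages are omitted entirely. The cosine distance between the feature-space images of two points is computed from kernel values alone, as 1 - k(x, y) / sqrt(k(x, x) k(y, y)).

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel; gamma=1.0 is an illustrative assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def kernel_cosine_distance(x, y, kernel=rbf_kernel):
    """1 - cosine similarity of the feature-space images of x and y."""
    return 1.0 - kernel(x, y) / math.sqrt(kernel(x, x) * kernel(y, y))

def single_linkage_clusters(points, threshold, kernel=rbf_kernel):
    """Cut a single-linkage hierarchy at `threshold` (assumed linkage rule).

    Merging every pair whose distance is below the threshold with a
    union-find structure yields the same partition as cutting the
    single-linkage dendrogram at that height.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if kernel_cosine_distance(points[i], points[j], kernel) < threshold:
                parent[find(i)] = find(j)

    # Relabel roots as consecutive cluster ids in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(len(points))]

def ring(radius, n):
    """n evenly spaced points on a circle centred at the origin."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

# Two concentric rings: a non-convex layout that centroid-based methods split badly.
points = ring(1.0, 40) + ring(3.0, 120)
labels = single_linkage_clusters(points, threshold=0.1)
```

In this toy setup, neighbouring points on each ring lie well within the threshold while the rings themselves are far apart, so each ring ends up as its own cluster. The pairwise loop is quadratic in the number of points, which is the cost that a coreset-based reduction of the kind the abstract describes is meant to avoid.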
computer science, artificial intelligence