MOStream: A Modular and Self-Optimizing Data Stream Clustering Algorithm

Zhengru Wang, Xin Wang, Shuhao Zhang
2024-06-17
Abstract: Data stream clustering is a critical operation in various real-world applications, ranging from the Internet of Things (IoT) to social media and financial systems. Existing data stream clustering algorithms, while effective to varying extents, often lack the flexibility and self-optimization capabilities needed to adapt to diverse workload characteristics such as outliers, cluster evolution, and changing dimensions in data points. These limitations manifest in suboptimal clustering accuracy and computational inefficiency. In this paper, we introduce MOStream, a modular and self-optimizing data stream clustering algorithm designed to dynamically balance clustering accuracy and computational efficiency at runtime. MOStream distinguishes itself by its adaptivity, clearly demarcating four pivotal design dimensions: the summarizing data structure, the window model for handling data temporality, the outlier detection mechanism, and the refinement strategy for improving cluster quality. This clear separation facilitates flexible adaptation to varying design choices and enhances its adaptability to a wide array of application contexts. We conduct a rigorous performance evaluation of MOStream, employing diverse configurations and benchmarking it against 9 representative data stream clustering algorithms on 4 real-world datasets and 3 synthetic datasets. Our empirical results demonstrate that MOStream consistently surpasses competing algorithms in terms of clustering accuracy, processing throughput, and adaptability to varying data stream characteristics.
Databases
What problem does this paper attempt to address?
The paper addresses a limitation of existing data stream clustering (DSC) algorithms: although effective to varying degrees, they lack the flexibility and self-optimization capabilities needed to adapt to different workload characteristics, such as outliers, cluster evolution, and changes in the dimensionality of data points. As a result, both clustering accuracy and computational efficiency suffer. The paper proposes MOStream, a modular and self-optimizing data stream clustering algorithm that dynamically balances clustering accuracy and computational efficiency as data stream characteristics change across application scenarios. To do so, MOStream separates its design into four components:

1. **Summarizing Data Structure**: Represents data points compactly, capturing key information while minimizing memory usage.
2. **Window Model**: Handles the temporal nature of data streams by focusing on the most recent data points and discarding obsolete information.
3. **Outlier Detection Mechanism**: Identifies and handles noise and outliers in the data stream so that they do not degrade clustering quality.
4. **Refinement Strategy**: Updates and optimizes the clustering model as new data points arrive, so that it tracks changes in the data distribution.

MOStream addresses these challenges by dynamically detecting workload characteristics and reconfiguring itself accordingly.
Experimental results show that MOStream outperforms nine existing representative data stream clustering algorithms in clustering accuracy, processing throughput, and adaptability to changes in data stream characteristics.
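
To make the four-part decomposition more concrete, the following Python sketch shows one way such a modular clusterer could be laid out. It is not the authors' implementation: the class names (`MicroCluster`, `ModularStreamClusterer`), the parameters (`radius`, `decay_factor`, `min_weight`, `refine_every`, `outlier_weight`), and the specific choices made for each component (a clustering-feature summary, a damped window, a weight-based outlier filter, and a merge-based refinement) are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Summarizing data structure (assumed): a weight plus per-dimension linear sum."""
    weight: float
    linear_sum: list

    def centroid(self):
        return [s / self.weight for s in self.linear_sum]

    def absorb(self, point):
        self.weight += 1.0
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]

    def decay(self, factor):
        # Scaling both fields keeps the centroid fixed while shrinking its influence.
        self.weight *= factor
        self.linear_sum = [s * factor for s in self.linear_sum]


class ModularStreamClusterer:
    """Hypothetical skeleton with one hook per design dimension."""

    def __init__(self, radius, decay_factor=1.0, min_weight=0.1, refine_every=0):
        self.radius = radius              # absorption threshold for a micro-cluster
        self.decay_factor = decay_factor  # window model: 1.0 disables decay
        self.min_weight = min_weight      # decayed clusters below this are dropped
        self.refine_every = refine_every  # refinement: 0 disables periodic merging
        self.clusters = []
        self.seen = 0

    def insert(self, point):
        self.seen += 1
        # Window model (damped window, assumed): age summaries, drop stale ones.
        if self.decay_factor < 1.0:
            for mc in self.clusters:
                mc.decay(self.decay_factor)
            self.clusters = [mc for mc in self.clusters if mc.weight >= self.min_weight]
        # Summarizing structure: absorb the point into the nearest summary if close.
        nearest = min(self.clusters,
                      key=lambda mc: math.dist(mc.centroid(), point),
                      default=None)
        if nearest is not None and math.dist(nearest.centroid(), point) <= self.radius:
            nearest.absorb(point)
        else:
            # Far points start a new low-weight summary (a potential outlier).
            self.clusters.append(MicroCluster(1.0, list(point)))
        # Refinement strategy (assumed): periodically merge nearby summaries.
        if self.refine_every and self.seen % self.refine_every == 0:
            self._refine()

    def _refine(self):
        merged = []
        for mc in self.clusters:
            for other in merged:
                if math.dist(other.centroid(), mc.centroid()) <= self.radius:
                    other.weight += mc.weight
                    other.linear_sum = [a + b for a, b in
                                        zip(other.linear_sum, mc.linear_sum)]
                    break
            else:
                merged.append(mc)
        self.clusters = merged

    def result(self, outlier_weight=2.0):
        # Outlier detection (assumed): summaries below the weight cut-off are noise.
        return [mc.centroid() for mc in self.clusters if mc.weight >= outlier_weight]
```

Because each component is controlled by its own attribute, a monitoring routine could change `decay_factor` or `refine_every` between insertions to trade accuracy against throughput. How MOStream itself detects workload changes and selects a configuration is not described in this summary, so that logic is left out of the sketch.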