Abstract: The number of connected devices is steadily increasing, and these devices continuously generate data streams. Real-time processing of data streams is attracting growing interest despite its many challenges. Clustering is one of the most suitable methods for real-time data stream processing because it can be applied with less prior information about the data and does not need labeled instances. However, data stream clustering differs from traditional clustering in many respects and raises several challenging issues. Here, we provide information regarding the concepts and common characteristics of data streams, such as concept drift, data structures for data streams, time window models and outlier detection. We comprehensively review recent data stream clustering algorithms and analyze them in terms of the base clustering technique, computational complexity and clustering accuracy. A comparison of these algorithms is given. We indicate popular data stream repositories and datasets, stream processing tools and platforms. Open problems in data stream clustering are also discussed.
What problem does this paper attempt to address?
This paper addresses the problem of performing effective cluster analysis on data streams. As the number of connected devices grows, these devices keep generating data streams, and real-time processing of those streams is becoming increasingly important. Unlike traditional static data sets, however, data streams are unbounded, arrive sequentially, and must be processed within a short time, which poses new challenges for clustering algorithms. Specifically, the paper focuses on the following aspects:
1. **Concept Drift**: The statistical properties of the instances in a data stream may change over time; this phenomenon is called concept drift. Traditional clustering algorithms usually assume that the data distribution is fixed, but this assumption does not hold for data streams, so algorithms are needed that can detect and adapt to concept drift.
2. **Data Structure**: Data streams cannot be stored in their entirety; only summary information can be kept. Special data structures are therefore needed to summarize and store data streams effectively.
3. **Time Window Model**: To process the most recent data efficiently rather than the entire data stream, different time window models have been proposed, such as the decay window, landmark window, and sliding window models.
4. **Outlier Detection**: There may be outliers in the data stream, which may be caused by malicious activities, instrument errors, transmission problems, etc. Outliers will affect the clustering results, so effective methods are needed to detect and handle these outliers.
5. **Real-Time Processing**: Data must be clustered as soon as it arrives, which means that the algorithm must be able to process each new instance in a very short time.
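
To make items 2 to 4 more concrete, the following is a minimal Python sketch of a fading summary structure. It is an illustration only and is not taken from the paper: the class name `MicroCluster`, the `decay` parameter, the base-2 fading function, and the fixed distance threshold used for outlier checks are all assumptions chosen for the example. The summary keeps only a weight, a per-dimension linear sum, and a per-dimension squared sum, so the raw points never need to be stored, and older points lose influence over time in the spirit of a decay window.

```python
import math


class MicroCluster:
    """Hypothetical fading summary of stream points (not the paper's code).

    Stores only a weight, a linear sum, and a squared sum per dimension.
    """

    def __init__(self, dim, decay=0.01):
        self.decay = decay              # assumed decay rate; 0 disables fading
        self.weight = 0.0               # faded count of absorbed points
        self.linear_sum = [0.0] * dim   # faded sum of coordinates
        self.squared_sum = [0.0] * dim  # faded sum of squared coordinates
        self.last_update = 0.0          # timestamp of the last insertion

    def _fade(self, now):
        # Decay-window idea: scale old statistics by 2^(-decay * elapsed).
        factor = 2.0 ** (-self.decay * (now - self.last_update))
        self.weight *= factor
        self.linear_sum = [v * factor for v in self.linear_sum]
        self.squared_sum = [v * factor for v in self.squared_sum]
        self.last_update = now

    def insert(self, point, now):
        # Constant-time update per point: fade the summary, then absorb the point.
        self._fade(now)
        self.weight += 1.0
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x

    def centroid(self):
        return [v / self.weight for v in self.linear_sum]

    def radius(self):
        # Square root of the summed per-dimension (faded) variance.
        variance = sum(
            max(ss / self.weight - (ls / self.weight) ** 2, 0.0)
            for ls, ss in zip(self.linear_sum, self.squared_sum)
        )
        return math.sqrt(variance)

    def is_outlier(self, point, threshold):
        # Simple assumed rule: a point is an outlier if it lies farther
        # than `threshold` from the current centroid.
        return math.dist(point, self.centroid()) > threshold


mc = MicroCluster(dim=2, decay=1.0)
mc.insert([0.0, 0.0], now=0.0)
mc.insert([2.0, 2.0], now=1.0)  # the first point now counts only half
print(mc.weight, mc.centroid(), mc.radius())
print(mc.is_outlier([10.0, 10.0], threshold=5.0))
```

Because the newer point carries more weight, the centroid sits closer to `[2.0, 2.0]` than to the plain average `[1.0, 1.0]`. A summary that adapts in this way is one simple means of following a drifting distribution, although it does not by itself detect concept drift.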
The paper reviews existing data stream clustering algorithms, analyzes them in terms of their base clustering technique, computational complexity, and clustering accuracy, and points out open problems in current research. It also lists popular data stream repositories, data sets, stream processing tools, and platforms, providing a comprehensive reference for researchers.