An incremental clustering algorithm based on semantic concepts
Mahboubeh Soleymanian, Hoda Mashayekhi, Marziea Rahimi
DOI: https://doi.org/10.1007/s10115-024-02063-0
IF: 2.7
2024-02-15
Knowledge and Information Systems
Abstract: The evolution of data in text streams may cause feature and concept drifts. The former, while less discussed in the literature, poses challenges for learning algorithms by changing the feature space of text representation. A common approach for handling concept drift is to maintain summarized groups of documents, known as micro-clusters. Despite its benefits, this scheme restricts document representation and poses challenges in the face of feature drift. In this paper, we propose an incremental text clustering algorithm that deals with both kinds of drift. The algorithm uses incremental word embedding, which is rarely studied in the context of evolving data streams. We also propose a novel approach that leverages hierarchical summarized concepts instead of micro-clusters. The concepts reflect the semantic structure of the text stream and are continuously updated in the face of concept drift and evolution. The proposed method enables a customized, low-dimensional, and interpretable document representation, which improves the clustering quality. By employing concept modeling, in contrast with many available approaches, the proposed algorithm detaches the process of handling data evolution from document clustering. This modularization enables arbitrary variation in the granularity of document representation and allows for customized clustering when accessing historical documents is impractical. Experimental results on several real datasets, and comparisons with other incremental and non-incremental methods, show that the proposed algorithm can deal with dynamics in the feature space, and with concept drift and evolution, while preserving its accuracy.
computer science, information systems, artificial intelligence
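
The abstract describes documents being represented through summarized concepts that are updated as the stream evolves, with clustering kept separate from that step. The following is a minimal illustrative sketch of that general idea, not the authors' algorithm: it uses a flat set of concepts rather than a hierarchy, hand-written toy word vectors in place of an incremental word embedding, and a hypothetical `ConceptModel` class with an assumed similarity threshold.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class ConceptModel:
    """Hypothetical flat concept summary: one running-mean centroid per concept."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed cutoff below which a new concept is opened
        self.centroids = []
        self.counts = []

    def nearest(self, vec):
        """Return (index, similarity) of the closest concept, or (None, 0.0) if none exist."""
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if best is None or sim > best_sim:
                best, best_sim = i, sim
        return best, best_sim

    def update(self, vec):
        """Absorb one word vector: refine the nearest concept or open a new one."""
        idx, sim = self.nearest(vec)
        if idx is None or sim < self.threshold:
            self.centroids.append(list(vec))
            self.counts.append(1)
            return len(self.centroids) - 1
        n = self.counts[idx] + 1
        # Running mean, so no past word vectors need to be stored.
        self.centroids[idx] = [c + (v - c) / n for c, v in zip(self.centroids[idx], vec)]
        self.counts[idx] = n
        return idx

    def represent(self, word_vecs):
        """Describe a document as the share of its words that fall in each concept."""
        rep = [0.0] * len(self.centroids)
        for vec in word_vecs:
            idx, _ = self.nearest(vec)
            if idx is not None:
                rep[idx] += 1.0 / len(word_vecs)
        return rep


# Toy stream of 2-d word vectors (assumed values, for illustration only).
model = ConceptModel(threshold=0.8)
for word_vec in [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]:
    model.update(word_vec)

# A three-word document: two words near the first concept, one near the second.
doc = model.represent([[1.0, 0.05], [0.05, 1.0], [0.95, 0.0]])
```

In this sketch the document vector has one entry per concept, so it stays low-dimensional and each entry can be read as "how much of the document belongs to this concept". Any clustering method could then be run on these vectors without revisiting the original documents.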