A Document Ensemble Clustering Approach Via Dimensionality Reduction
Xingzhi Li, Zhiwei Wang, Qi Lang, Xiaodong Liu
DOI: https://doi.org/10.23919/ccc63176.2024.10661621
2024-01-01
Abstract: The internet’s rapid expansion has generated abundant text data, making the extraction of key information a formidable challenge, and document clustering is a powerful tool for addressing it. Earlier work commonly relied on a single clustering method, which may not suit every dataset. In addition, documents are often represented as high-dimensional, sparse vectors, which leads to the curse of dimensionality and limits the effectiveness of clustering algorithms. This paper therefore proposes a document ensemble clustering method based on dimensionality reduction, which overcomes the limitations of using a single clustering method and is thus suitable for most text data. First, dimensionality reduction techniques are applied to preprocess the generated document embeddings. Next, the reduced embedding vectors are fed into an ensemble clustering module, where distances between samples are computed using multiple clustering methods. The resulting distance matrices are then combined by weighted summation, with the weights adjusted to obtain an appropriate distance matrix. Finally, the K-means method is applied to partition the documents into clusters, from which the most relevant keywords are extracted. In an empirical evaluation, the proposed method achieved excellent results on the 20Newsgroups and Reuters-21578 datasets.
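The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not name the reduction technique, the base clustering methods, how each method yields a distance matrix, or the weights, so every one of those choices below is an assumption made for illustration: PCA for the reduction, K-means and two agglomerative variants as base clusterers, a co-association distance (0 if two documents share a cluster, 1 otherwise), and caller-supplied weights. The function names `co_association_distance` and `ensemble_cluster` are hypothetical and not taken from the paper.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA


def co_association_distance(labels):
    """Return an n x n matrix: 0 where two samples share a cluster, 1 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] != labels[None, :]).astype(float)


def ensemble_cluster(embeddings, n_clusters, n_components, weights, seed=0):
    """Illustrative ensemble clustering; all component choices are assumptions."""
    # Step 1: reduce the dimensionality of the document embeddings (PCA assumed).
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(embeddings)

    # Step 2: run several base clusterers (assumed choices) on the reduced vectors
    # and convert each labelling into a pairwise distance matrix.
    base_labels = [
        KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(reduced),
        AgglomerativeClustering(n_clusters=n_clusters, linkage="ward").fit_predict(reduced),
        AgglomerativeClustering(n_clusters=n_clusters, linkage="average").fit_predict(reduced),
    ]
    distances = [co_association_distance(labels) for labels in base_labels]

    # Step 3: combine the distance matrices by a normalized weighted sum.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    combined = sum(wi * d for wi, d in zip(w, distances))

    # Step 4: apply K-means, treating each row of the combined matrix
    # as the feature vector of the corresponding document.
    final = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(combined)
    return final, combined
```

A call such as `ensemble_cluster(X, n_clusters=3, n_components=5, weights=[0.5, 0.3, 0.2])` returns the final labels together with the combined distance matrix. In this sketch the weights are fixed by the caller; the abstract indicates that they are adjusted to obtain an appropriate matrix, but does not say how.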