Anomaly Aware Symmetric Non-negative Matrix Factorization for Short Text Clustering
Ximing Li, Yuanyuan Guan, Bo Fu, Zhongxuan Luo
DOI: https://doi.org/10.1007/s10115-024-02226-z
IF: 2.7
2024-01-01
Knowledge and Information Systems
Abstract: Short text clustering is a significant yet challenging task, because short texts generated on the Internet are extremely sparse, noisy, and ambiguous. This sparsity makes traditional clustering methods, e.g., the k-means family and topic modeling, much less effective. Fortunately, recent advances in document distance, e.g., word mover's distance, and document representation, e.g., BERT, can accurately measure the similarities between short texts, especially between a text and its nearest neighbors. Inspired by these advances and observations, we induce short text clusters by directly factorizing the informative affinity matrix of nearest neighbors into the product of the cluster assignment matrix with its transpose, following the intuition that neighboring short texts tend to be assigned to the same cluster. However, because short texts are noisy, many of them can be regarded as outliers or near outliers, which introduces many noisy neighboring similarities into the affinity matrix. To alleviate this problem, we enhance the affinity matrix factorization by (1) incorporating a sparse noise matrix that directly captures noisy neighboring similarities and (2) regularizing the cluster assignment matrix with the ℓ_{2,1} norm to eliminate hard-to-cluster short texts (called pseudo-outliers), so as to indirectly neglect the noisy neighboring similarities that correspond to them. After this factorization for pre-clustering, we train a classifier over the resulting clusters and use it to finally assign each pseudo-outlier to a cluster. We call this novel clustering method anomaly-aware symmetric non-negative matrix factorization (A^2SNMF). Experimental results on benchmark short text datasets demonstrate that A^2SNMF performs very competitively with the existing baseline methods. The code is available at https://github.com/wizardbo/A3SNMF_functions.
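The factorization described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not state the objective or the update rules, so the sketch assumes the objective ||A - H H^T - S||_F^2 + lam * ||S||_1 + beta * ||H||_{2,1} with H >= 0, and minimizes it by alternating an exact soft-thresholding update for S with a backtracking proximal-gradient step for H. The names robust_symnmf and assign_clusters, and the parameters lam and beta, are hypothetical. The final classifier step for pseudo-outliers is not shown.

```python
import numpy as np

def soft_threshold(X, tau):
    """Element-wise shrinkage: proximal operator of tau * ||X||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def row_shrink(H, tau):
    """Row-wise shrinkage: proximal operator of tau * ||H||_{2,1}."""
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    return H * np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))

def objective(A, H, S, lam, beta):
    """Assumed objective: fit + sparsity of S + row sparsity of H."""
    fit = np.linalg.norm(A - H @ H.T - S) ** 2
    return fit + lam * np.abs(S).sum() + beta * np.linalg.norm(H, axis=1).sum()

def robust_symnmf(A, k, lam=0.5, beta=0.1, n_iter=200, seed=0):
    """Alternating minimization for a symmetric affinity matrix A (n x n)."""
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[0], k))
    S = np.zeros_like(A, dtype=float)
    history = [objective(A, H, S, lam, beta)]
    for _ in range(n_iter):
        # S-step: exact minimizer, absorbs residuals larger than lam / 2.
        S = soft_threshold(A - H @ H.T, lam / 2.0)
        # H-step: gradient of the fit term (valid because A - S is symmetric).
        grad = -4.0 * (A - S - H @ H.T) @ H
        current = objective(A, H, S, lam, beta)
        step = 1.0
        while step > 1e-10:
            # Clip to the non-negative orthant, then shrink rows.
            H_new = row_shrink(np.maximum(H - step * grad, 0.0), step * beta)
            if objective(A, H_new, S, lam, beta) <= current:
                H = H_new
                break
            step *= 0.5  # backtrack until the objective does not increase
        history.append(objective(A, H, S, lam, beta))
    return H, S, history

def assign_clusters(H, tol=1e-8):
    """Arg-max cluster per row; rows shrunk to zero are marked -1."""
    labels = H.argmax(axis=1)
    labels[np.linalg.norm(H, axis=1) <= tol] = -1
    return labels

if __name__ == "__main__":
    # Toy affinity matrix: two blocks of four mutually similar texts.
    A = np.zeros((8, 8))
    A[:4, :4] = 1.0
    A[4:, 4:] = 1.0
    H, S, history = robust_symnmf(A, k=2)
    print(assign_clusters(H))
```

In this sketch, rows of H that the ℓ_{2,1} proximal step shrinks to zero play the role of pseudo-outliers and receive the label -1. Under the pipeline described in the abstract, those texts would then be assigned by a classifier trained on the remaining clustered texts.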