Investigating topic modeling techniques through evaluation of topics discovered in short texts data across diverse domains

R. Muthusami, N. Mani Kandan, K. Saritha, B. Narenthiran, N. Nagaprasad, Krishnaraj Ramaswamy
DOI: https://doi.org/10.1038/s41598-024-61738-4
IF: 4.6
2024-05-27
Scientific Reports
Abstract: The online channel has affected many facets of an individual's identity, commercial, social policy, and culture, among others. It implies that discovering the topics on which these brief writings are focused, as well as examining the qualities of these short texts is critical. Another key issue that has been identified is the evaluation of newly discovered topics in terms of topic quality, which includes topic separation and coherence. A topic modeling method has been shown to be an outstanding aid in the linguistic interpretation of quite tiny texts. Based on the underlying strategy, topic models are divided into two categories: probabilistic methods and non-probabilistic methods. In this research, short texts are analyzed using topic models, including latent Dirichlet allocation (LDA) for probabilistic topic modeling and non-negative matrix factorization (NMF) for non-probabilistic topic modeling. A novel approach for topic evaluation is used, such as clustering methods and silhouette analysis on both models, to investigate performance in terms of quality. The experiment results indicate that the proposed evaluation method outperforms on both LDA and NMF.
multidisciplinary sciences
What problem does this paper attempt to address?
This paper addresses the evaluation of topic quality in short-text topic modeling. It asks how to assess the topics discovered by two families of models, probabilistic (represented by LDA) and non-probabilistic (represented by NMF), using clustering methods and silhouette analysis. Quality is judged in terms of topic separation and topic coherence. The paper notes that although earlier studies have applied topic modeling to short texts, evaluating the quality of the discovered topics remains a challenge.

To address this, the paper proposes an evaluation method that measures how well LDA and NMF discover topics by means of clustering techniques and silhouette coefficients. The experimental results indicate that the method performs well on both models.

The paper also compares clustering linkage methods (Ward's method, single linkage, complete linkage, average linkage, and McQuitty's method) on the LDA model and presents the resulting clusterings as dendrograms. It reports that McQuitty linkage performs best for evaluating topic model quality, with clustering consistency close to 100%, showing good clustering and topic cohesion. The method is offered as a tool and reference for subsequent research.
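To make the silhouette-based evaluation concrete, the following is a minimal sketch in plain Python. It is not the authors' pipeline: it assumes each short text is represented by its document-topic weights (as LDA or NMF would produce), that each text is assigned to its dominant topic, and that Euclidean distance is used. The `doc_topic` values and the function names are hypothetical, chosen only for illustration.

```python
from math import dist


def silhouette_scores(points, labels):
    """Return the silhouette coefficient s(i) for every point.

    s(i) = (b - a) / max(a, b), where a is the mean distance from point i
    to the other members of its own cluster and b is the lowest mean
    distance from point i to the members of any other cluster.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette analysis needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a singleton cluster scores 0.
            scores.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        # Guard against identical points, where both distances are 0.
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def mean_silhouette(points, labels):
    """Average silhouette over all points; closer to 1 means better separation."""
    scores = silhouette_scores(points, labels)
    return sum(scores) / len(scores)


# Hypothetical document-topic weights: one row per short text,
# one column per topic (made-up values, not data from the paper).
doc_topic = [
    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
    [0.10, 0.90], [0.20, 0.80], [0.15, 0.85],
]

# Assumption: each text is assigned to the topic with the largest weight.
dominant_topic = [row.index(max(row)) for row in doc_topic]

score = mean_silhouette(doc_topic, dominant_topic)
print(f"mean silhouette: {score:.3f}")
```

A mean silhouette near 1 suggests that texts sit close to others sharing their topic and far from the rest, which corresponds to the separation and cohesion the summary describes; values near 0 or below suggest overlapping topics. The same score could be computed for the output of either model to compare them under identical assumptions.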