An optimized BERTopic framework based on cluster silhouette for improving topic coherence

Huiying Yan, Yu Zhang
DOI: https://doi.org/10.1145/3652628.3652721
2023-11-17
Abstract: Topic modeling is a useful method for discovering latent topics in a collection of documents. In this study, the BERTopic model is applied to three distinct news datasets to explore the relationship between min_cluster_size and cluster silhouette scores, as well as the Spearman correlation between cluster silhouette scores and topic coherence. We find a strong positive correlation between topic coherence and cluster silhouette scores. Building on these insights, we propose an optimized BERTopic framework that improves topic coherence by iterating over min_cluster_size to find the optimal silhouette score. The innovation lies in establishing the correlation between clustering silhouette scores and topic coherence. To adapt to different corpora, min_cluster_size is varied dynamically to attain the optimal silhouette score, thereby improving topic coherence. Experimental results demonstrate a 12.24% improvement in topic coherence over the default BERTopic framework, showing the potential value of the optimized framework for enhancing the effectiveness of unsupervised topic modeling.
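The search described in the abstract (try several values of min_cluster_size, score each clustering by its silhouette, keep the best) can be illustrated with a minimal sketch. This is not the authors' implementation. The helper names (`silhouette`, `toy_cluster`, `search_min_cluster_size`), the toy 1-D data, and the candidate list are all hypothetical. The sketch also assumes that noise points carry the label -1 (the HDBSCAN convention) and are excluded from the silhouette, which the abstract does not specify. `toy_cluster` is only a stand-in for the density-based clustering step so the example runs with the standard library alone.

```python
from math import dist
from statistics import mean

NOISE = -1  # assumed noise label, following the HDBSCAN convention


def silhouette(points, labels):
    """Mean silhouette over non-noise points; None if fewer than 2 clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        if l != NOISE:
            clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return None  # the silhouette is undefined for a single cluster
    scores = []
    for l, own in clusters.items():
        for p in own:
            if len(own) == 1:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            # a: mean distance to the other members of p's own cluster
            a = sum(dist(p, q) for q in own) / (len(own) - 1)
            # b: smallest mean distance from p to any other cluster
            b = min(mean(dist(p, q) for q in other)
                    for k, other in clusters.items() if k != l)
            scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return mean(scores)


def toy_cluster(points, min_cluster_size, gap=1.0):
    """Hypothetical stand-in clusterer: split sorted 1-D points at gaps
    larger than `gap`; groups below min_cluster_size become noise."""
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    groups, current = [], [order[0]]
    for prev, i in zip(order, order[1:]):
        if points[i][0] - points[prev][0] > gap:
            groups.append(current)
            current = []
        current.append(i)
    groups.append(current)
    labels = [NOISE] * len(points)
    next_label = 0
    for group in groups:
        if len(group) >= min_cluster_size:
            for i in group:
                labels[i] = next_label
            next_label += 1
    return labels


def search_min_cluster_size(points, candidates, cluster_fn):
    """Return (size, score) for the candidate with the highest silhouette."""
    best_size, best_score = None, None
    for size in candidates:
        score = silhouette(points, cluster_fn(points, size))
        if score is not None and (best_score is None or score > best_score):
            best_size, best_score = size, score
    return best_size, best_score


# Toy 1-D "embeddings": three groups of sizes 3, 2, and 4.
points = [(0.0,), (0.2,), (0.4,), (2.5,), (2.7,),
          (5.0,), (5.1,), (5.3,), (5.5,)]
best_size, best_score = search_min_cluster_size(points, [2, 3, 4, 5], toy_cluster)
print(best_size, best_score)
```

On this toy data the search selects a min_cluster_size of 3, because sizes 4 and 5 leave fewer than two clusters and are skipped. In a real pipeline, `cluster_fn` would wrap the clustering step applied to the reduced document embeddings, and the selected value would then be used to fit the final topic model. Because noise is excluded here, a very large min_cluster_size can inflate the score by discarding most documents, so the share of points labelled as noise is worth tracking alongside the silhouette.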
Computer Science