Effective Neural Topic Modeling with Embedding Clustering Regularization

Xiaobao Wu, Xinshuai Dong, Thong Nguyen, Anh Tuan Luu
2023-06-07
Abstract: Topic models have been prevalent for decades with various applications. However, existing topic models commonly suffer from the notorious topic collapsing: discovered topics semantically collapse towards each other, leading to highly repetitive topics, insufficient topic discovery, and damaged model interpretability. In this paper, we propose a new neural topic model, Embedding Clustering Regularization Topic Model (ECRTM). Besides the existing reconstruction error, we propose a novel Embedding Clustering Regularization (ECR), which forces each topic embedding to be the center of a separately aggregated word embedding cluster in the semantic space. This enables each produced topic to contain distinct word semantics, which alleviates topic collapsing. Regularized by ECR, our ECRTM generates diverse and coherent topics together with high-quality topic distributions of documents. Extensive experiments on benchmark datasets demonstrate that ECRTM effectively addresses the topic collapsing issue and consistently surpasses state-of-the-art baselines in terms of topic quality, topic distributions of documents, and downstream classification tasks.
Computation and Language
What problem does this paper attempt to address?
The paper addresses "topic collapsing" in existing topic models: discovered topics drift semantically close to one another, which produces highly repetitive topics, leaves other topics undiscovered, and harms model interpretability. To solve this, the authors propose a new neural topic model, the **Embedding Clustering Regularization Topic Model** (ECRTM). Its main innovation is a novel **Embedding Clustering Regularization** (ECR), which keeps topic embeddings from collapsing by forcing each one to be the center of a separately aggregated cluster of word embeddings in the semantic space. Specifically, ECR models the soft clustering assignment between topic embeddings and word embeddings by solving a specifically defined optimal transport problem, so that each generated topic covers distinct word semantics.

ECRTM combines the traditional topic modeling objective (the reconstruction error) with the ECR objective, which lets it generate topics that are both diverse and coherent, together with high-quality document-topic distributions. Extensive experiments on benchmark datasets show that ECRTM effectively alleviates topic collapsing and outperforms state-of-the-art baselines in topic quality, document-topic distribution quality, and downstream classification tasks.
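
To make the idea of a soft clustering assignment obtained from optimal transport more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a squared Euclidean transport cost, equal mass on every word and every topic, and an entropic-regularized problem solved with Sinkhorn iterations, none of which this summary states. The function names, the `epsilon` value, and the toy 2-D embeddings are all hypothetical.

```python
import math


def squared_distances(word_embs, topic_embs):
    """Cost matrix: cost[j][k] = squared Euclidean distance, word j to topic k."""
    return [[sum((w - t) ** 2 for w, t in zip(word, topic)) for topic in topic_embs]
            for word in word_embs]


def sinkhorn_plan(cost, epsilon=0.5, n_iters=200):
    """Entropic optimal transport plan with uniform marginals (assumed here).

    Each word carries mass 1/V and each topic must receive mass 1/K, so no
    topic can end up with an empty cluster of words.
    """
    n_words, n_topics = len(cost), len(cost[0])
    word_mass = 1.0 / n_words
    topic_mass = 1.0 / n_topics
    # Gibbs kernel; a very small epsilon could underflow to 0 in this naive form.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n_words
    v = [1.0] * n_topics
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [word_mass / sum(kernel[j][k] * v[k] for k in range(n_topics))
             for j in range(n_words)]
        v = [topic_mass / sum(kernel[j][k] * u[j] for j in range(n_words))
             for k in range(n_topics)]
    return [[u[j] * kernel[j][k] * v[k] for k in range(n_topics)]
            for j in range(n_words)]


def ecr_style_penalty(word_embs, topic_embs, epsilon=0.5):
    """Transport-weighted distance between words and topics, plus the plan."""
    cost = squared_distances(word_embs, topic_embs)
    plan = sinkhorn_plan(cost, epsilon)
    penalty = sum(c * p
                  for cost_row, plan_row in zip(cost, plan)
                  for c, p in zip(cost_row, plan_row))
    return penalty, plan


# Toy data: four word embeddings forming two clusters in 2-D.
words = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
separated_topics = [[0.05, 0.0], [0.95, 1.0]]  # one topic per word cluster
collapsed_topics = [[0.5, 0.5], [0.5, 0.5]]    # both topics at the same point

separated_penalty, separated_plan = ecr_style_penalty(words, separated_topics)
collapsed_penalty, collapsed_plan = ecr_style_penalty(words, collapsed_topics)
print("separated:", separated_penalty)
print("collapsed:", collapsed_penalty)
```

In this toy setup the penalty is lower when each topic embedding sits at the center of its own word cluster than when both topics collapse onto the same point, which is the behavior the regularizer is described as encouraging. In a real model the penalty would be computed with a tensor library and added to the reconstruction loss so that gradients can move the embeddings; the weighting between the two terms is not given in this summary.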