FASTopic: Pretrained Transformer is a Fast, Adaptive, Stable, and Transferable Topic Model

Xiaobao Wu, Thong Nguyen, Delvin Ce Zhang, William Yang Wang, Anh Tuan Luu
2024-10-26
Abstract: Topic models have been evolving rapidly over the years, from conventional to recent neural models. However, existing topic models generally struggle with either effectiveness, efficiency, or stability, highly impeding their practical applications. In this paper, we propose FASTopic, a fast, adaptive, stable, and transferable topic model. FASTopic follows a new paradigm: Dual Semantic-relation Reconstruction (DSR). Instead of previous conventional, VAE-based, or clustering-based methods, DSR directly models the semantic relations among document embeddings from a pretrained Transformer and learnable topic and word embeddings. By reconstructing through these semantic relations, DSR discovers latent topics. This brings about a neat and efficient topic modeling framework. We further propose a novel Embedding Transport Plan (ETP) method. Rather than early straightforward approaches, ETP explicitly regularizes the semantic relations as optimal transport plans. This addresses the relation bias issue and thus leads to effective topic modeling. Extensive experiments on benchmark datasets demonstrate that our FASTopic shows superior effectiveness, efficiency, adaptivity, stability, and transferability, compared to state-of-the-art baselines across various scenarios.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the shortcomings of existing topic models in efficiency, effectiveness, and stability. Specifically:

1. **Efficiency problem**: Topic models based on variational auto-encoders (VAEs) perform well, but they have high computational complexity and are slow on large-scale datasets. For example, some models may take several hours to process a dataset containing 10,000 documents.
2. **Effectiveness problem**: Clustering-based methods are efficient, but they often generate repetitive topics that lack diversity, and they infer inaccurate document-topic distributions.
3. **Stability problem**: Existing neural topic models are very sensitive to hyperparameters, and their performance fluctuates greatly across scenarios, especially when the data domain, vocabulary size, or document length changes.

To solve these problems, the paper proposes a new topic model, FASTopic. Its main features are as follows:

- **Fast**: It improves computational efficiency by simplifying the model structure.
- **Adaptive**: It maintains good performance across different scenarios.
- **Stable**: It is insensitive to hyperparameters and performs more consistently.
- **Transferable**: It can be applied effectively to different datasets and tasks.

FASTopic introduces a new paradigm, Dual Semantic-relation Reconstruction (DSR), and regularizes the semantic relations with the Embedding Transport Plan (ETP) method. Specifically:

- **DSR paradigm**: It directly models the semantic relations among document embeddings, topic embeddings, and word embeddings, and discovers latent topics by reconstructing through these relations.
- **ETP method**: It models the semantic relations as optimal transport plans, which avoids the relation bias problem and yields more discriminative topics and more accurate document-topic distributions.

With these innovations, experiments on multiple benchmark datasets show that FASTopic outperforms existing state-of-the-art methods in efficiency, effectiveness, adaptivity, stability, and transferability.
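
To make the DSR and ETP ideas above more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the squared Euclidean cost, the uniform marginals, the Sinkhorn solver, the rescaling of plans into distributions, and the random vectors standing in for pretrained Transformer document embeddings are all assumptions, and the names `pairwise_sq_dist`, `sinkhorn_plan`, `epsilon`, and `n_iters` are hypothetical.

```python
import numpy as np

def pairwise_sq_dist(a, b):
    # Squared Euclidean distance between every row of a and every row of b.
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff * diff, axis=-1)

def sinkhorn_plan(cost, epsilon=0.5, n_iters=200):
    """Entropic-regularized optimal transport plan with uniform marginals.

    Assumption: the cost matrix is rescaled to [0, 1] for numerical safety.
    """
    cost = cost / cost.max()
    n, m = cost.shape
    row_mass = np.full(n, 1.0 / n)   # each source item carries equal mass
    col_mass = np.full(m, 1.0 / m)   # each target item receives equal mass
    kernel = np.exp(-cost / epsilon)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = row_mass / (kernel @ v)      # rescale rows to match row_mass
        v = col_mass / (kernel.T @ u)    # rescale columns to match col_mass
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
n_docs, n_topics, n_words, dim = 30, 5, 40, 16

# Stand-ins: in practice doc_emb would come from a pretrained Transformer,
# while topic_emb and word_emb would be learnable parameters.
doc_emb = rng.normal(size=(n_docs, dim))
topic_emb = rng.normal(size=(n_topics, dim))
word_emb = rng.normal(size=(n_words, dim))

# Two semantic relations, each expressed as a transport plan.
doc_topic_plan = sinkhorn_plan(pairwise_sq_dist(doc_emb, topic_emb))
topic_word_plan = sinkhorn_plan(pairwise_sq_dist(topic_emb, word_emb))

# Rescale plans so that each row is a probability distribution.
doc_topic = doc_topic_plan * n_docs        # document-topic distributions
topic_word = topic_word_plan * n_topics    # topic-word distributions

# Reconstruct a word distribution for each document from the two relations.
reconstruction = doc_topic @ topic_word
```

In a trainable version, the reconstruction would be compared against each document's observed word counts and the topic and word embeddings updated by gradient descent. The marginal constraints force every topic to receive document mass and every word to be covered by some topic, which is one plausible reading of how transport plans counter the relation bias mentioned above.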