TopicGPT: A Prompt-based Topic Modeling Framework

Chau Minh Pham, Alexander Hoyle, Simeng Sun, Philip Resnik, Mohit Iyyer
2024-04-02
Abstract: Topic modeling is a well-established technique for exploring text corpora. Conventional topic models (e.g., LDA) represent topics as bags of words that often require "reading the tea leaves" to interpret; additionally, they offer users minimal control over the formatting and specificity of resulting topics. To tackle these issues, we introduce TopicGPT, a prompt-based framework that uses large language models (LLMs) to uncover latent topics in a text collection. TopicGPT produces topics that align better with human categorizations compared to competing methods: it achieves a harmonic mean purity of 0.74 against human-annotated Wikipedia topics compared to 0.64 for the strongest baseline. Its topics are also interpretable, dispensing with ambiguous bags of words in favor of topics with natural language labels and associated free-form descriptions. Moreover, the framework is highly adaptable, allowing users to specify constraints and modify topics without the need for model retraining. By streamlining access to high-quality and interpretable topics, TopicGPT represents a compelling, human-centered approach to topic modeling.
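The abstract reports a harmonic mean purity score without defining it. The sketch below assumes the common clustering definition: purity is the share of documents that fall in the majority human label of their predicted topic, inverse purity swaps the two roles, and the reported score is the harmonic mean of the two. The function names are illustrative and not taken from the paper's code.

```python
from collections import Counter


def purity(pred, gold):
    """Share of documents that carry the majority gold label of their predicted topic."""
    if len(pred) != len(gold) or not pred:
        raise ValueError("pred and gold must be non-empty and of equal length")
    by_topic = {}
    for p, g in zip(pred, gold):
        by_topic.setdefault(p, Counter())[g] += 1
    # For each predicted topic, count the documents in its most frequent gold label.
    return sum(max(c.values()) for c in by_topic.values()) / len(pred)


def harmonic_mean_purity(pred, gold):
    """Harmonic mean of purity and inverse purity (assumed definition)."""
    p = purity(pred, gold)
    inv = purity(gold, pred)  # inverse purity: swap the roles of the two labelings
    return 2 * p * inv / (p + inv)


# Toy example: four documents, two predicted topics, two human labels.
pred = ["a", "a", "b", "b"]
gold = ["x", "x", "x", "y"]
score = harmonic_mean_purity(pred, gold)
```

Purity alone rewards producing many small topics, and inverse purity alone rewards lumping everything into one topic, which is why a combined score is more informative than either value.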
Computation and Language
What problem does this paper attempt to address?
The paper addresses three limitations of traditional topic models such as LDA:

1. **Poor interpretability**: Traditional topic models represent each topic as a bag of words, which is difficult to understand and interpret directly.
2. **Lack of user control**: Existing methods give users little control over the format and level of detail of the generated topics.
3. **Alignment with human annotations**: The proposed method aims to produce topics that match human-annotated ground-truth topics more consistently and accurately.

To address these issues, the authors introduce the TopicGPT framework, which uses prompts to have large language models generate topics and assign them to documents. The framework improves topic quality, makes topics easier to interpret, and lets users customize and modify topics without retraining the model. Experimental results show that, compared to baselines such as LDA, SeededLDA, and BERTopic, TopicGPT achieves higher topic consistency and stability across multiple datasets.
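The two prompt-driven steps described above, generating topics and then assigning them to documents, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `llm` argument is a hypothetical callable that takes a prompt string and returns the model's text, and the prompt wording, function names, and the one-document-at-a-time loop are all invented for the example.

```python
from typing import Callable, List, Optional

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def generate_topics(docs: List[str], llm: LLM,
                    seed_topics: Optional[List[str]] = None) -> List[str]:
    """Grow a topic list by showing the model one document at a time (assumed flow)."""
    topics = list(seed_topics or [])
    for doc in docs:
        prompt = (
            "Known topics:\n" + "\n".join(topics) + "\n\n"
            "Document:\n" + doc + "\n\n"
            "Reply with one topic from the list, or propose a new one "
            "as 'Label: description'."
        )
        answer = llm(prompt).strip()
        # Keep only answers that are new; repeated topics are ignored.
        if answer and answer not in topics:
            topics.append(answer)
    return topics


def assign_topic(doc: str, topics: List[str], llm: LLM) -> Optional[str]:
    """Ask the model to pick one topic from a fixed list; None if the reply is off-list."""
    prompt = (
        "Topics:\n" + "\n".join(topics) + "\n\n"
        "Document:\n" + doc + "\n\n"
        "Reply with exactly one topic from the list, copied verbatim."
    )
    answer = llm(prompt).strip()
    return answer if answer in topics else None
```

Because the topic list is plain text, a user can edit, merge, or remove entries between the two steps and rerun only the assignment, which is one way to read the claim that topics can be modified without retraining.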