Abstract: The task of discovering topics in text corpora has been dominated by Latent Dirichlet Allocation and other topic models for over a decade. To apply these approaches to massive text corpora, the vocabulary needs to be reduced considerably, and large computer clusters and/or GPUs are typically required. Moreover, the number of topics must be provided beforehand, yet it depends on the characteristics of the corpus and is often difficult to estimate, especially for massive text corpora. Unfortunately, both topic quality and time complexity are sensitive to this choice. This paper describes an alternative approach to topic discovery based on Min-Hashing, which can handle massive text corpora and large vocabularies using modest computer hardware and does not require fixing the number of topics in advance. The basic idea is to generate multiple random partitions of the corpus vocabulary to find sets of highly co-occurring words, which are then clustered to produce the final topics. In contrast to probabilistic topic models, where topics are distributions over the complete vocabulary, the topics discovered by the proposed approach are sets of highly co-occurring words. Interestingly, these topics correspond to various themes at different levels of granularity. An extensive qualitative and quantitative evaluation using the 20 Newsgroups (18K), Reuters (800K), Spanish Wikipedia (1M), and English Wikipedia (5M) corpora shows that the proposed approach is able to consistently discover meaningful and coherent topics. Remarkably, the time complexity of the proposed approach is linear with respect to both corpus size and vocabulary size; a non-parallel implementation was able to discover topics from the entire English edition of Wikipedia, with over 5 million documents and 1 million words, in less than 7 hours.
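The random-partition idea in the abstract can be illustrated with a small Python sketch. It is a toy illustration under stated assumptions, not the implementation evaluated in the paper: the function name `cooccurring_word_sets` and the parameters `num_tables` and `tuple_size` are hypothetical, each word is represented simply by the set of documents that contain it, and the final clustering of the word sets into topics is left out.

```python
import random
from collections import defaultdict


def cooccurring_word_sets(docs, num_tables=50, tuple_size=2, seed=0):
    """Toy Min-Hashing sketch (hypothetical helper, not the paper's code).

    Returns the sets of words that fall into the same hash bucket in at
    least one random partition of the vocabulary.
    """
    rng = random.Random(seed)

    # Inverted index: word -> ids of the documents that contain it.
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in set(text.lower().split()):
            postings[word].add(doc_id)

    n_docs = len(docs)
    found = set()
    for _ in range(num_tables):
        # Each table uses `tuple_size` random permutations of the document ids.
        perms = []
        for _ in range(tuple_size):
            ranks = list(range(n_docs))
            rng.shuffle(ranks)
            perms.append(ranks)

        # A word's key is the tuple of its min-hash values. Words that occur
        # in similar sets of documents are likely to share a key, so every
        # table induces one random partition of the vocabulary.
        buckets = defaultdict(set)
        for word, doc_ids in postings.items():
            key = tuple(min(perm[d] for d in doc_ids) for perm in perms)
            buckets[key].add(word)

        # Keep only the buckets that group two or more words.
        for words in buckets.values():
            if len(words) > 1:
                found.add(frozenset(words))
    return found


if __name__ == "__main__":
    toy_docs = ["apple banana", "apple banana", "car engine", "car engine"]
    for word_set in cooccurring_word_sets(toy_docs):
        print(sorted(word_set))
```

In this sketch, words that occur in exactly the same documents always share a bucket, and words that never share a document cannot collide, because each permutation assigns distinct ranks to distinct documents. Larger values of `tuple_size` make collisions between loosely related words less likely, while larger values of `num_tables` give related words more chances to meet.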

Using Topic Modeling for Code Discovery in Large Scale Text Data

Mining Coherent Topics in Documents Using Word Embeddings and Large-Scale Text Data

Crime topic modeling

Topic Modeling Using Distributed Word Embeddings

Using Topic Modeling Methods for Short-Text Data: A Comparative Analysis

Topic modeling, long texts and the best number of topics. Some Problems and solutions

Interactive Topic Modeling Based on Hierarchical Dirichlet Process

BTM: Topic Modeling over Short Texts

A hybrid deep learning method for identifying topics in large-scale urban text data: Benefits and trade-offs

Topic Discovery in Massive Text Corpora Based on Min-Hashing

Mining Cohesive Domain Topics from Source Code

Automatic deductive coding in discourse analysis: an application of large language models in learning analytics

Hierarchical Latent Semantic Mapping for Automated Topic Generation

News Topic Discovery Through Community Detection

Analyses of Multi-collection Corpora via Compound Topic Modeling

Utilizing Recurrent Neural Network for Topic Discovery in Short Text Scenarios

An Examination of the Use of Large Language Models to Aid Analysis of Textual Data

Obtaining Functional Topics from Source Code Based on Topic Modeling and Static Analysis

Topic Modeling over Short Texts by Incorporating Word Embeddings

Computer-Assisted Text Analysis for Social Science: Topic Models and Beyond