Abstract: Topic models are a class of unsupervised learning algorithms for detecting the semantic structure within a text corpus. Together with a subsequent dimensionality reduction algorithm, topic models can be used for deriving spatializations for text corpora as two-dimensional scatter plots, reflecting semantic similarity between the documents and supporting corpus analysis. Although the choice of the topic model, the dimensionality reduction, and their underlying hyperparameters significantly impact the resulting layout, it is unknown which particular combinations result in high-quality layouts with respect to accuracy and perception metrics. To investigate the effectiveness of topic models and dimensionality reduction methods for the spatialization of corpora as two-dimensional scatter plots (or as a basis for landscape-type visualizations), we present a large-scale, benchmark-based computational evaluation. Our evaluation consists of (1) a set of corpora, (2) a set of layout algorithms that are combinations of topic models and dimensionality reductions, and (3) quality metrics for quantifying the quality of the resulting layouts. The corpora are given as document-term matrices, and each document is assigned to a thematic class. The chosen metrics quantify the preservation of local and global properties and the perceptual effectiveness of the two-dimensional scatter plots. By evaluating the benchmark on a computing cluster, we derived a multivariate dataset with over 45 000 individual layouts and corresponding quality metrics. Based on the results, we propose guidelines for the effective design of text spatializations that are based on topic models and dimensionality reductions. As a main result, we show that interpretable topic models are beneficial for capturing the structure of text corpora. We furthermore recommend the use of t-SNE as a subsequent dimensionality reduction.
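The abstract does not name its quality metrics, so the following is only an illustrative sketch of what a "preservation of local properties" metric could look like: the mean fraction of each document's k nearest neighbors in the topic space that remain among its k nearest neighbors in the 2D layout. The function names (`knn_indices`, `knn_preservation`) and the toy document-topic vectors are hypothetical and are not taken from the paper.

```python
import math

def knn_indices(points, k):
    """For each point, return the set of indices of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        # Sort all other points by distance to p and keep the k closest.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def knn_preservation(high_dim, layout_2d, k):
    """Mean fraction of k nearest neighbors shared between the two spaces (0.0 to 1.0)."""
    high = knn_indices(high_dim, k)
    low = knn_indices(layout_2d, k)
    shared = sum(len(h & l) for h, l in zip(high, low))
    return shared / (k * len(high_dim))

# Hypothetical document-topic vectors: documents 0 and 1 share one dominant
# topic, documents 2 and 3 share another.
doc_topics = [(0.9, 0.1, 0.0), (0.8, 0.2, 0.0), (0.1, 0.1, 0.8), (0.0, 0.2, 0.8)]

# Two hypothetical 2D layouts: one keeps similar documents together,
# the other places each document next to a dissimilar one.
faithful_layout = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
scrambled_layout = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]

faithful_score = knn_preservation(doc_topics, faithful_layout, k=1)
scrambled_score = knn_preservation(doc_topics, scrambled_layout, k=1)
```

A score near 1 indicates that local neighborhoods survive the projection, while a score near 0 indicates that they do not. This brute-force version compares all pairs of documents and is meant only for small examples.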

Topics in the Haystack: Enhancing Topic Quality through Corpus Expansion

Mining Coherent Topics in Documents Using Word Embeddings and Large-Scale Text Data

CAST: Corpus-Aware Self-similarity Enhanced Topic modelling

Topic Discovery in Massive Text Corpora Based on Min-Hashing

Parsimonious Topic Models with Salient Word Discovery

Topic Modeling Using Distributed Word Embeddings

Refine the Corpora Based on Document Manifold

Topic Modeling over Short Texts by Incorporating Word Embeddings

"Draw My Topics": Find Desired Topics fast from large scale of Corpus

Analyses of Multi-collection Corpora via Compound Topic Modeling

Large-Scale Evaluation of Topic Models and Dimensionality Reduction Methods for 2D Text Spatialization

A Cluster Guided Topic Model for Social Query Expansion

LLM Reading Tea Leaves: Automatically Evaluating Topic Models with Large Language Models

A Topic Model for Hierarchical Documents

Enhanced Short Text Modeling: Leveraging Large Language Models for Topic Refinement

Expansive data, extensive model: Investigating discussion topics around LLM through unsupervised machine learning in academic papers and news

Prompting Large Language Models for Topic Modeling

An NLP approach to quantify dynamic salience of predefined topics in a text corpus

Enhancing Short-Text Topic Modeling with LLM-Driven Context Expansion and Prefix-Tuned VAEs

Qualitative Insights Tool (QualIT): LLM Enhanced Topic Modeling

Contextual-LDA: A Context Coherent Latent Topic Model for Mining Large Corpora