Abstract: We present LDA*, a system that has been deployed in one of the largest Internet companies to fulfil their requirements of "topic modeling as an internal service": relying on thousands of machines, engineers in different sectors submit their data, some as large as 1.8 TB, to LDA* and get results back in hours. LDA* is motivated by the observation that none of the existing topic modeling systems is robust enough: each of these systems is designed for a specific point in the tradeoff space, which can be sub-optimal, sometimes by up to 10×, across workloads.

Our first contribution is a systematic study of all recently proposed samplers: AliasLDA, F+LDA, LightLDA, and WarpLDA. We discovered a novel system tradeoff among these samplers. Each sampler has a different sampling complexity and performs differently, sometimes by 5×, on documents of different lengths. Based on this tradeoff, we further developed a hybrid sampler that uses different samplers for different types of documents. This hybrid approach works across a wide range of workloads and outperforms the fastest sampler by up to 2×.

We then focused on distributed environments in which thousands of workers, each with different performance (due to virtualization and resource sharing), coordinate to train a topic model. Our second contribution is an asymmetric parameter server architecture that pushes some computation to the parameter server side. This architecture is motivated by the skew of the word frequency distribution and a novel tradeoff we discovered between communication and computation. With this architecture, we outperform the traditional, symmetric architecture by up to 2×.

With these two contributions, together with a carefully engineered implementation, our system is able to outperform existing systems by up to 10× and has already been running to provide topic modeling services for more than six months.
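The hybrid sampler idea described in the abstract (choosing a sampler per document based on its length) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not say which sampler serves which document type or where the cutoff lies, so `short_doc_sampler`, `long_doc_sampler`, and `length_threshold` are hypothetical placeholders, and the stub samplers only return a label instead of drawing topic assignments.

```python
from typing import Callable, Sequence

# A document is a sequence of token ids (hypothetical representation).
Document = Sequence[int]
# A sampler takes a document and returns a result; here just a label.
Sampler = Callable[[Document], str]


def short_doc_sampler(doc: Document) -> str:
    """Stub standing in for whichever sampler suits short documents."""
    return "short"


def long_doc_sampler(doc: Document) -> str:
    """Stub standing in for whichever sampler suits long documents."""
    return "long"


def make_hybrid_sampler(
    short_sampler: Sampler,
    long_sampler: Sampler,
    length_threshold: int,
) -> Sampler:
    """Build a sampler that dispatches on document length.

    length_threshold is a hypothetical tuning knob; the abstract
    gives no value for it.
    """

    def hybrid(doc: Document) -> str:
        # Route each document to the sampler assumed to be cheaper
        # for its length class.
        if len(doc) < length_threshold:
            return short_sampler(doc)
        return long_sampler(doc)

    return hybrid


hybrid = make_hybrid_sampler(short_doc_sampler, long_doc_sampler, length_threshold=4)
corpus = [[1, 2], [3, 4, 5, 6, 7], [8]]
routes = [hybrid(doc) for doc in corpus]
```

In a real system the two stubs would be replaced by actual sampler implementations, and the threshold would be chosen empirically per workload.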
