Abstract: We present LDA*, a system that has been deployed in one of the largest Internet companies to fulfil their requirement of "topic modeling as an internal service": relying on thousands of machines, engineers in different sectors submit their data, some of it as large as 1.8 TB, to LDA* and get results back in hours. LDA* is motivated by the observation that none of the existing topic modeling systems is robust enough. Each is designed for a specific point in the tradeoff space and can be sub-optimal, sometimes by up to 10×, across workloads.

Our first contribution is a systematic study of all recently proposed samplers: AliasLDA, F+LDA, LightLDA, and WarpLDA. We discovered a novel system tradeoff among these samplers. Each sampler has a different sampling complexity and performs differently, sometimes by 5×, on documents of different lengths. Based on this tradeoff, we further developed a hybrid sampler that uses different samplers for different types of documents. This hybrid approach works across a wide range of workloads and outperforms the fastest sampler by up to 2×. We then focused on distributed environments in which thousands of workers, each with different performance (due to virtualization and resource sharing), coordinate to train a topic model. Our second contribution is an asymmetric parameter server architecture that pushes some computation to the parameter server side. This architecture is motivated by the skew of the word frequency distribution and a novel tradeoff we discovered between communication and computation. With this architecture, we outperform the traditional, symmetric architecture by up to 2×.

With these two contributions, together with a carefully engineered implementation, our system is able to outperform existing systems by up to 10× and has already been running to provide topic modeling services for more than six months.
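The hybrid-sampler idea in the abstract (different samplers for different types of documents) can be illustrated with a minimal dispatch sketch. Everything named here is an assumption for illustration: the abstract does not state the routing rule, so `length_threshold`, `route_documents`, `hybrid_pass`, and the two sampler callables are hypothetical placeholders, and the sketch does not claim which of AliasLDA, F+LDA, LightLDA, or WarpLDA serves which group.

```python
from typing import Callable, Dict, List, Sequence

Document = Sequence[int]             # a document as a sequence of token ids
Sampler = Callable[[Document], str]  # placeholder: a real sampler would update topic assignments


def route_documents(docs: Sequence[Document], length_threshold: int) -> Dict[str, List[int]]:
    """Split document indices into 'short' and 'long' groups by token count.

    length_threshold is a hypothetical tuning knob, not a value from the paper.
    """
    groups: Dict[str, List[int]] = {"short": [], "long": []}
    for idx, doc in enumerate(docs):
        key = "short" if len(doc) < length_threshold else "long"
        groups[key].append(idx)
    return groups


def hybrid_pass(
    docs: Sequence[Document],
    short_sampler: Sampler,
    long_sampler: Sampler,
    length_threshold: int,
) -> Dict[int, str]:
    """Run one pass, sending each document to the sampler chosen for its group."""
    groups = route_documents(docs, length_threshold)
    results: Dict[int, str] = {}
    for idx in groups["short"]:
        results[idx] = short_sampler(docs[idx])
    for idx in groups["long"]:
        results[idx] = long_sampler(docs[idx])
    return results


if __name__ == "__main__":
    corpus = [[1, 2], [1, 2, 3, 4, 5], [7]]
    # The lambdas stand in for real samplers and only report which path was taken.
    print(hybrid_pass(corpus, lambda d: "sampler_A", lambda d: "sampler_B", length_threshold=3))
```

Routing on document length alone is the simplest rule consistent with the abstract's remark that samplers perform differently on documents of different lengths; the deployed system may use other criteria.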

High Performance LDA Through Collective Model Communication Optimization

HarpLDA+: Optimizing Latent Dirichlet Allocation for Parallel Efficiency

WBSP: Addressing Stragglers in Distributed Machine Learning with Worker-Busy Synchronous Parallel

ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training

Online Learning Algorithm for Collective LDA

A Comparative Study on Parallel LDA Algorithms in MapReduce Framework

DiLoCo: Distributed Low-Communication Training of Language Models

ZenLDA: Large-scale Topic Model Training on Distributed Data-Parallel Platform

High-Speed Data Communication with Advanced Networks in Large Language Model Training

Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach

PLDA+: Parallel Latent Dirichlet Allocation with Data Placement and Pipeline Processing

ZenLDA: An Efficient and Scalable Topic Model Training System on Distributed Data-Parallel Platform

LDA*: A Robust and Large-scale Topic Modeling System

Optimizing Irregular Communication with Neighborhood Collectives and Locality-Aware Parallelism

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

On Optimizing the Communication of Model Parallelism

LightLDA: Big Topic Models on Modest Compute Clusters

Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism

Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution

A Survey of Methods for Collective Communication Optimization and Tuning

HCEC: An efficient geo-distributed deep learning training strategy based on wait-free back-propagation