Abstract: When building large-scale machine learning (ML) programs, such as big topic models or deep neural nets, one usually assumes such tasks can only be attempted with industrial-sized clusters with thousands of nodes, which are out of reach for most practitioners or academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens -- a scale not yet reported even with thousands of machines. Our major contributions include: 1) a new, highly efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) independent of model size, and which empirically converges nearly an order of magnitude faster than current state-of-the-art Gibbs samplers; 2) a structure-aware model-parallel scheme, which leverages dependencies within the topic model, yielding a sampling strategy that is frugal on machine memory and network communication; 3) a differential data structure for model storage, which uses separate data structures for high- and low-frequency words to allow extremely large models to fit in memory, while maintaining high inference speed; and 4) a bounded asynchronous data-parallel scheme, which allows efficient distributed processing of massive data via a parameter server. Our distribution strategy is an instance of the model-and-data-parallel programming model underlying the Petuum framework for general distributed ML, and was implemented on top of the Petuum open-source system. We provide experimental evidence showing how this development puts massive models within reach on a small cluster while still enjoying proportional time cost reductions with increasing cluster size, in comparison with alternative options.
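
To illustrate contribution 3, the following is a minimal sketch of a storage scheme that treats high- and low-frequency words differently. It is not the authors' implementation: the class name `HybridWordTopicTable`, the `dense_threshold` rule, and the choice of Python lists and dictionaries are all assumptions made for illustration. The motivation is simple arithmetic: a fully dense table of 10^6 words by 10^6 topics, assuming 4-byte counts, would need 4 × 10^12 bytes (about 4 TB), so rows for rare words, which touch few topics, are better kept sparse.

```python
from collections import defaultdict


class HybridWordTopicTable:
    """Word-topic counts: dense rows for frequent words, sparse maps for rare ones.

    Illustrative sketch only; names and the threshold rule are hypothetical.
    """

    def __init__(self, num_topics, word_freq, dense_threshold):
        self.num_topics = num_topics
        # Assumed rule: a word gets a dense row if its corpus frequency
        # reaches the threshold; every other word is stored sparsely.
        self.dense = {
            word: [0] * num_topics
            for word, freq in word_freq.items()
            if freq >= dense_threshold
        }
        self.sparse = defaultdict(dict)

    def add(self, word, topic, delta):
        """Add delta to the count of (word, topic)."""
        row = self.dense.get(word)
        if row is not None:
            row[topic] += delta  # O(1) array update for frequent words
            return
        counts = self.sparse[word]
        new_value = counts.get(topic, 0) + delta
        if new_value:
            counts[topic] = new_value
        else:
            counts.pop(topic, None)  # drop zero entries so rare rows stay small

    def get(self, word, topic):
        """Return the count of (word, topic), or 0 if it was never set."""
        row = self.dense.get(word)
        if row is not None:
            return row[topic]
        return self.sparse.get(word, {}).get(topic, 0)


# Usage with a toy vocabulary: "the" is frequent, "petuum" is rare.
table = HybridWordTopicTable(
    num_topics=4,
    word_freq={"the": 1000, "petuum": 2},
    dense_threshold=100,
)
table.add("the", 1, 3)
table.add("petuum", 2, 1)
table.add("petuum", 2, -1)  # count returns to zero and the entry is removed
```

The dense rows give constant-time access for the words that are sampled most often, while the sparse rows keep memory proportional to the number of nonzero counts for the long tail of the vocabulary.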

Parallel Topic Model and Its Application on Document Clustering

Document Clustering Based on Probabilistic Topic Model

Distributed Affinity Propagation Clustering Based on MapReduce

Mining Coherent Topics in Documents Using Word Embeddings and Large-Scale Text Data

ZenLDA: An Efficient and Scalable Topic Model Training System on Distributed Data-Parallel Platform

Model-Parallel Inference for Big Topic Models

LightLDA: Big Topic Models on Modest Compute Clusters

ZenLDA: Large-scale Topic Model Training on Distributed Data-Parallel Platform

Parallel Massive Clustering of Discrete Distributions

A Parallel Varied Density-Based Clustering Algorithm with Optimized Data Partition

Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling

A Comparative Study on Parallel LDA Algorithms in MapReduce Framework

Parallelization and Characterization of Probabilistic Latent Semantic Analysis

Large scale microblog mining using distributed MB-LDA

Parallel dynamic topic modeling via evolving topic adjustment and term weighting scheme

Distributing the Stochastic Gradient Sampler for Large-Scale LDA

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Scalable Training of Hierarchical Topic Models

Topic Detection from Conversational Dialogue Corpus with Parallel Dirichlet Allocation Model and Elbow Method

PLDA+: Parallel latent Dirichlet allocation with data placement and pipeline processing

Peacock: Learning Long-Tail Topic Features for Industrial Applications