Distributing the Stochastic Gradient Sampler for Large-Scale LDA.

Yuan Yang,Jianfei Chen,Jun Zhu
DOI: https://doi.org/10.1145/2939672.2939821
2016-01-01
Abstract:Learning large-scale Latent Dirichlet Allocation (LDA) models is beneficial for many applications that involve large collections of documents.Recent work has been focusing on developing distributed algorithms in the batch setting, while leaving stochastic methods behind, which can effectively explore statistical redundancy in big data and thereby are complementary to distributed computing.The distributed stochastic gradient Langevin dynamics (DSGLD) represents one attempt to combine stochastic sampling and distributed computing, but it suffers from drawbacks such as excessive communications and sensitivity to partitioning of datasets across nodes. DSGLD is typically limited to learn small models that have about 103 topics and $10^3$ vocabulary size. In this paper, we present embarrassingly parallel SGLD (EPSGLD), a novel distributed stochastic gradient sampling method for topic models. Our sampler is built upon a divide-and-conquer architecture which enables us to produce robust and asymptotically exact samples with less communication overhead than DSGLD. We further propose several techniques to reduce the overhead in I/O and memory usage. Experiments on Wikipedia and ClueWeb12 documents demonstrate that, EPSGLD can scale up to large models with 1010 parameters (i.e., 105 topics, 105 vocabulary size), four orders of magnitude larger than DSGLD, and converge faster.
What problem does this paper attempt to address?