A Comparative Study on Parallel LDA Algorithms in MapReduce Framework

Yang Gao, Zhenlong Sun, Yi Wang, Xiaosheng Liu, Jianfeng Yan, Jia Zeng
DOI: https://doi.org/10.1007/978-3-319-18032-8_53
2015-01-01
Abstract: Although several parallel latent Dirichlet allocation (LDA) algorithms have been implemented to extract topic features from large-scale text data sets, very few studies compare their performance in real-world industrial applications. In this paper, we build a novel multi-channel MapReduce framework to fairly compare three representative parallel LDA algorithms: parallel variational Bayes (PVB), parallel Gibbs sampling (PGS), and parallel belief propagation (PBP). Experimental results confirm that PGS yields the best application performance in the search engine and online advertising systems of Tencent, one of the biggest Internet companies in China, while PBP has the highest topic modeling accuracy. Moreover, PGS is more scalable in the MapReduce framework than PVB and PBP because of its low memory usage and efficient sampling technique.