MLlib*: Fast Training of GLMs Using Spark MLlib

Zhipeng Zhang,Jiawei Jiang,Wentao Wu,Ce Zhang,Lele Yu,Bin Cui
DOI: https://doi.org/10.1109/ICDE.2019.00194
2019-01-01
Abstract:In Tencent Inc., more than 80% of the data are extracted and transformed using Spark. However, the commonly used machine learning systems are TensorFlow, XGBoost, and Angel, whereas Spark MLlib, an official Spark package for machine learning, is seldom used. One reason for this ignorance is that it is generally believed that Spark is slow when it comes to distributed machine learning. Users therefore have to undergo the painful procedure of moving data in and out of Spark. The question why Spark is slow, however, remains elusive. In this paper, we study the performance of MLlib with a focus on training generalized linear models using gradient descent. Based on a detailed examination, we identify two bottlenecks in MLlib, i.e., pattern of model update and pattern of communication. To address these two bottlenecks, we tweak the implementation of MLlib with two state-of-the-art and well-known techniques, model averaging and AllReduce. We show that, the new system that we call MLlib*, can significantly improve over MLlib and achieve similar or even better performance than other specialized distributed machine learning systems (such as Petuum and Angel), on both public and Tencent's workloads.
What problem does this paper attempt to address?