Optimizing Generalized Linear Models with Billions of Variables.
Yanbo Liang, Yongyang Yu, MingJie Tang, Chaozhuo Li, Weiqing Yang, Weichen Xu, Ruifeng Zheng
DOI: https://doi.org/10.1145/3269206.3272014
2018-01-01
Abstract: The use of large-scale machine learning (ML) is becoming ubiquitous in domains ranging from business intelligence to self-driving cars. Many companies build ML pipelines in a unified data processing environment and rely on well-tuned numerical optimization packages to obtain model parameters. However, most existing optimization tools are designed for a single-machine setup and cannot handle vast volumes of data. In this work, we build a distributed computing framework for optimizing generalized linear models with billions of variables. We first design a new distributed vector to represent data points from an extremely large feature space. Then, we introduce an efficient and scalable approach that computes the second-order derivatives of the loss function and optimizes model parameters with limited memory requirements. Experiments on real-world datasets demonstrate that our proposed techniques scale to ML models with billions of variables and achieve better performance than state-of-the-art systems on a wide range of applications, e.g., ad CTR prediction and rideshare price bidding.
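
The abstract names two ingredients: a vector partitioned across machines, and a memory-bounded way to use second-order information. The abstract does not specify the algorithm, so the sketch below swaps in the standard L-BFGS two-loop recursion as one well-known limited-memory technique, and should not be read as the authors' implementation. `PartitionedVector` and `lbfgs_direction` are hypothetical names introduced here, and the blocks live in one process rather than on a cluster. The sketch shows that the search direction needs only dot products and scaled additions, both of which decompose over index blocks.

```python
from typing import Dict, List


class PartitionedVector:
    """Hypothetical single-process stand-in for a distributed sparse vector.

    Entries are grouped into fixed-size index blocks; on a cluster each
    block could be held by a different worker (an assumption of this sketch).
    """

    def __init__(self, entries: Dict[int, float], block_size: int = 1_000_000):
        self.block_size = block_size
        self.blocks: Dict[int, Dict[int, float]] = {}
        for index, value in entries.items():
            self.blocks.setdefault(index // block_size, {})[index] = value

    def to_sparse(self) -> Dict[int, float]:
        """Flatten all blocks back into one index -> value mapping."""
        return {i: v for block in self.blocks.values() for i, v in block.items()}

    def dot(self, other: "PartitionedVector") -> float:
        """Sum of per-block partial dot products over blocks both vectors hold."""
        total = 0.0
        for block_id, block in self.blocks.items():
            other_block = other.blocks.get(block_id)
            if other_block is not None:
                total += sum(v * other_block[i] for i, v in block.items() if i in other_block)
        return total

    def axpy(self, alpha: float, other: "PartitionedVector") -> "PartitionedVector":
        """Return self + alpha * other as a new vector."""
        merged = self.to_sparse()
        for i, v in other.to_sparse().items():
            merged[i] = merged.get(i, 0.0) + alpha * v
        return PartitionedVector(merged, self.block_size)

    def scale(self, alpha: float) -> "PartitionedVector":
        """Return alpha * self as a new vector."""
        return PartitionedVector({i: alpha * v for i, v in self.to_sparse().items()}, self.block_size)


def lbfgs_direction(
    grad: PartitionedVector,
    s_history: List[PartitionedVector],  # parameter differences, oldest first
    y_history: List[PartitionedVector],  # gradient differences, oldest first
) -> PartitionedVector:
    """L-BFGS two-loop recursion: approximates -(inverse Hessian) * grad."""
    q = grad
    coeffs = []  # (rho, alpha) pairs, stored newest first
    for s, y in zip(reversed(s_history), reversed(y_history)):
        rho = 1.0 / y.dot(s)
        alpha = rho * s.dot(q)
        q = q.axpy(-alpha, y)
        coeffs.append((rho, alpha))
    if s_history:
        # Initial inverse-Hessian scaling taken from the newest (s, y) pair.
        s, y = s_history[-1], y_history[-1]
        q = q.scale(s.dot(y) / y.dot(y))
    for (s, y), (rho, alpha) in zip(zip(s_history, y_history), reversed(coeffs)):
        beta = rho * y.dot(q)
        q = q.axpy(alpha - beta, s)
    return q.scale(-1.0)
```

With an empty history the function returns the negative gradient, i.e. plain steepest descent. Memory grows with the number of stored `(s, y)` pairs rather than with the square of the feature dimension, which is the usual reason limited-memory methods are chosen for very wide models.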