Distributed adaptive lasso penalized generalized linear models for big data

Ye Fan, Suning Fan
DOI: https://doi.org/10.1080/03610918.2021.1888998
2021-03-04
Abstract: Adaptive lasso penalized generalized linear models (GLMs) are a powerful tool for analyzing high-dimensional sparse data when the classical linearity or normality assumptions are not met. In non-distributed environments, the estimation problem for adaptive lasso penalized GLMs is often solved by the coordinate descent based algorithm developed in Friedman, Hastie, and Tibshirani (2010), which is implemented in the R package glmnet. However, when applied to distributed big data, this algorithm is usually inflexible or even infeasible because its implementation is not parallel, especially when communication between the central and local machines is expensive, or when the storage and computing capabilities of the central machine are insufficient. In this paper, we propose a new method, QAGLM-alasso, for the adaptive lasso penalized GLM problem in distributed big data. The method applies the quadratic approximation representation of GLMs, and we further develop a path-following algorithm for its estimation based on Least Angle Regression (LARS). Theoretical analyses show that, under mild regularity conditions, QAGLM-alasso enjoys the oracle property and the resulting estimator is asymptotically equivalent to the original adaptive lasso estimator. Simulation studies demonstrate that the new algorithm achieves estimation accuracy similar to that of glmnet but is significantly faster than glmnet in distributed environments. We further illustrate the practical performance of the proposed method by analyzing a supersymmetric (SUSY) benchmark data set.
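
The general idea of combining a quadratic approximation with an adaptive lasso penalty in a distributed setting can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' QAGLM-alasso algorithm: it assumes a logistic GLM without intercept, obtains the unpenalized estimate by Newton iterations in which each local machine sends only its gradient and Hessian to the central machine, and then minimizes the penalized quadratic surrogate by plain coordinate descent rather than the LARS-based path-following algorithm described in the abstract. All function names (local_stats, distributed_mle, adaptive_lasso_quadratic) and the choice of penalty level are hypothetical.

```python
import numpy as np


def local_stats(X, y, beta):
    """Gradient and Hessian of the logistic negative log-likelihood on one shard."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    grad = X.T @ (p - y)
    hess = X.T @ (X * (p * (1.0 - p))[:, None])
    return grad, hess


def aggregate(shards, beta):
    """Central machine: sum the per-shard summaries (no raw data is moved)."""
    stats = [local_stats(X, y, beta) for X, y in shards]
    grad = sum(g for g, _ in stats)
    hess = sum(h for _, h in stats)
    return grad, hess


def distributed_mle(shards, dim, n_iter=25):
    """Unpenalized estimate via Newton steps on the aggregated summaries."""
    beta = np.zeros(dim)
    for _ in range(n_iter):
        grad, hess = aggregate(shards, beta)
        beta = beta - np.linalg.solve(hess, grad)
    _, hess = aggregate(shards, beta)  # Hessian at the final estimate
    return beta, hess


def adaptive_lasso_quadratic(beta_hat, hess, lam, n_sweeps=200):
    """Coordinate descent (assumed here, not LARS) for
    0.5 * (b - beta_hat)' H (b - beta_hat) + lam * sum_j |b_j| / |beta_hat_j|.
    """
    weights = 1.0 / np.abs(beta_hat)  # adaptive weights from the initial estimate
    b = beta_hat.copy()
    for _ in range(n_sweeps):
        for j in range(len(b)):
            # Stationarity in coordinate j with all other coordinates held fixed.
            r = hess[j] @ (beta_hat - b) + hess[j, j] * b[j]
            b[j] = np.sign(r) * max(abs(r) - lam * weights[j], 0.0) / hess[j, j]
    return b


if __name__ == "__main__":
    # Synthetic data split across four hypothetical local machines.
    rng = np.random.default_rng(0)
    n, dim = 8000, 6
    beta_true = np.array([1.5, -1.0, 0.0, 0.0, 0.0, 0.0])
    X = rng.standard_normal((n, dim))
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)
    shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

    beta_hat, hess = distributed_mle(shards, dim)
    print(adaptive_lasso_quadratic(beta_hat, hess, lam=30.0))
```

In this sketch the only quantities communicated are a gradient vector and a Hessian matrix per shard and per Newton step, which is what makes the penalized step cheap on the central machine: it operates on a dim-by-dim summary rather than on the full data.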
What problem does this paper attempt to address?