Communication-efficient Estimation for Distributed Subset Selection

Yan Chen, Ruipeng Dong, Canhong Wen
DOI: https://doi.org/10.1007/s11222-023-10302-7
IF: 2.3241
2023-01-01
Statistics and Computing
Abstract: Because both the sample size and the dimension of modern data are large, such data are usually stored in a distributed system, which poses unprecedented challenges for computation and statistical inference. Best subset selection is widely regarded as a benchmark method for handling high-dimensional data. However, efficient algorithms for best subset selection in a distributed system have received little study. To this end, we propose a new communication-efficient method for best subset selection in a distributed system. The proposed method restricts the information communicated among local machines to a moderate-sized active set, which leads not only to efficient computation but also to a lower communication cost across the network of the distributed system. Moreover, we propose a new generalized information criterion for tuning the sparsity level on the central machine. Under mild conditions, we establish the consistency of estimation and variable selection for the proposed estimator. We demonstrate the superiority of the proposed method through several numerical studies and a real data application in adolescent health.
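The idea of restricting communication to an active set can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a linear model with squared-error loss, picks the active set by a single round of averaged marginal correlations, and then runs iterative hard thresholding in which each machine sends only a gradient of length `active_size` per round. The function names, the screening rule, the step size, and the fixed sparsity level `s` are all assumptions made for illustration; the paper's generalized information criterion for choosing the sparsity level is not reproduced here.

```python
import numpy as np

def screen_active_set(shards, active_size):
    """One full-length round (assumed screening rule): each machine sends
    X^T y (p numbers); the centre averages them and keeps the
    `active_size` coordinates with the largest magnitude."""
    n_total = sum(len(y) for _, y in shards)
    corr = sum(X.T @ y for X, y in shards) / n_total
    return np.sort(np.argsort(-np.abs(corr))[:active_size])

def local_gradient(X, y, beta_active, active):
    """Runs on one machine: unnormalised descent direction of the local
    squared-error loss, restricted to the active coordinates.
    Only this short vector is communicated."""
    resid = y - X[:, active] @ beta_active
    return X[:, active].T @ resid

def distributed_subset_fit(shards, s, active_size, n_iter=200, step=0.5):
    """Hypothetical driver on the central machine: iterative hard
    thresholding to `s` nonzeros inside the active set."""
    p = shards[0][0].shape[1]
    n_total = sum(len(y) for _, y in shards)
    active = screen_active_set(shards, active_size)
    b = np.zeros(active_size)
    for _ in range(n_iter):
        # Aggregate the short gradients received from all machines.
        grad = sum(local_gradient(X, y, b, active) for X, y in shards) / n_total
        b = b + step * grad
        # Hard threshold: keep the s largest entries, zero out the rest.
        keep = np.argsort(-np.abs(b))[:s]
        mask = np.zeros(active_size, dtype=bool)
        mask[keep] = True
        b[~mask] = 0.0
    beta = np.zeros(p)
    beta[active] = b
    return beta, active
```

Here `shards` is a list of `(X, y)` pairs, one per local machine. After the initial screening round, each iteration moves `active_size` numbers per machine rather than `p`, which is where the communication saving in this sketch comes from.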