Optimal Subsample Selection for Massive Logistic Regression with Distributed Data

Zuo Lulu, Zhang Haixiang, Wang HaiYing, Sun Liuquan
DOI: https://doi.org/10.1007/s00180-021-01089-0
IF: 1.4049
2021-01-01
Computational Statistics
Abstract: With the emergence of big data, it is increasingly common for data to be distributed, i.e., stored at many sites (machines or nodes) owing to data collection, business operations, and similar constraints. We propose a distributed subsampling procedure for this setting that efficiently approximates the maximum likelihood estimator for logistic regression. We establish the consistency and asymptotic normality of the subsample estimator given the full data. The optimal subsampling probabilities and optimal allocation sizes are obtained explicitly. We develop a two-step algorithm to approximate the optimal subsampling procedure. Numerical simulations and an application to airline data are presented to evaluate the performance of our subsampling method.
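To make the two-step idea concrete, the following is a minimal Python sketch of a pilot-then-resample scheme across several sites. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not give the optimal probabilities or allocation sizes, so the score |y_i - p_i| * ||x_i|| used here is borrowed from the general optimal-subsampling literature for logistic regression, and the allocation rule (multinomial over per-site score totals) is likewise an assumption. The names `weighted_logistic_mle` and `two_step_subsample` are hypothetical.

```python
import numpy as np


def weighted_logistic_mle(X, y, w, iters=50, tol=1e-8):
    """Weighted logistic-regression MLE via Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = np.clip(X @ beta, -30.0, 30.0)  # guard against overflow in exp
        p = 1.0 / (1.0 + np.exp(-eta))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def two_step_subsample(sites, r0, r, rng):
    """sites: list of (X_k, y_k) pairs, one per machine (hypothetical layout).

    Step 1: uniform pilot subsample of total size r0 gives a rough estimate.
    Step 2: each site scores its rows with the pilot estimate; r draws are
            allocated to sites and sampled with probabilities proportional
            to the scores (ASSUMED rule, not taken from the paper).
    """
    sizes = np.array([len(y) for _, y in sites])

    # Step 1: pilot sample, allocated proportionally to site size.
    r0_k = rng.multinomial(r0, sizes / sizes.sum())
    Xp, yp = [], []
    for (X, y), m in zip(sites, r0_k):
        idx = rng.choice(len(y), size=m, replace=True)
        Xp.append(X[idx])
        yp.append(y[idx])
    Xp, yp = np.vstack(Xp), np.concatenate(yp)
    beta0 = weighted_logistic_mle(Xp, yp, np.ones(len(yp)))

    # Step 2: each site computes its scores locally; only the totals
    # would need to be communicated to a central node.
    scores = []
    for X, y in sites:
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta0, -30.0, 30.0)))
        scores.append(np.abs(y - p) * np.linalg.norm(X, axis=1))
    totals = np.array([s.sum() for s in scores])
    grand_total = totals.sum()

    # A multinomial split over site totals is equivalent to sampling with
    # replacement from the pooled probabilities pi_i = score_i / grand_total.
    r_k = rng.multinomial(r, totals / grand_total)
    Xs, ys, ws = [], [], []
    for (X, y), s, S_k, m in zip(sites, scores, totals, r_k):
        idx = rng.choice(len(y), size=m, replace=True, p=s / S_k)
        Xs.append(X[idx])
        ys.append(y[idx])
        ws.append(grand_total / s[idx])  # inverse-probability weights 1/pi_i
    Xs, ys, ws = np.vstack(Xs), np.concatenate(ys), np.concatenate(ws)

    # Rescaling the weights does not change the maximizer; it only keeps
    # the Newton system well scaled.
    return weighted_logistic_mle(Xs, ys, ws / ws.mean())
```

Calling `two_step_subsample(sites, r0=500, r=2000, rng=np.random.default_rng(0))` on a list of per-site `(X, y)` arrays returns a coefficient vector. In this sketch only the pilot estimate and the per-site score totals cross machine boundaries, which is the kind of communication saving a distributed procedure aims for.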