Distributed Subdata Selection for Big Data Via Sampling-Based Approach

Haixiang Zhang, HaiYing Wang
DOI: https://doi.org/10.1016/j.csda.2020.107072
IF: 2.035
2021-01-01
Computational Statistics & Data Analysis
Abstract: With the development of modern technologies, it is possible to gather an extraordinarily large number of observations. Due to the storage or transmission burden, big data are usually scattered across multiple locations, and it is difficult to transfer all of the data to a central server for analysis. A distributed subdata selection method for the big data linear regression model is proposed. In particular, a two-step subsampling strategy with optimal subsampling probabilities and optimal allocation sizes is developed. The subsample-based estimator effectively approximates the ordinary least squares estimator from the full data. The convergence rate and asymptotic normality of the proposed estimator are established. Simulation studies and an illustrative example using airline data are provided to assess the performance of the proposed method.
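The two-step idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state the optimal subsampling probabilities or the allocation rule, so everything specific here is an assumption made for illustration: the function names (`pilot_estimate`, `distributed_subsample_ols`), the score `|residual| * ||x||` used for the second-step probabilities, and the allocation of subsample sizes in proportion to each block's total score. It should not be read as the authors' method. Each block is assumed to be a pair `(X, y)` held at one location, and only small summary matrices are sent to the central server.

```python
import numpy as np


def pilot_estimate(blocks, r0, rng):
    """Step 1: draw r0 rows uniformly from each block and fit a pooled OLS."""
    xs, ys = [], []
    for X, y in blocks:
        idx = rng.choice(len(y), size=r0, replace=True)
        xs.append(X[idx])
        ys.append(y[idx])
    Xp, yp = np.vstack(xs), np.concatenate(ys)
    return np.linalg.lstsq(Xp, yp, rcond=None)[0]


def distributed_subsample_ols(blocks, r0, r, seed=0):
    """Step 2: nonuniform subsampling per block, then a weighted LS fit.

    The score |residual| * ||x|| and the proportional allocation below are
    assumptions for illustration, not the rule derived in the paper.
    """
    rng = np.random.default_rng(seed)
    beta0 = pilot_estimate(blocks, r0, rng)

    # Per-row scores from the pilot fit; a tiny constant avoids zero probabilities.
    scores = [
        np.abs(y - X @ beta0) * np.linalg.norm(X, axis=1) + 1e-12
        for X, y in blocks
    ]
    totals = np.array([s.sum() for s in scores])

    # Assumed allocation: subsample sizes proportional to each block's total score.
    alloc = np.maximum(1, np.round(r * totals / totals.sum()).astype(int))

    d = blocks[0][0].shape[1]
    xtwx = np.zeros((d, d))
    xtwy = np.zeros(d)
    for (X, y), s, r_k in zip(blocks, scores, alloc):
        p = s / s.sum()  # sampling probabilities within this block
        idx = rng.choice(len(y), size=r_k, replace=True, p=p)
        w = 1.0 / (r_k * p[idx])  # inverse-probability weights
        Xs, ys = X[idx], y[idx]
        # Each block would send only these d x d and d x 1 summaries.
        xtwx += Xs.T @ (w[:, None] * Xs)
        xtwy += Xs.T @ (w * ys)

    return np.linalg.solve(xtwx, xtwy)
```

The inverse-probability weights make each block's weighted sums unbiased estimates of that block's contribution to the full-data normal equations, so the central server can combine the summaries without seeing any raw rows.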