Abstract: Modern statistical analysis often encounters very large datasets. For such datasets, conventional estimation methods often cannot be applied directly, because practitioners typically have limited computational resources and, in most cases, no access to powerful distributed computing systems (e.g., Hadoop or Spark). How to analyze large datasets in practice with limited computational resources therefore becomes a problem of great importance. To solve this problem, we propose a novel subsampling-based method with jackknifing. The key idea is to treat the whole sample as if it were the population. Multiple subsamples of greatly reduced size are then drawn by simple random sampling with replacement. Notably, we do not recommend sampling without replacement, because it would incur a significant data-processing cost when the data reside on the hard drive; no such cost arises when the data are processed in memory. Because the subsamples are relatively small, each can be read into computer memory as a whole and processed easily. From each subsample, a jackknife-debiased estimator of the target parameter is computed. The resulting estimators are statistically consistent, with an extremely small bias. Finally, the jackknife-debiased estimators from the different subsamples are averaged to form the final estimator. We show theoretically that the final estimator is consistent and asymptotically normal, and that its asymptotic statistical efficiency can be as good as that of the whole-sample estimator under very mild conditions. The proposed method is simple enough to be implemented on most practical computer systems and should therefore have very wide applicability.
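As a rough illustration of the procedure described in the abstract, the following Python sketch draws several small subsamples by simple random sampling with replacement, computes a jackknife-debiased estimate on each, and averages the results. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `jackknife_debias` and `subsample_jackknife`, the subsample size, the number of subsamples, and the choice of the plug-in variance as the target parameter are all hypothetical, and an in-memory NumPy array stands in for data that would in practice reside on the hard drive.

```python
import numpy as np


def jackknife_debias(sample, estimator):
    """Return the jackknife bias-corrected version of estimator(sample)."""
    n = len(sample)
    theta_full = estimator(sample)
    # Leave-one-out estimates: recompute the estimator with each point removed.
    loo = np.array([estimator(np.delete(sample, i)) for i in range(n)])
    # Standard jackknife correction: n * theta_hat - (n - 1) * mean(leave-one-out).
    return n * theta_full - (n - 1) * loo.mean()


def subsample_jackknife(data, estimator, subsample_size, num_subsamples, rng):
    """Average jackknife-debiased estimates over subsamples drawn with replacement."""
    total = len(data)
    estimates = []
    for _ in range(num_subsamples):
        # Simple random sampling WITH replacement: indices may repeat.
        idx = rng.integers(0, total, size=subsample_size)
        estimates.append(jackknife_debias(data[idx], estimator))
    return float(np.mean(estimates))


# Illustrative setup (assumed, not from the paper): the "whole sample" is
# 100,000 normal draws with true variance 4, and the target parameter is the
# variance, estimated by the biased plug-in estimator np.var (divisor n).
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=2.0, size=100_000)

final_estimate = subsample_jackknife(
    data, np.var, subsample_size=200, num_subsamples=50, rng=rng
)
print(final_estimate)
```

The plug-in variance is used here only because it gives a convenient sanity check: its jackknife-corrected version coincides with the usual unbiased sample variance (divisor n - 1), so the debiasing step can be verified directly on a small array.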

Optimal Subsampling Approaches for Large Sample Linear Regression

Optimal Subsampling for Large Sample Logistic Regression

Optimal Subsampling Algorithms for Big Data Regressions

Sample size determination for multidimensional parameters and the A-optimal subsampling in a big data linear regression model

Subsampling for Big Data Linear Models with Measurement Errors

Optimal Subsampling for Large-Scale Quantile Regression

Optimal Subsampling Algorithms for Big Data Generalized Linear Models

Information-Based Optimal Subdata Selection for Big Data Linear Regression

Robust and efficient subsampling algorithms for massive data logistic regression

Optimal Subsampling for Large Sample Ridge Regression

Orthogonal Subsampling for Big Data Linear Regression

Subsampled Optimization: Statistical Guarantees, Mean Squared Error Approximation, and Sampling Method

Optimal subsampling algorithm for the marginal model with large longitudinal data

Optimal subsampling for quantile regression in big data

Optimal subsampling for functional composite quantile regression in massive data

Optimal subsampling designs

Estimation and testing of expectile regression with efficient subsampling for massive data

Optimal subsampling algorithms for composite quantile regression in massive data

Sample Weighting: an Inherent Approach for Outlier Suppressing Discriminant Analysis

Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis with Limited Computational Resources