Supervised Stratified Subsampling for Predictive Analytics

Ming-Chung Chang
DOI: https://doi.org/10.1080/10618600.2024.2304075
2024-02-15
Journal of Computational and Graphical Statistics
Abstract: Predictive analytics involves the use of statistical models to make predictions; however, the power of these techniques is hindered by ever-increasing quantities of data. The richness and sheer volume of big data can have a profound effect on computation time and/or numerical stability. In the current study, we develop a novel approach to subsampling with the aim of overcoming this issue when dealing with regression problems in a supervised learning framework. The proposed method integrates stratified sampling and is model-independent. We assess the theoretical underpinnings of the proposed subsampling scheme, and demonstrate its efficacy in yielding reliable predictions with desirable robustness when applied to different statistical models. Supplementary materials for this article are available online.
statistics & probability