U-learning for Prediction Inference via Combinatory Multi-Subsampling: With Applications to LASSO and Neural Networks

Zhe Fei, Yi Li
2024-07-22
Abstract: Epigenetic aging clocks play a pivotal role in estimating an individual's biological age through the examination of DNA methylation patterns at numerous CpG (Cytosine-phosphate-Guanine) sites within their genome. However, making valid inferences on predicted epigenetic ages, or more broadly, on predictions derived from high-dimensional inputs, presents challenges. We introduce a novel U-learning approach via combinatory multi-subsampling for making ensemble predictions and constructing confidence intervals for predictions of continuous outcomes when traditional asymptotic methods are not applicable. More specifically, our approach conceptualizes the ensemble estimators within the framework of generalized U-statistics and invokes the Hájek projection for deriving the variances of predictions and constructing confidence intervals with valid conditional coverage probabilities. We apply our approach to two commonly used predictive algorithms, Lasso and deep neural networks (DNNs), and illustrate the validity of inferences with extensive numerical studies. We have applied these methods to predict the DNA methylation age (DNAmAge) of patients with various health conditions, aiming to accurately characterize the aging process and potentially guide anti-aging interventions.
Machine Learning, Statistics Theory, Quantitative Methods
What problem does this paper attempt to address?
The paper addresses how to quantify the uncertainty of predictions made from high-dimensional inputs, particularly when the predictor is a complex machine learning model such as Lasso regression or a deep neural network (DNN). The motivating application is estimating an individual's epigenetic age, or DNA methylation age (DNAmAge), from DNA methylation patterns at many CpG sites in the genome. Because traditional asymptotic methods may not apply in this setting, the authors propose a U-learning method based on combinatory multi-subsampling (CMS) to construct confidence intervals for predictions of continuous outcomes. The core contributions of the paper are:

1. **U-learning method**: The ensemble predictor is treated as a generalized U-statistic, and the Hájek projection is used to derive the variance of its predictions. The resulting confidence intervals have valid conditional coverage probabilities. The approach applies to both Lasso regression and DNNs, since the variance estimate does not depend on the form of the underlying predictive algorithm.
2. **Theoretical guarantees**: The paper proves consistency and asymptotic normality of the U-learning predictions, and consistency of the variance estimates used to construct the confidence intervals. These results are the basis for valid statistical inference on predictions.
3. **Application examples**: The method is applied to predict the DNA methylation age of patients with various health conditions, with the goal of accurately characterizing the aging process and potentially guiding anti-aging interventions.
4. **Numerical studies**: Extensive numerical studies with Lasso and DNN base learners illustrate the validity of the inferences, in particular the coverage probability of the confidence intervals.
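The subsample-and-aggregate idea behind contribution 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the one-feature least-squares base learner stands in for Lasso or a DNN, the variance uses an infinitesimal-jackknife covariance formula (one common way to estimate the Hájek projection variance of a subsampled ensemble, assumed here and possibly different from the paper's estimator), and all function names, parameters, and the toy data are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def fit_predict_ols(xs, ys, x0):
    """Stand-in base learner: one-feature least squares, evaluated at x0."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


def subsample_ensemble_ci(xs, ys, x0, r, n_subsamples, level=0.95, seed=0):
    """Return (prediction, lower, upper) for the subsampled ensemble at x0."""
    n = len(xs)
    rng = random.Random(seed)
    subsamples, preds = [], []
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), r)  # size-r subsample, without replacement
        subsamples.append(idx)
        preds.append(
            fit_predict_ols([xs[i] for i in idx], [ys[i] for i in idx], x0)
        )
    center = fmean(preds)  # ensemble prediction: average over subsamples

    # cov[i] = covariance, across subsamples, between the indicator
    # "observation i was included" and the subsample prediction.
    cov = [0.0] * n
    for idx, pred in zip(subsamples, preds):
        for i in idx:
            cov[i] += (pred - center) / n_subsamples

    # Infinitesimal-jackknife variance with the finite-sample factor for
    # subsampling without replacement (assumed form, no Monte Carlo correction).
    variance = n * (n - 1) / (n - r) ** 2 * sum(c * c for c in cov)
    half_width = NormalDist().inv_cdf(0.5 + level / 2) * variance ** 0.5
    return center, center - half_width, center + half_width


# Toy data (hypothetical): y = 1 + 2x + noise, predict at x0 = 0.3.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1, 1) for _ in range(60)]
ys = [1 + 2 * x + data_rng.gauss(0, 0.5) for x in xs]
pred, lower, upper = subsample_ensemble_ci(xs, ys, x0=0.3, r=30, n_subsamples=500)
print(f"prediction={pred:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

With a finite number of subsamples, this covariance-based estimate carries extra Monte Carlo noise that tends to inflate the variance, so in practice one would draw more subsamples or subtract a bias correction. Swapping `fit_predict_ols` for any other fit-then-predict routine leaves the rest of the sketch unchanged, which is the sense in which the variance estimate does not depend on the base learner.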
In summary, the paper proposes a U-learning framework for quantifying predictive uncertainty with high-dimensional inputs, motivated by epigenetics, that attaches confidence intervals to the individual-level predictions of complex machine learning models such as Lasso and DNNs.