ipd: An R Package for Conducting Inference on Predicted Data

Stephen Salerno, Jiacheng Miao, Awan Afiaz, Kentaro Hoffman, Anna Neufeld, Qiongshi Lu, Tyler H. McCormick, Jeffrey T. Leek
DOI: https://doi.org/10.48550/arXiv.2410.09665
2024-10-13
Abstract: Summary: ipd is an open-source R software package for the downstream modeling of an outcome and its associated features where a potentially sizable portion of the outcome data has been imputed by an artificial intelligence or machine learning (AI/ML) prediction algorithm. The package implements several recently proposed methods for inference on predicted data (IPD) with a single, user-friendly wrapper function, ipd. The package also provides custom print, summary, tidy, glance, and augment methods to facilitate easy model inspection. This document introduces the ipd software package and provides a demonstration of its basic usage. Availability: ipd is freely available on CRAN or as a developer version at our GitHub page: http://github.com/ipd-tools/ipd. Full documentation, including detailed instructions and a usage vignette, is available at http://github.com/ipd-tools/ipd. Contact: jtleek@fredhutch.org and tylermc@uw.edu
Methodology, Computation
What problem does this paper attempt to address?
The core problem this paper addresses is how to handle bias and uncertainty when statistical inference is carried out on data predicted by artificial intelligence or machine learning (AI/ML) algorithms. Specifically, the paper focuses on the following aspects:

1. **Understanding the relationship between predicted and true, unobserved outcomes**: When some outcomes are predicted by an AI/ML algorithm, the predicted values may not agree with the true values. It is therefore necessary to understand how this discrepancy affects subsequent statistical analysis.
2. **Quantifying the robustness of AI/ML models to resampling or uncertainty about the training data**: The predictions of an AI/ML model depend on the training data and their distribution. To ensure reliable inference, the model's performance must be evaluated on different data sets and its uncertainty quantified.
3. **Correctly propagating bias and uncertainty from the upstream prediction model to downstream inference**: When statistical inference is based on predicted data, the errors and uncertainty of the upstream prediction model must be carried through to the downstream analysis to avoid biased estimates and anti-conservative inference.

To address these problems, the paper introduces an R package named `ipd`, which implements several recently proposed methods for Inference on Predicted Data (IPD). These methods include, but are not limited to:

- **PostPI (Post-prediction Inference)**: Proposed by Wang et al.; corrects inference based on predicted outcomes.
- **PPI (Prediction-Powered Inference) and PPI++**: Proposed by Angelopoulos et al.; aim to improve the efficiency of inference based on predicted data.
- **PSPA (Post-prediction Adaptive Inference)**: Proposed by Miao et al.; provides an adaptive inference method.
- **PPBoot (Prediction-Powered Bootstrap)**: Proposed by Zrnic; uses the bootstrap to assess the uncertainty of inference on predicted data.
- **Cross-PPI (Cross-Prediction-Powered Inference)**: Proposed by Zrnic and Candès; performs inference by combining prediction-powered inference with the idea of cross-validation.

With these methods, the `ipd` package helps researchers draw more reliable inferences by accounting for the bias and uncertainty that predicted data introduce into a statistical analysis.

### Formula Representation

The formulas in the paper mainly describe linear regression models and the propagation of prediction error. For example, in the linear regression setting, the outcome is assumed to follow:

\[ Y = \beta_1 X_1 + \frac{1}{2} X_2^2 + \frac{1}{3} X_3^3 + \frac{1}{4} X_4^2 + \epsilon \]

where \(X_1, X_2, X_3, X_4 \sim N(0, 1)\), \(\beta_1 = 1\), \(\epsilon \sim N(0, \sigma_Y^2)\), and \(\sigma_Y = 4\). For each method, the point estimate and the corresponding \(100(1 - \alpha)\%\) confidence interval are computed, with \(\alpha = 0.05\).
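
The simulation model and the interval construction described above can be illustrated with a small, self-contained sketch. It is written in plain Python rather than R and does not call the `ipd` package or reproduce its implementation. It draws data from the stated model, applies a stand-in predictor, and forms a rectified slope estimate in the spirit of prediction-powered inference: the slope fitted to predictions on unlabeled data, plus the slope of the prediction error \(Y - f\) on labeled data. The function names (`simulate`, `predict`, `slope_and_var`, `ppi_slope`), the form of the predictor, the sample sizes, and the HC0 sandwich variance are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean


def simulate(n, rng, beta1=1.0, sigma_y=4.0):
    """Draw n rows (x1, x2, y) from the simulation model stated above."""
    rows = []
    for _ in range(n):
        x1, x2, x3, x4 = (rng.gauss(0.0, 1.0) for _ in range(4))
        eps = rng.gauss(0.0, sigma_y)
        y = beta1 * x1 + x2**2 / 2 + x3**3 / 3 + x4**2 / 4 + eps
        rows.append((x1, x2, y))
    return rows


def predict(x1, x2):
    """Hypothetical stand-in for an upstream AI/ML model (deliberately imperfect)."""
    return 0.7 * x1 + 0.5 * x2**2


def slope_and_var(x, z):
    """OLS slope of z on x (with intercept) and its HC0 sandwich variance."""
    x_bar, z_bar = fmean(x), fmean(z)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / sxx
    resid = [(zi - z_bar) - b * (xi - x_bar) for xi, zi in zip(x, z)]
    var = sum(((xi - x_bar) * ri) ** 2 for xi, ri in zip(x, resid)) / sxx**2
    return b, var


def ppi_slope(labeled, unlabeled, alpha=0.05):
    """Rectified slope on X1 with a normal-approximation confidence interval."""
    # Prediction error Y - f on the labeled rows (the "rectifier" response).
    x_lab = [r[0] for r in labeled]
    gap = [r[2] - predict(r[0], r[1]) for r in labeled]
    # Predictions on the unlabeled rows; their observed outcomes are never used.
    x_unl = [r[0] for r in unlabeled]
    f_unl = [predict(r[0], r[1]) for r in unlabeled]

    naive, v_naive = slope_and_var(x_unl, f_unl)  # slope using predictions only
    b_gap, v_gap = slope_and_var(x_lab, gap)      # correction from labeled data
    est = naive + b_gap
    se = math.sqrt(v_naive + v_gap)               # the two samples are independent
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return est, (est - z * se, est + z * se), naive


if __name__ == "__main__":
    rng = random.Random(2024)
    labeled = simulate(2000, rng)     # outcomes observed
    unlabeled = simulate(20000, rng)  # outcomes treated as unobserved
    est, (lo, hi), naive = ppi_slope(labeled, unlabeled)
    print(f"naive slope (predictions only): {naive:.3f}")
    print(f"rectified slope: {est:.3f}, 95% CI: ({lo:.3f}, {hi:.3f})")
```

Because the stand-in predictor shrinks the effect of \(X_1\), the predictions-only slope concentrates near 0.7 rather than the true \(\beta_1 = 1\), while the labeled-data correction recenters the estimate at the cost of a wider interval. The methods implemented in `ipd` differ in how they construct and weight this kind of correction.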