Abstract:

Summary: ipd is an open-source R software package for the downstream modeling of an outcome and its associated features where a potentially sizable portion of the outcome data has been imputed by an artificial intelligence or machine learning (AI/ML) prediction algorithm. The package implements several recently proposed methods for inference on predicted data (IPD) with a single, user-friendly wrapper function, ipd. The package also provides custom print, summary, tidy, glance, and augment methods to facilitate easy model inspection. This document introduces the ipd software package and provides a demonstration of its basic usage.

Availability: ipd is freely available on CRAN or as a developer version at our GitHub page: http://github.com/ipd-tools/ipd. Full documentation, including detailed instructions and a usage vignette, is available at http://github.com/ipd-tools/ipd.

Contact: jtleek@fredhutch.org and tylermc@uw.edu

What problem does this paper attempt to address?

The core problem this paper addresses is how to handle bias and uncertainty when statistical inference is carried out on data predicted by artificial intelligence or machine learning (AI/ML) algorithms. Specifically, the paper focuses on three aspects:

1. **Understanding the relationship between predicted outcomes and true, unobserved outcomes**: When some outcomes are predicted by an AI/ML algorithm, the predicted values may not agree with the true values. The effect of this discrepancy on subsequent statistical analysis needs to be understood.
2. **Quantifying the robustness of AI/ML models to resampling or uncertainty in the training data**: The predictions of an AI/ML model depend on the training data and its distribution. Reliable inference requires evaluating the model on different data sets and quantifying the resulting uncertainty.
3. **Correctly propagating the bias and uncertainty of the upstream prediction model into downstream inference**: When inference is based on predicted data, the errors and uncertainty of the upstream prediction model must be carried through to the downstream analysis to avoid biased estimates and anti-conservative inference.

To address these problems, the paper introduces an R package named `ipd`, which implements several recently proposed methods for Inference on Predicted Data (IPD). These methods include, but are not limited to:

- **PostPI (Post-prediction Inference)**: Proposed by Wang et al.; corrects inference based on predicted outcomes.
- **PPI (Prediction-Powered Inference) and PPI++**: Proposed by Angelopoulos et al.; aims to improve the efficiency of inference based on predicted data.
- **PSPA (Post-prediction Adaptive Inference)**: Proposed by Miao et al.; provides an adaptive inference method.
- **PPBoot (Prediction-Powered Bootstrap)**: Proposed by Zrnic; uses the bootstrap to quantify uncertainty when working with predicted data.
- **Cross-PPI (Cross-Prediction-Powered Inference)**: Proposed by Zrnic and Candès; performs inference by combining prediction-powered inference with the idea of cross-validation.

Through these methods, the `ipd` package helps researchers reduce bias and account correctly for uncertainty when using predicted data for statistical analysis.

### Formula Representation

The formulas in the paper mainly describe the linear regression setting and the propagation of prediction error. For example, in the linear regression setting, the outcome is assumed to follow:

\[ Y=\beta_1 X_1+\frac{1}{2} X_2^2+\frac{1}{3} X_3^3+\frac{1}{4} X_4^2+\epsilon \]

where \(X_1, X_2, X_3, X_4 \sim N(0, 1)\), \(\beta_1 = 1\), \(\epsilon \sim N(0, \sigma_Y^2)\) and \(\sigma_Y = 4\). For each method, a point estimate and the corresponding \(100(1 - \alpha)\%\) confidence interval are computed, with \(\alpha = 0.05\).
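The prediction-powered idea behind PPI can be sketched on the simulated setting described above. The block below is a minimal illustration in Python, not the `ipd` package's R interface and not the paper's own code. It targets the mean of \(Y\) rather than a regression coefficient to keep the arithmetic short. The helper names `simulate`, `toy_predictor`, and `ppi_mean`, the sample sizes, and the deliberately biased predictor are all assumptions made for this example. The estimator averages the predictions on the unlabeled data and adds the average labeled residual \(Y - f\) as a correction, with a normal-approximation confidence interval.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def simulate(n, rng, sigma_y=4.0, beta1=1.0):
    """Draw n rows (X1, X2, X3, X4, Y) from the data-generating model in the text."""
    rows = []
    for _ in range(n):
        x1, x2, x3, x4 = (rng.gauss(0.0, 1.0) for _ in range(4))
        y = beta1 * x1 + x2**2 / 2 + x3**3 / 3 + x4**2 / 4 + rng.gauss(0.0, sigma_y)
        rows.append((x1, x2, x3, x4, y))
    return rows


def toy_predictor(x1, x2, x3, x4):
    """Hypothetical stand-in for an upstream AI/ML model (an assumption).

    It ignores X3 and X4 and adds a constant bias of 0.5, so averaging its
    predictions alone gives a biased estimate of E[Y].
    """
    return x1 + x2**2 / 2 + 0.5


def ppi_mean(y_lab, f_lab, f_unl, alpha=0.05):
    """Prediction-powered estimate of E[Y] with a normal-approximation CI.

    y_lab, f_lab: observed outcomes and predictions on the labeled set.
    f_unl: predictions on the unlabeled set.
    """
    n, n_unl = len(y_lab), len(f_unl)
    resid = [y - f for y, f in zip(y_lab, f_lab)]
    # Mean prediction on unlabeled data plus the labeled-residual correction.
    est = fmean(f_unl) + fmean(resid)
    # The two samples are independent, so their variances add.
    se = math.sqrt(variance(f_unl) / n_unl + variance(resid) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return est, se, (est - z * se, est + z * se)


rng = random.Random(2024)
labeled = simulate(500, rng)     # outcomes observed
unlabeled = simulate(5000, rng)  # outcomes treated as unobserved

y_lab = [row[4] for row in labeled]
f_lab = [toy_predictor(*row[:4]) for row in labeled]
f_unl = [toy_predictor(*row[:4]) for row in unlabeled]

est, se, (lo, hi) = ppi_mean(y_lab, f_lab, f_unl)
naive = fmean(f_unl)  # uses predictions only, with no correction

print(f"naive mean of predictions: {naive:.3f}")
print(f"PPI estimate: {est:.3f}, 95% CI: ({lo:.3f}, {hi:.3f})")
```

Under this model the true mean is \(E[Y] = \frac{1}{2} + \frac{1}{4} = 0.75\), while the toy predictor's mean converges to 1.0. The residual term is what removes that bias, at the cost of a wider interval driven by the labeled sample size.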

ipd: An R Package for Conducting Inference on Predicted Data
