Abstract: Predictions under interventions are estimates of what a person's risk of an outcome would be if they were to follow a particular treatment strategy, given their individual characteristics. Such predictions can give important input to medical decision making. However, evaluating predictive performance of interventional predictions is challenging. Standard ways of evaluating predictive performance do not apply when using observational data, because prediction under interventions involves obtaining predictions of the outcome under conditions that are different to those that are observed for a subset of individuals in the validation dataset. This work describes methods for evaluating counterfactual performance of predictions under interventions for time-to-event outcomes. This means we aim to assess how well predictions would match the validation data if all individuals had followed the treatment strategy under which predictions are made. We focus on counterfactual performance evaluation using longitudinal observational data, and under treatment strategies that involve sustaining a particular treatment regime over time. We introduce an estimation approach using artificial censoring and inverse probability weighting which involves creating a validation dataset that mimics the treatment strategy under which predictions are made. We extend measures of calibration, discrimination (c-index and cumulative/dynamic AUCt) and overall prediction error (Brier score) to allow assessment of counterfactual performance. The methods are evaluated using a simulation study, including scenarios in which the methods should detect poor performance. Applying our methods in the context of liver transplantation shows that our procedure allows quantification of the performance of predictions supporting crucial decisions on organ allocation.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper aims to solve the problem of evaluating predictive performance under interventions. Specifically, researchers are concerned with how to use longitudinal observational data to assess the accuracy of individual risk prediction under specific treatment strategies. Standard methods for evaluating predictive performance are not applicable in this scenario because the conditions involved in the prediction are different from those actually observed in some individuals in the validation dataset. Therefore, the paper proposes a new method. By means of artificial censoring and inverse probability weighting techniques, a validation dataset simulating a specific treatment strategy is created, enabling the evaluation of the counterfactual performance of the prediction.

### Key point summary

1. **Background**:
   - Predicting an individual's risk under a specific treatment strategy is crucial for medical decision-making.
   - Standard prediction models cannot provide this information because they are usually based on the observed outcome distribution.
   - Longitudinal observational data (such as electronic health records) are the main data sources for developing these prediction models, but confounding factors need to be dealt with.
2. **Challenges**:
   - The counterfactual outcomes of individuals under different treatment strategies cannot be directly observed.
   - When using observational data to evaluate predictive performance, standard methods are not feasible because the prediction conditions are different from the actual observation conditions.
3. **Solutions**:
   - A method based on artificial censoring and inverse probability weighting is proposed to generate a validation dataset that simulates a specific treatment strategy.
   - Evaluation metrics for calibration, discrimination (such as the c-index and cumulative/dynamic AUCt), and overall prediction error (such as the Brier score) are extended to evaluate counterfactual performance.
4. **Methods**:
   - **Artificial censoring**: In the validation data, when an individual deviates from a specific treatment strategy, their follow-up time is censored.
   - **Inverse probability weighting**: Each individual is weighted so that they represent the situation where all individuals follow a specific treatment strategy.
   - **Performance evaluation**: Weighted Kaplan-Meier analysis and weighted c-index, AUCt, and Brier score are used to evaluate predictive performance.
5. **Application and validation**:
   - The effectiveness of the proposed method was verified through simulation studies.
   - It was applied to liver transplantation data to show how to evaluate predictive performance under different treatment strategies.

### Formula examples

- **Inverse probability censoring weight (IPACW)**:

  \[
  G^{-1}_{a_0}(t \mid \mathbf{L}) = \prod_{s=0}^{\lfloor t \rfloor} \left( \frac{1}{\Pr(A_s = a_s \mid \bar{A}_{s-1} = \bar{a}_{s-1}, \bar{L}_s)} \right)
  \]

- **Weighted Brier score**:

  \[
  \hat{BS}_{a_0}(t) = \frac{1}{n} \sum_{i=1}^{n} \left( I(\tilde{T}_{a_0 i} \leq t) - \hat{R}_{a_0 i}(t \mid \mathbf{X}_i) \right)^2 W^{(2)}_{a_0 i}
  \]

  where

  \[
  W^{(2)}_{a_0 i} = \frac{I(\tilde{T}_{a_0 i} \leq t, \tilde{D}_{a_0 i} = 1)}{\hat{G}_{a_0 c}(\tilde{T}_{a_0 i} \mid \mathbf{L}_i)} + \frac{I(\tilde{T}_{a_0 i} > t)}{\hat{G}_{a_0 c}(t \mid \mathbf{L}_i)}
  \]

### Conclusion

The paper proposes a new method that can evaluate the predictive performance under specific treatment strategies in longitudinal observational data. This method creates a validation dataset that simulates a specific treatment strategy through artificial censoring and inverse probability weighting techniques and extends existing performance evaluation metrics. Through simulation studies and practical applications, the effectiveness and practicality of this method are proven. This provides important support for medical decision-making, especially in...
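
### Illustrative sketch

The artificial censoring and inverse probability weighting steps summarized above, together with the weighted Brier score formula, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`first_deviation_time`, `artificially_censor`, `prob_uncensored`, `weighted_brier`) are hypothetical, the strategy is taken to be "sustain one treatment value at every visit", visits are assumed to occur at integer times, and the per-visit probabilities of following the strategy (`prob_follow`) are assumed to have been estimated beforehand by some treatment model. Only censoring due to deviation from the strategy is weighted here; any additional model for standard censoring is omitted.

```python
import numpy as np


def first_deviation_time(treatment, strategy_value):
    """Per individual, the first visit index at which observed treatment
    differs from the strategy (np.inf if they never deviate)."""
    deviates = treatment != strategy_value            # shape (n, n_visits)
    first = np.argmax(deviates, axis=1).astype(float)
    first[~deviates.any(axis=1)] = np.inf
    return first


def artificially_censor(event_time, event, deviation_time):
    """Censor follow-up at the first deviation from the strategy.
    Assumed convention: an event at or before the deviation time is kept."""
    censored_time = np.minimum(event_time, deviation_time)
    censored_event = np.where(event_time <= deviation_time, event, 0)
    return censored_time, censored_event


def prob_uncensored(prob_follow, time):
    """G(t) = prod_{s=0}^{floor(t)} Pr(A_s = a_s | history), per individual.
    `time` may be a scalar or an array of length n."""
    n, n_visits = prob_follow.shape
    cumulative = np.cumprod(prob_follow, axis=1)
    idx = np.floor(np.asarray(time, dtype=float)).astype(int)
    idx = np.minimum(idx, n_visits - 1)               # stay within observed visits
    return cumulative[np.arange(n), idx]


def weighted_brier(t, time, event, risk, prob_follow):
    """Weighted Brier score at horizon t on the artificially censored data."""
    had_event = (time <= t) & (event == 1)
    still_at_risk = time > t
    weight = (had_event / prob_uncensored(prob_follow, time)
              + still_at_risk / prob_uncensored(prob_follow, t))
    outcome = (time <= t).astype(float)
    return np.mean(weight * (outcome - risk) ** 2)


# Toy data: 4 individuals, 3 visits, strategy "never treated" (value 0).
treatment = np.array([[0, 0, 0],
                      [0, 1, 1],
                      [0, 0, 0],
                      [1, 1, 1]])
event_time = np.array([2.5, 2.0, 3.0, 1.5])
event = np.array([1, 1, 0, 1])
prob_follow = np.full((4, 3), 0.8)   # assumed output of a treatment model
risk = np.full(4, 0.5)               # predicted risk by the horizon under the strategy

deviation = first_deviation_time(treatment, strategy_value=0)
cens_time, cens_event = artificially_censor(event_time, event, deviation)
brier = weighted_brier(2.8, cens_time, cens_event, risk, prob_follow)
print(brier)
```

Individuals who deviate before the horizon receive weight zero, and those who keep following the strategy are up-weighted by the inverse of their probability of doing so, which is how the weighted sample stands in for a population in which everyone follows the strategy.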

Prediction under interventions: evaluation of counterfactual performance using longitudinal observational data

Causal Inference and Counterfactual Prediction in Machine Learning for Actionable Healthcare

Risk‐Based Decision Making: Estimands for Sequential Prediction Under Interventions

Predicting Counterfactuals from Large Historical Data and Small Randomized Trials

Leveraging Clinical Time-Series Data for Prediction: A Cautionary Tale

Prediction meets causal inference: the role of treatment in clinical prediction models

Counterfactual Prediction Under Outcome Measurement Error

Catalytic asymmetric nitroso-Diels-Alder reaction with acyclic dienes.

Counterfactual Prediction for Outcome-oriented Treatments

A causal viewpoint on prediction model performance under changes in case-mix: discrimination and calibration respond differently for prognosis and diagnosis predictions

A Threshold-free Prospective Prediction Accuracy Measure for Censored Time to Event Data

Evaluation of adaptive treatment strategies in an observational study where time-varying covariates are not monitored systematically

Conformal Counterfactual Inference under Hidden Confounding

Hyponatremia to be an excellent predictor of outcome in patients with advanced cirrhosis.

Performance and Application of Estimators for the Value of an Optimal Dynamic Treatment Rule

A tutorial on evaluating time-varying discrimination accuracy for survival models used in dynamic decision-making

DeepRite: Deep Recurrent Inverse TreatmEnt Weighting for Adjusting Time-varying Confounding in Modern Longitudinal Observational Data

Outcome Prediction in Clinical Treatment Processes

Unmasking Bias: A Framework for Evaluating Treatment Benefit Predictors Using Observational Studies

Uncertainty-Aware Optimal Treatment Selection for Clinical Time Series

When impact trials are not feasible: alternatives to study the impact of prediction models on clinical practice